* [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
@ 2025-03-06 18:18 Andrea Righi
2025-03-06 18:18 ` [PATCH 1/4] sched_ext: idle: Honor idle flags in the built-in idle selection policy Andrea Righi
` (5 more replies)
0 siblings, 6 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:18 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: bpf, linux-kernel
Many scx schedulers define their own concept of scheduling domains to
represent topology characteristics, such as heterogeneous architectures
(e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
specific properties (e.g., setting the soft-affinity of certain tasks to a
subset of CPUs).
Currently, there is no mechanism to share these domains with the built-in
idle CPU selection policy. As a result, schedulers often implement their
own idle CPU selection policies, which are typically similar to one
another, leading to a lot of code duplication.
To address this, extend the built-in idle CPU selection policy by
introducing the concept of preferred CPUs.
With this concept, BPF schedulers can apply the built-in idle CPU selection
policy to a subset of preferred CPUs, allowing them to implement their own
scheduling domains while still using the topology optimizations
of the built-in policy, preventing code duplication across
different schedulers.
To implement this, introduce a new helper kfunc scx_bpf_select_cpu_pref()
that allows BPF schedulers to specify a cpumask of preferred CPUs:
  s32 scx_bpf_select_cpu_pref(struct task_struct *p,
                              const struct cpumask *preferred_cpus,
                              s32 prev_cpu, u64 wake_flags, u64 flags);
Moreover, introduce the new idle flag %SCX_PICK_IDLE_IN_PREF that can be
used to enforce selection strictly within the preferred domain.
Example usage
=============
s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *dom = task_domain(p) ?: p->cpus_ptr;
	s32 cpu;

	/*
	 * Pick an idle CPU in the task's domain. If no CPU is found,
	 * extend the search outside the domain.
	 */
	cpu = scx_bpf_select_cpu_pref(p, dom, prev_cpu, wake_flags, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}
Results
=======
Load distribution on a 4 sockets / 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses the
new helper scx_bpf_select_cpu_pref() and 0xff00 as preferred domain:
$ vng --cpu 16,sockets=4,cores=4,threads=1
Starting 12 CPU hogs to fill the preferred domain:
$ stress-ng -c 12
...
0[|||||||||||||||||||||||100.0%] 8[||||||||||||||||||||||||100.0%]
1[| 1.3%] 9[||||||||||||||||||||||||100.0%]
2[|||||||||||||||||||||||100.0%] 10[||||||||||||||||||||||||100.0%]
3[|||||||||||||||||||||||100.0%] 11[||||||||||||||||||||||||100.0%]
4[|||||||||||||||||||||||100.0%] 12[||||||||||||||||||||||||100.0%]
5[|| 2.6%] 13[||||||||||||||||||||||||100.0%]
6[| 0.6%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
Passing %SCX_PICK_IDLE_IN_PREF to scx_bpf_select_cpu_pref() to enforce
strict selection on the preferred CPUs (with the same workload):
0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
Andrea Righi (4):
sched_ext: idle: Honor idle flags in the built-in idle selection policy
sched_ext: idle: Introduce the concept of preferred CPUs
sched_ext: idle: Introduce scx_bpf_select_cpu_pref()
selftests/sched_ext: Add test for scx_bpf_select_cpu_pref()
kernel/sched/ext.c | 4 +-
kernel/sched/ext_idle.c | 235 ++++++++++++++++++----
kernel/sched/ext_idle.h | 3 +-
tools/sched_ext/include/scx/common.bpf.h | 2 +
tools/sched_ext/include/scx/compat.h | 1 +
tools/testing/selftests/sched_ext/Makefile | 1 +
tools/testing/selftests/sched_ext/pref_cpus.bpf.c | 95 +++++++++
tools/testing/selftests/sched_ext/pref_cpus.c | 58 ++++++
8 files changed, 354 insertions(+), 45 deletions(-)
create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.c
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/4] sched_ext: idle: Honor idle flags in the built-in idle selection policy
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
@ 2025-03-06 18:18 ` Andrea Righi
2025-03-06 18:18 ` [PATCH 2/4] sched_ext: idle: Introduce the concept of preferred CPUs Andrea Righi
` (4 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:18 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: bpf, linux-kernel
Enable passing idle flags (%SCX_PICK_IDLE_*) to scx_select_cpu_dfl(),
to enforce strict selection criteria, such as selecting an idle CPU
strictly within @prev_cpu's node or choosing only a fully idle SMT core.
This functionality will be exposed through a dedicated kfunc in a
separate patch.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 2 +-
kernel/sched/ext_idle.c | 41 ++++++++++++++++++++++++++++++-----------
kernel/sched/ext_idle.h | 2 +-
3 files changed, 32 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index debcd1cf2de9b..5cd878bbd0e39 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3396,7 +3396,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
bool found;
s32 cpu;
- cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, &found);
+ cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, 0, &found);
p->scx.selected_cpu = cpu;
if (found) {
p->scx.slice = SCX_SLICE_DFL;
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 15e9d1c8b2815..16981456ec1ed 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -418,7 +418,7 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops)
* NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
* we never call ops.select_cpu() for them, see select_task_rq().
*/
-s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found)
+s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64 flags, bool *found)
{
const struct cpumask *llc_cpus = NULL;
const struct cpumask *numa_cpus = NULL;
@@ -455,12 +455,13 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
* If WAKE_SYNC, try to migrate the wakee to the waker's CPU.
*/
if (wake_flags & SCX_WAKE_SYNC) {
- cpu = smp_processor_id();
+ int waker_node;
/*
* If the waker's CPU is cache affine and prev_cpu is idle,
* then avoid a migration.
*/
+ cpu = smp_processor_id();
if (cpus_share_cache(cpu, prev_cpu) &&
scx_idle_test_and_clear_cpu(prev_cpu)) {
cpu = prev_cpu;
@@ -480,9 +481,11 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
* piled up on it even if there is an idle core elsewhere on
* the system.
*/
+ waker_node = cpu_to_node(cpu);
if (!(current->flags & PF_EXITING) &&
cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
- !cpumask_empty(idle_cpumask(cpu_to_node(cpu))->cpu)) {
+ (!(flags & SCX_PICK_IDLE_IN_NODE) || (waker_node == node)) &&
+ !cpumask_empty(idle_cpumask(waker_node)->cpu)) {
if (cpumask_test_cpu(cpu, p->cpus_ptr))
goto cpu_found;
}
@@ -521,15 +524,25 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
}
/*
- * Search for any full idle core usable by the task.
+ * Search for any full-idle core usable by the task.
*
- * If NUMA aware idle selection is enabled, the search will
+ * If the node-aware idle CPU selection policy is enabled
+ * (%SCX_OPS_BUILTIN_IDLE_PER_NODE), the search will always
* begin in prev_cpu's node and proceed to other nodes in
* order of increasing distance.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, node, SCX_PICK_IDLE_CORE);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags | SCX_PICK_IDLE_CORE);
if (cpu >= 0)
goto cpu_found;
+
+ /*
+ * Give up if we're strictly looking for a full-idle SMT
+ * core.
+ */
+ if (flags & SCX_PICK_IDLE_CORE) {
+ cpu = prev_cpu;
+ goto out_unlock;
+ }
}
/*
@@ -560,18 +573,24 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool
/*
* Search for any idle CPU usable by the task.
+ *
+ * If the node-aware idle CPU selection policy is enabled
+ * (%SCX_OPS_BUILTIN_IDLE_PER_NODE), the search will always begin
+ * in prev_cpu's node and proceed to other nodes in order of
+ * increasing distance.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, node, 0);
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags);
if (cpu >= 0)
goto cpu_found;
- rcu_read_unlock();
- return prev_cpu;
+ cpu = prev_cpu;
+ goto out_unlock;
cpu_found:
+ *found = true;
+out_unlock:
rcu_read_unlock();
- *found = true;
return cpu;
}
@@ -810,7 +829,7 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto prev_cpu;
#ifdef CONFIG_SMP
- return scx_select_cpu_dfl(p, prev_cpu, wake_flags, is_idle);
+ return scx_select_cpu_dfl(p, prev_cpu, wake_flags, 0, is_idle);
#endif
prev_cpu:
diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h
index 68c4307ce4f6f..5c1db6b315f7a 100644
--- a/kernel/sched/ext_idle.h
+++ b/kernel/sched/ext_idle.h
@@ -27,7 +27,7 @@ static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node
}
#endif /* CONFIG_SMP */
-s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *found);
+s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64 flags, bool *found);
void scx_idle_enable(struct sched_ext_ops *ops);
void scx_idle_disable(void);
int scx_idle_init(void);
--
2.48.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/4] sched_ext: idle: Introduce the concept of preferred CPUs
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
2025-03-06 18:18 ` [PATCH 1/4] sched_ext: idle: Honor idle flags in the built-in idle selection policy Andrea Righi
@ 2025-03-06 18:18 ` Andrea Righi
2025-03-06 18:18 ` [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref() Andrea Righi
` (3 subsequent siblings)
5 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:18 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: bpf, linux-kernel
Many scx schedulers define their own concept of scheduling domains to
represent topology characteristics, such as heterogeneous architectures
(e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
specific properties (e.g., setting the soft-affinity of certain tasks to
a subset of CPUs).
Currently, there is no mechanism to share these domains with the
built-in idle CPU selection policy. As a result, schedulers often
implement their own idle CPU selection policies, which are typically
similar to one another, leading to a lot of code duplication.
To address this, introduce the concept of preferred domain (represented
as a cpumask) that can be used by BPF schedulers to apply the
built-in idle CPU selection policy to a subset of preferred CPUs.
With this concept, the idle CPU selection policy becomes the following:
- always prioritize CPUs from fully idle SMT cores (if SMT is enabled),
- select the same CPU if it's idle and in the preferred domain,
- select an idle CPU within the same LLC domain, if the LLC domain is a
subset of the preferred domain,
- select an idle CPU within the same node, if the node domain is a
subset of the preferred domain,
- select an idle CPU within the preferred domain,
- select any idle CPU usable by the task.
Moreover, introduce the new idle flag %SCX_PICK_IDLE_IN_PREF, which
enforces strict selection within the preferred domain. Without this
flag, the preferred domain is treated as a soft constraint: idle CPUs
outside the preferred domain can be considered if the preferred domain
is fully busy.
If the preferred domain is empty or NULL, the behavior of the built-in
idle CPU selection policy remains unchanged.
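The soft vs strict semantics described above can be sketched with a small
toy model in plain C. This is only an illustration of the fallback
ordering, not the kernel implementation (which also walks SMT, LLC and
NUMA levels); the function name pick_idle_cpu() and the 16-bit masks are
hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of preferred-domain CPU selection over 16-bit CPU masks.
 * Bit N set in @idle means CPU N is idle; @allowed is the task's
 * affinity mask; @preferred is the preferred domain.
 */
static int pick_idle_cpu(uint16_t idle, uint16_t allowed,
			 uint16_t preferred, bool strict)
{
	uint16_t mask;

	/* First pass: idle CPUs inside the preferred domain. */
	mask = idle & allowed & preferred;
	if (mask)
		return __builtin_ctz(mask);

	/* Soft constraint: fall back to any allowed idle CPU. */
	if (!strict) {
		mask = idle & allowed;
		if (mask)
			return __builtin_ctz(mask);
	}

	/* Strict mode: give up when the preferred domain is busy. */
	return -1;
}
```

With strict=false the preferred domain only biases the search; with
strict=true (the %SCX_PICK_IDLE_IN_PREF behavior) a busy preferred domain
means no CPU is returned at all.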
This only introduces the core concept of preferred domain. This
functionality will be exposed through a dedicated kfunc in a separate
patch.
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 3 +-
kernel/sched/ext_idle.c | 142 ++++++++++++++++++++-------
kernel/sched/ext_idle.h | 3 +-
tools/sched_ext/include/scx/compat.h | 1 +
4 files changed, 111 insertions(+), 38 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5cd878bbd0e39..a28ddd7655ba8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -807,6 +807,7 @@ enum scx_deq_flags {
enum scx_pick_idle_cpu_flags {
SCX_PICK_IDLE_CORE = 1LLU << 0, /* pick a CPU whose SMT siblings are also idle */
SCX_PICK_IDLE_IN_NODE = 1LLU << 1, /* pick a CPU in the same target NUMA node */
+ SCX_PICK_IDLE_IN_PREF = 1LLU << 2, /* pick a CPU in the preferred domain */
};
enum scx_kick_flags {
@@ -3396,7 +3397,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
bool found;
s32 cpu;
- cpu = scx_select_cpu_dfl(p, prev_cpu, wake_flags, 0, &found);
+ cpu = scx_select_cpu_dfl(p, NULL, prev_cpu, wake_flags, 0, &found);
p->scx.selected_cpu = cpu;
if (found) {
p->scx.slice = SCX_SLICE_DFL;
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 16981456ec1ed..9b002e109404b 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -46,6 +46,11 @@ static struct scx_idle_cpus scx_idle_global_masks;
*/
static struct scx_idle_cpus **scx_idle_node_masks;
+/*
+ * Local per-CPU cpumasks (used to generate temporary idle cpumasks).
+ */
+static DEFINE_PER_CPU(cpumask_var_t, local_idle_cpumask);
+
/*
* Return the idle masks associated to a target @node.
*
@@ -403,52 +408,80 @@ void scx_idle_update_selcpu_topology(struct sched_ext_ops *ops)
* branch prediction optimizations.
*
* 3. Pick a CPU within the same LLC (Last-Level Cache):
- * - if the above conditions aren't met, pick a CPU that shares the same LLC
- * to maintain cache locality.
+ * - if the above conditions aren't met, pick a CPU that shares the same
+ * LLC, if the LLC domain is a subset of @preferred_cpus, to maintain
+ * cache locality.
*
* 4. Pick a CPU within the same NUMA node, if enabled:
- * - choose a CPU from the same NUMA node to reduce memory access latency.
+ * - choose a CPU from the same NUMA node, if the node domain is a subset
+ * of @preferred_cpus, to reduce memory access latency.
+ *
+ * 5. Pick a CPU within @preferred_cpus.
*
- * 5. Pick any idle CPU usable by the task.
+ * 6. Pick any idle CPU usable by the task.
*
* Step 3 and 4 are performed only if the system has, respectively, multiple
* LLC domains / multiple NUMA nodes (see scx_selcpu_topo_llc and
- * scx_selcpu_topo_numa).
+ * scx_selcpu_topo_numa) and their domains don't overlap.
+ *
+ * If %SCX_OPS_BUILTIN_IDLE_PER_NODE is enabled, the search will always
+ * begin in @prev_cpu's node and proceed to other nodes in order of
+ * increasing distance.
+ *
+ * Return the picked CPU with *@found set, indicating whether the picked
+ * CPU is currently idle, or a negative value otherwise.
*
* NOTE: tasks that can only run on 1 CPU are excluded by this logic, because
* we never call ops.select_cpu() for them, see select_task_rq().
*/
-s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64 flags, bool *found)
+s32 scx_select_cpu_dfl(struct task_struct *p, const struct cpumask *preferred_cpus,
+ s32 prev_cpu, u64 wake_flags, u64 flags, bool *found)
{
const struct cpumask *llc_cpus = NULL;
const struct cpumask *numa_cpus = NULL;
- int node = scx_cpu_node_if_enabled(prev_cpu);
+ int node;
s32 cpu;
*found = false;
+ /*
+ * If @prev_cpu is not in the preferred domain, try to assign a new
+ * arbitrary CPU in the preferred domain.
+ */
+ if (preferred_cpus && !cpumask_test_cpu(prev_cpu, preferred_cpus)) {
+ cpu = cpumask_any_and_distribute(p->cpus_ptr, preferred_cpus);
+ if (cpu < nr_cpu_ids) {
+ prev_cpu = cpu;
+ node = scx_cpu_node_if_enabled(prev_cpu);
+ }
+ } else {
+ node = scx_cpu_node_if_enabled(prev_cpu);
+ }
+
/*
* This is necessary to protect llc_cpus.
*/
rcu_read_lock();
/*
- * Determine the scheduling domain only if the task is allowed to run
- * on all CPUs.
- *
- * This is done primarily for efficiency, as it avoids the overhead of
- * updating a cpumask every time we need to select an idle CPU (which
- * can be costly in large SMP systems), but it also aligns logically:
- * if a task's scheduling domain is restricted by user-space (through
- * CPU affinity), the task will simply use the flat scheduling domain
- * defined by user-space.
+ * Consider node/LLC scheduling domains only if the preferred
+ * cpumask contains all the CPUs of each particular domain and if
+ * the domains don't overlap.
*/
- if (p->nr_cpus_allowed >= num_possible_cpus()) {
- if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa))
- numa_cpus = numa_span(prev_cpu);
+ if (static_branch_maybe(CONFIG_NUMA, &scx_selcpu_topo_numa)) {
+ const struct cpumask *cpus = numa_span(prev_cpu);
+ const struct cpumask *pref = preferred_cpus ?: p->cpus_ptr;
- if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc))
- llc_cpus = llc_span(prev_cpu);
+ if (!cpumask_equal(cpus, pref) && cpumask_subset(cpus, pref))
+ numa_cpus = cpus;
+ }
+
+ if (static_branch_maybe(CONFIG_SCHED_MC, &scx_selcpu_topo_llc)) {
+ const struct cpumask *cpus = llc_span(prev_cpu);
+ const struct cpumask *pref = preferred_cpus ?: p->cpus_ptr;
+
+ if (!cpumask_equal(cpus, pref) && cpumask_subset(cpus, pref))
+ llc_cpus = cpus;
}
/*
@@ -486,7 +519,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
(!(flags & SCX_PICK_IDLE_IN_NODE) || (waker_node == node)) &&
!cpumask_empty(idle_cpumask(waker_node)->cpu)) {
- if (cpumask_test_cpu(cpu, p->cpus_ptr))
+ if (cpumask_test_cpu(cpu, preferred_cpus ?: p->cpus_ptr))
goto cpu_found;
}
}
@@ -523,6 +556,20 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
goto cpu_found;
}
+ /*
+ * Search for any full-idle core in the preferred domain.
+ *
+ * If the node-aware idle CPU selection policy is enabled
+ * (%SCX_OPS_BUILTIN_IDLE_PER_NODE), the search will always
+ * begin in prev_cpu's node and proceed to other nodes in
+ * order of increasing distance.
+ */
+ if (preferred_cpus) {
+ cpu = scx_pick_idle_cpu(preferred_cpus, node, flags | SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
/*
* Search for any full-idle core usable by the task.
*
@@ -531,9 +578,11 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
* begin in prev_cpu's node and proceed to other nodes in
* order of increasing distance.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags | SCX_PICK_IDLE_CORE);
- if (cpu >= 0)
- goto cpu_found;
+ if (!(flags & SCX_PICK_IDLE_IN_PREF)) {
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags | SCX_PICK_IDLE_CORE);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
/*
* Give up if we're strictly looking for a full-idle SMT
@@ -571,6 +620,20 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
goto cpu_found;
}
+ /*
+ * Search for any idle CPU in the preferred domain.
+ *
+ * If the node-aware idle CPU selection policy is enabled
+ * (%SCX_OPS_BUILTIN_IDLE_PER_NODE), the search will always begin
+ * in prev_cpu's node and proceed to other nodes in order of
+ * increasing distance.
+ */
+ if (preferred_cpus) {
+ cpu = scx_pick_idle_cpu(preferred_cpus, node, flags);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
+
/*
* Search for any idle CPU usable by the task.
*
@@ -579,9 +642,11 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
* in prev_cpu's node and proceed to other nodes in order of
* increasing distance.
*/
- cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags);
- if (cpu >= 0)
- goto cpu_found;
+ if (!(flags & SCX_PICK_IDLE_IN_PREF)) {
+ cpu = scx_pick_idle_cpu(p->cpus_ptr, node, flags);
+ if (cpu >= 0)
+ goto cpu_found;
+ }
cpu = prev_cpu;
goto out_unlock;
@@ -599,7 +664,7 @@ s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64
*/
void scx_idle_init_masks(void)
{
- int node;
+ int i;
/* Allocate global idle cpumasks */
BUG_ON(!alloc_cpumask_var(&scx_idle_global_masks.cpu, GFP_KERNEL));
@@ -610,14 +675,19 @@ void scx_idle_init_masks(void)
sizeof(*scx_idle_node_masks), GFP_KERNEL);
BUG_ON(!scx_idle_node_masks);
- for_each_node(node) {
- scx_idle_node_masks[node] = kzalloc_node(sizeof(**scx_idle_node_masks),
- GFP_KERNEL, node);
- BUG_ON(!scx_idle_node_masks[node]);
+ for_each_node(i) {
+ scx_idle_node_masks[i] = kzalloc_node(sizeof(**scx_idle_node_masks),
+ GFP_KERNEL, i);
+ BUG_ON(!scx_idle_node_masks[i]);
- BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->cpu, GFP_KERNEL, node));
- BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[node]->smt, GFP_KERNEL, node));
+ BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[i]->cpu, GFP_KERNEL, i));
+ BUG_ON(!alloc_cpumask_var_node(&scx_idle_node_masks[i]->smt, GFP_KERNEL, i));
}
+
+ /* Allocate local per-cpu idle cpumasks */
+ for_each_possible_cpu(i)
+ BUG_ON(!alloc_cpumask_var_node(&per_cpu(local_idle_cpumask, i),
+ GFP_KERNEL, cpu_to_node(i)));
}
static void update_builtin_idle(int cpu, bool idle)
@@ -829,7 +899,7 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
goto prev_cpu;
#ifdef CONFIG_SMP
- return scx_select_cpu_dfl(p, prev_cpu, wake_flags, 0, is_idle);
+ return scx_select_cpu_dfl(p, NULL, prev_cpu, wake_flags, 0, is_idle);
#endif
prev_cpu:
diff --git a/kernel/sched/ext_idle.h b/kernel/sched/ext_idle.h
index 5c1db6b315f7a..386bde7e8ee3e 100644
--- a/kernel/sched/ext_idle.h
+++ b/kernel/sched/ext_idle.h
@@ -27,7 +27,8 @@ static inline s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, int node
}
#endif /* CONFIG_SMP */
-s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, u64 flags, bool *found);
+s32 scx_select_cpu_dfl(struct task_struct *p, const struct cpumask *preferred_cpus,
+ s32 prev_cpu, u64 wake_flags, u64 flags, bool *found);
void scx_idle_enable(struct sched_ext_ops *ops);
void scx_idle_disable(void);
int scx_idle_init(void);
diff --git a/tools/sched_ext/include/scx/compat.h b/tools/sched_ext/include/scx/compat.h
index 35c67c5174ac0..f9c06079b3a86 100644
--- a/tools/sched_ext/include/scx/compat.h
+++ b/tools/sched_ext/include/scx/compat.h
@@ -120,6 +120,7 @@ static inline bool __COMPAT_struct_has_field(const char *type, const char *field
#define SCX_PICK_IDLE_CORE SCX_PICK_IDLE_FLAG(SCX_PICK_IDLE_CORE)
#define SCX_PICK_IDLE_IN_NODE SCX_PICK_IDLE_FLAG(SCX_PICK_IDLE_IN_NODE)
+#define SCX_PICK_IDLE_IN_PREF SCX_PICK_IDLE_FLAG(SCX_PICK_IDLE_IN_PREF)
static inline long scx_hotplug_seq(void)
{
--
2.48.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref()
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
2025-03-06 18:18 ` [PATCH 1/4] sched_ext: idle: Honor idle flags in the built-in idle selection policy Andrea Righi
2025-03-06 18:18 ` [PATCH 2/4] sched_ext: idle: Introduce the concept of preferred CPUs Andrea Righi
@ 2025-03-06 18:18 ` Andrea Righi
2025-03-07 3:15 ` Changwoo Min
2025-03-06 18:18 ` [PATCH 4/4] selftests/sched_ext: Add test for scx_bpf_select_cpu_pref() Andrea Righi
` (2 subsequent siblings)
5 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:18 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: bpf, linux-kernel
Provide a new kfunc that can be used to apply the built-in idle CPU
selection policy to a subset of preferred CPUs:
  s32 scx_bpf_select_cpu_pref(struct task_struct *p,
                              const struct cpumask *preferred_cpus,
                              s32 prev_cpu, u64 wake_flags, u64 flags);
This new helper is basically an extension of scx_bpf_select_cpu_dfl().
However, when an idle CPU can't be found, it returns a negative value
instead of @prev_cpu, aligning its behavior more closely with
scx_bpf_pick_idle_cpu().
It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
strict selection to the preferred CPUs (%SCX_PICK_IDLE_IN_PREF) or to
@prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a
full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in
selection logic.
With this helper, BPF schedulers can apply the built-in idle CPU
selection policy to a generic CPU domain, with strict or soft selection
requirements.
In the future we can also consider deprecating scx_bpf_select_cpu_dfl()
and replacing it with scx_bpf_select_cpu_pref(), as the latter provides
the same functionality, with the addition of the preferred domain logic.
Example usage
=============
Possible usage in ops.select_cpu():
s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	const struct cpumask *dom = task_domain(p) ?: p->cpus_ptr;
	s32 cpu;

	/*
	 * Pick an idle CPU in the task's domain. If no CPU is found,
	 * extend the search outside the domain.
	 */
	cpu = scx_bpf_select_cpu_pref(p, dom, prev_cpu, wake_flags, 0);
	if (cpu >= 0) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	return prev_cpu;
}
Results
=======
Load distribution on a 4 sockets / 4 cores per socket system, simulated
using virtme-ng, running a modified version of scx_bpfland that uses the
new helper scx_bpf_select_cpu_pref() and 0xff00 as preferred domain:
$ vng --cpu 16,sockets=4,cores=4,threads=1
Starting 12 CPU hogs to fill the preferred domain:
$ stress-ng -c 12
...
0[|||||||||||||||||||||||100.0%] 8[||||||||||||||||||||||||100.0%]
1[| 1.3%] 9[||||||||||||||||||||||||100.0%]
2[|||||||||||||||||||||||100.0%] 10[||||||||||||||||||||||||100.0%]
3[|||||||||||||||||||||||100.0%] 11[||||||||||||||||||||||||100.0%]
4[|||||||||||||||||||||||100.0%] 12[||||||||||||||||||||||||100.0%]
5[|| 2.6%] 13[||||||||||||||||||||||||100.0%]
6[| 0.6%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
Passing %SCX_PICK_IDLE_IN_PREF to scx_bpf_select_cpu_pref() to enforce
strict selection on the preferred CPUs (with the same workload):
0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext.c | 1 +
kernel/sched/ext_idle.c | 60 ++++++++++++++++++++++++
tools/sched_ext/include/scx/common.bpf.h | 2 +
3 files changed, 63 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index a28ddd7655ba8..8ee4818de908b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -465,6 +465,7 @@ struct sched_ext_ops {
* idle CPU tracking and the following helpers become unavailable:
*
* - scx_bpf_select_cpu_dfl()
+ * - scx_bpf_select_cpu_pref()
* - scx_bpf_test_and_clear_cpu_idle()
* - scx_bpf_pick_idle_cpu()
*
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 9b002e109404b..24cba7ddceec4 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -907,6 +907,65 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
return prev_cpu;
}
+/**
+ * scx_bpf_select_cpu_pref - Pick an idle CPU usable by task @p,
+ * prioritizing those in @preferred_cpus
+ * @p: task_struct to select a CPU for
+ * @preferred_cpus: cpumask of preferred CPUs
+ * @prev_cpu: CPU @p was on previously
+ * @wake_flags: %SCX_WAKE_* flags
+ * @flags: %SCX_PICK_IDLE* flags
+ *
+ * Can only be called from ops.select_cpu() if the built-in CPU selection is
+ * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
+ * @p, @prev_cpu and @wake_flags match ops.select_cpu().
+ *
+ * Returns the selected idle CPU, which will be automatically awakened upon
+ * returning from ops.select_cpu() and can be used for direct dispatch, or
+ * a negative value if no idle CPU is available.
+ */
+__bpf_kfunc s32 scx_bpf_select_cpu_pref(struct task_struct *p,
+ const struct cpumask *preferred_cpus,
+ s32 prev_cpu, u64 wake_flags, u64 flags)
+{
+#ifdef CONFIG_SMP
+ struct cpumask *preferred = NULL;
+ bool is_idle = false;
+#endif
+
+ if (!ops_cpu_valid(prev_cpu, NULL))
+ return -EINVAL;
+
+ if (!check_builtin_idle_enabled())
+ return -EBUSY;
+
+ if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
+ return -EPERM;
+
+#ifdef CONFIG_SMP
+ preempt_disable();
+
+ /*
+ * As an optimization, do not update the local idle mask when
+ * p->cpus_ptr is passed directly in @preferred_cpus.
+ */
+ if (preferred_cpus != p->cpus_ptr) {
+ preferred = this_cpu_cpumask_var_ptr(local_idle_cpumask);
+ if (!cpumask_and(preferred, p->cpus_ptr, preferred_cpus))
+ preferred = NULL;
+ }
+ prev_cpu = scx_select_cpu_dfl(p, preferred, prev_cpu, wake_flags, flags, &is_idle);
+ if (!is_idle)
+ prev_cpu = -EBUSY;
+
+ preempt_enable();
+#else
+ prev_cpu = -EBUSY;
+#endif
+
+ return prev_cpu;
+}
+
/**
* scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the
* idle-tracking per-CPU cpumask of a target NUMA node.
@@ -1215,6 +1274,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_pref, KF_RCU)
BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index dc4333d23189f..a33e709ec12ab 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -47,6 +47,8 @@ static inline void ___vmlinux_h_sanity_check___(void)
}
s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
+s32 scx_bpf_select_cpu_pref(struct task_struct *p, const struct cpumask *preferred_cpus,
+ s32 prev_cpu, u64 wake_flags, u64 flags) __ksym __weak;
s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym __weak;
void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym __weak;
--
2.48.1
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 4/4] selftests/sched_ext: Add test for scx_bpf_select_cpu_pref()
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
` (2 preceding siblings ...)
2025-03-06 18:18 ` [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref() Andrea Righi
@ 2025-03-06 18:18 ` Andrea Righi
2025-03-06 18:34 ` [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Tejun Heo
2025-03-07 3:14 ` Changwoo Min
5 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:18 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Changwoo Min; +Cc: bpf, linux-kernel
Add a selftest to validate the behavior of the built-in idle CPU
selection policy with preferred CPUs, using scx_bpf_select_cpu_pref().
Signed-off-by: Andrea Righi <arighi@nvidia.com>
---
tools/testing/selftests/sched_ext/Makefile | 1 +
.../selftests/sched_ext/pref_cpus.bpf.c | 95 +++++++++++++++++++
tools/testing/selftests/sched_ext/pref_cpus.c | 58 +++++++++++
3 files changed, 154 insertions(+)
create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.bpf.c
create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.c
diff --git a/tools/testing/selftests/sched_ext/Makefile b/tools/testing/selftests/sched_ext/Makefile
index f4531327b8e76..44fd180111389 100644
--- a/tools/testing/selftests/sched_ext/Makefile
+++ b/tools/testing/selftests/sched_ext/Makefile
@@ -173,6 +173,7 @@ auto-test-targets := \
maybe_null \
minimal \
numa \
+ pref_cpus \
prog_run \
reload_loop \
select_cpu_dfl \
diff --git a/tools/testing/selftests/sched_ext/pref_cpus.bpf.c b/tools/testing/selftests/sched_ext/pref_cpus.bpf.c
new file mode 100644
index 0000000000000..460f5a54f9749
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/pref_cpus.bpf.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A scheduler that validates the behavior of scx_bpf_select_cpu_pref() by
+ * selecting idle CPUs strictly within a subset of preferred CPUs.
+ *
+ * Copyright (c) 2025 Andrea Righi <arighi@nvidia.com>
+ */
+
+#include <scx/common.bpf.h>
+
+char _license[] SEC("license") = "GPL";
+
+UEI_DEFINE(uei);
+
+const volatile unsigned int __COMPAT_SCX_PICK_IDLE_IN_PREF;
+
+private(PREF_CPUS) struct bpf_cpumask __kptr * preferred_cpumask;
+
+s32 BPF_STRUCT_OPS(pref_cpus_select_cpu,
+ struct task_struct *p, s32 prev_cpu, u64 wake_flags)
+{
+ const struct cpumask *preferred;
+ s32 cpu;
+
+ preferred = cast_mask(preferred_cpumask);
+ if (!preferred) {
+ scx_bpf_error("preferred domain not initialized");
+ return -EINVAL;
+ }
+
+ /*
+ * Select an idle CPU strictly within the preferred domain.
+ */
+ cpu = scx_bpf_select_cpu_pref(p, preferred, prev_cpu, wake_flags,
+ __COMPAT_SCX_PICK_IDLE_IN_PREF);
+ if (cpu >= 0) {
+ if (scx_bpf_test_and_clear_cpu_idle(cpu))
+ scx_bpf_error("CPU %d should be marked as busy", cpu);
+
+ if (__COMPAT_SCX_PICK_IDLE_IN_PREF &&
+ bpf_cpumask_subset(preferred, p->cpus_ptr) &&
+ !bpf_cpumask_test_cpu(cpu, preferred))
+ scx_bpf_error("CPU %d not in the preferred domain for %d (%s)",
+ cpu, p->pid, p->comm);
+
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
+
+ return cpu;
+ }
+
+ return prev_cpu;
+}
+
+s32 BPF_STRUCT_OPS_SLEEPABLE(pref_cpus_init)
+{
+ struct bpf_cpumask *mask;
+
+ mask = bpf_cpumask_create();
+ if (!mask)
+ return -ENOMEM;
+
+ mask = bpf_kptr_xchg(&preferred_cpumask, mask);
+ if (mask)
+ bpf_cpumask_release(mask);
+
+ bpf_rcu_read_lock();
+
+ /*
+ * Assign the first online CPU to the preferred domain.
+ */
+ mask = preferred_cpumask;
+ if (mask) {
+ const struct cpumask *online = scx_bpf_get_online_cpumask();
+
+ bpf_cpumask_set_cpu(bpf_cpumask_first(online), mask);
+ scx_bpf_put_cpumask(online);
+ }
+
+ bpf_rcu_read_unlock();
+
+ return 0;
+}
+
+void BPF_STRUCT_OPS(pref_cpus_exit, struct scx_exit_info *ei)
+{
+ UEI_RECORD(uei, ei);
+}
+
+SEC(".struct_ops.link")
+struct sched_ext_ops pref_cpus_ops = {
+ .select_cpu = (void *)pref_cpus_select_cpu,
+ .init = (void *)pref_cpus_init,
+ .exit = (void *)pref_cpus_exit,
+ .name = "pref_cpus",
+};
diff --git a/tools/testing/selftests/sched_ext/pref_cpus.c b/tools/testing/selftests/sched_ext/pref_cpus.c
new file mode 100644
index 0000000000000..75a09a355e1db
--- /dev/null
+++ b/tools/testing/selftests/sched_ext/pref_cpus.c
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2025 Andrea Righi <arighi@nvidia.com>
+ */
+#include <bpf/bpf.h>
+#include <scx/common.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include "pref_cpus.bpf.skel.h"
+#include "scx_test.h"
+
+static enum scx_test_status setup(void **ctx)
+{
+ struct pref_cpus *skel;
+
+ skel = pref_cpus__open();
+ SCX_FAIL_IF(!skel, "Failed to open");
+ SCX_ENUM_INIT(skel);
+ skel->rodata->__COMPAT_SCX_PICK_IDLE_IN_PREF = SCX_PICK_IDLE_IN_PREF;
+ SCX_FAIL_IF(pref_cpus__load(skel), "Failed to load skel");
+
+ *ctx = skel;
+
+ return SCX_TEST_PASS;
+}
+
+static enum scx_test_status run(void *ctx)
+{
+ struct pref_cpus *skel = ctx;
+ struct bpf_link *link;
+
+ link = bpf_map__attach_struct_ops(skel->maps.pref_cpus_ops);
+ SCX_FAIL_IF(!link, "Failed to attach scheduler");
+
+ /* Just sleeping is fine, plenty of scheduling events happening */
+ sleep(1);
+
+ SCX_EQ(skel->data->uei.kind, EXIT_KIND(SCX_EXIT_NONE));
+ bpf_link__destroy(link);
+
+ return SCX_TEST_PASS;
+}
+
+static void cleanup(void *ctx)
+{
+ struct pref_cpus *skel = ctx;
+
+ pref_cpus__destroy(skel);
+}
+
+struct scx_test pref_cpus = {
+ .name = "pref_cpus",
+ .description = "Verify scx_bpf_select_cpu_pref()",
+ .setup = setup,
+ .run = run,
+ .cleanup = cleanup,
+};
+REGISTER_SCX_TEST(&pref_cpus)
--
2.48.1
* Re: [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
` (3 preceding siblings ...)
2025-03-06 18:18 ` [PATCH 4/4] selftests/sched_ext: Add test for scx_bpf_select_cpu_pref() Andrea Righi
@ 2025-03-06 18:34 ` Tejun Heo
2025-03-06 18:54 ` Andrea Righi
2025-03-07 3:14 ` Changwoo Min
5 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2025-03-06 18:34 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, bpf, linux-kernel
Hello,
On Thu, Mar 06, 2025 at 07:18:03PM +0100, Andrea Righi wrote:
> To implement this, introduce a new helper kfunc scx_bpf_select_cpu_pref()
> that allows to specify a cpumask of preferred CPUs:
>
> s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> const struct cpumask *preferred_cpus,
> s32 prev_cpu, u64 wake_flags, u64 flags);
>
> Moreover, introduce the new idle flag %SCX_PICK_IDLE_IN_PREF that can be
> used to enforce selection strictly within the preferred domain.
Would something like scx_bpf_select_cpu_and() work, which is only allowed to
pick in the intersection (i.e. always SCX_PICK_IDLE_IN_PREF)? I'm not sure
how much more beneficial a built-in two-level mechanism is, especially given
that it wouldn't be too uncommon to need a multi-level pick - e.g. within the
L3, then within the NUMA node, and so on.
Thanks.
--
tejun
* Re: [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
2025-03-06 18:34 ` [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Tejun Heo
@ 2025-03-06 18:54 ` Andrea Righi
2025-03-06 18:58 ` Tejun Heo
0 siblings, 1 reply; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 18:54 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, bpf, linux-kernel
Hi Tejun,
On Thu, Mar 06, 2025 at 08:34:15AM -1000, Tejun Heo wrote:
> Hello,
>
> On Thu, Mar 06, 2025 at 07:18:03PM +0100, Andrea Righi wrote:
> > To implement this, introduce a new helper kfunc scx_bpf_select_cpu_pref()
> > that allows to specify a cpumask of preferred CPUs:
> >
> > s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> > const struct cpumask *preferred_cpus,
> > s32 prev_cpu, u64 wake_flags, u64 flags);
> >
> > Moreover, introduce the new idle flag %SCX_PICK_IDLE_IN_PREF that can be
> > used to enforce selection strictly within the preferred domain.
>
> Would something like scx_bpf_select_cpu_and() work, which is only allowed to
> pick in the intersection (i.e. always SCX_PICK_IDLE_IN_PREF)? I'm not sure
> how much more beneficial a built-in two-level mechanism is, especially given
> that it wouldn't be too uncommon to need a multi-level pick - e.g. within the
> L3, then within the NUMA node, and so on.
Just to make sure I understand, you mean provide two separate kfuncs:
scx_bpf_select_cpu_and() and scx_bpf_select_cpu_pref(), instead of
introducing the flag?
Thanks,
-Andrea
* Re: [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
2025-03-06 18:54 ` Andrea Righi
@ 2025-03-06 18:58 ` Tejun Heo
2025-03-06 19:02 ` Andrea Righi
0 siblings, 1 reply; 12+ messages in thread
From: Tejun Heo @ 2025-03-06 18:58 UTC (permalink / raw)
To: Andrea Righi; +Cc: David Vernet, Changwoo Min, bpf, linux-kernel
On Thu, Mar 06, 2025 at 07:54:34PM +0100, Andrea Righi wrote:
> Just to make sure I understand, you mean provide two separate kfuncs:
> scx_bpf_select_cpu_and() and scx_bpf_select_cpu_pref(), instead of
> introducing the flag?
Oh I meant just having scx_bpf_select_cpu_and(). The caller can just call it
twice for _pref() behavior, right?
Thanks.
--
tejun
* Re: [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
2025-03-06 18:58 ` Tejun Heo
@ 2025-03-06 19:02 ` Andrea Righi
0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-06 19:02 UTC (permalink / raw)
To: Tejun Heo; +Cc: David Vernet, Changwoo Min, bpf, linux-kernel
On Thu, Mar 06, 2025 at 08:58:27AM -1000, Tejun Heo wrote:
> On Thu, Mar 06, 2025 at 07:54:34PM +0100, Andrea Righi wrote:
> > Just to make sure I understand, you mean provide two separate kfuncs:
> > scx_bpf_select_cpu_and() and scx_bpf_select_cpu_pref(), instead of
> > introducing the flag?
>
> Oh I meant just having scx_bpf_select_cpu_and(). The caller can just call it
> twice for _pref() behavior, right?
Oh I see, you call it for the pref CPUs first and then for all the CPUs to
get the same behavior (similar to what we do with the SMT idle cores).
Yeah, that can work. Good idea!
Thanks,
-Andrea
* Re: [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs
2025-03-06 18:18 [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Andrea Righi
` (4 preceding siblings ...)
2025-03-06 18:34 ` [PATCHSET sched_ext/for-6.15] sched_ext: Enhance built-in idle selection with preferred CPUs Tejun Heo
@ 2025-03-07 3:14 ` Changwoo Min
5 siblings, 0 replies; 12+ messages in thread
From: Changwoo Min @ 2025-03-07 3:14 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet; +Cc: bpf, linux-kernel
Hi Andrea,
Thank you for submitting the patch set.
On 25. 3. 7. 03:18, Andrea Righi wrote:
> Many scx schedulers define their own concept of scheduling domains to
> represent topology characteristics, such as heterogeneous architectures
> (e.g., big.LITTLE, P-cores/E-cores), or to categorize tasks based on
> specific properties (e.g., setting the soft-affinity of certain tasks to a
> subset of CPUs).
>
> Currently, there is no mechanism to share these domains with the built-in
> idle CPU selection policy. As a result, schedulers often implement their
> own idle CPU selection policies, which are typically similar to one
> another, leading to a lot of code duplication.
>
> To address this, extend the built-in idle CPU selection policy introducing
> the concept of preferred CPUs.
>
> With this concept, BPF schedulers can apply the built-in idle CPU selection
> policy to a subset of preferred CPUs, allowing them to implement their own
> scheduling domains while still using the topology optimizations
> optimizations of the built-in policy, preventing code duplication across
Typo here. There are two "optimizations".
> different schedulers.
>
> To implement this, introduce a new helper kfunc scx_bpf_select_cpu_pref()
> that allows to specify a cpumask of preferred CPUs:
>
> s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> const struct cpumask *preferred_cpus,
> s32 prev_cpu, u64 wake_flags, u64 flags);
>
> Moreover, introduce the new idle flag %SCX_PICK_IDLE_IN_PREF that can be
> used to enforce selection strictly within the preferred domain.
>
> Example usage
> =============
>
> s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
> s32 prev_cpu, u64 wake_flags)
> {
> const struct cpumask *dom = task_domain(p) ?: p->cpus_ptr;
> s32 cpu;
>
> /*
> * Pick an idle CPU in the task's domain. If no CPU is found,
> * extend the search outside the domain.
> */
> cpu = scx_bpf_select_cpu_pref(p, dom, prev_cpu, wake_flags, 0);
> if (cpu >= 0) {
> scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> return cpu;
> }
>
> return prev_cpu;
> }
>
> Results
> =======
>
> Load distribution on a 4 sockets / 4 cores per socket system, simulated
> using virtme-ng, running a modified version of scx_bpfland that uses the
> new helper scx_bpf_select_cpu_pref() and 0xff00 as preferred domain:
>
> $ vng --cpu 16,sockets=4,cores=4,threads=1
>
> Starting 12 CPU hogs to fill the preferred domain:
>
> $ stress-ng -c 12
> ...
> 0[|||||||||||||||||||||||100.0%] 8[||||||||||||||||||||||||100.0%]
> 1[| 1.3%] 9[||||||||||||||||||||||||100.0%]
> 2[|||||||||||||||||||||||100.0%] 10[||||||||||||||||||||||||100.0%]
> 3[|||||||||||||||||||||||100.0%] 11[||||||||||||||||||||||||100.0%]
> 4[|||||||||||||||||||||||100.0%] 12[||||||||||||||||||||||||100.0%]
> 5[|| 2.6%] 13[||||||||||||||||||||||||100.0%]
> 6[| 0.6%] 14[||||||||||||||||||||||||100.0%]
> 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
>
> Passing %SCX_PICK_IDLE_IN_PREF to scx_bpf_select_cpu_pref() to enforce
> strict selection on the preferred CPUs (with the same workload):
>
> 0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
> 1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
> 2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
> 3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
> 4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
> 5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
> 6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
> 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
>
> Andrea Righi (4):
> sched_ext: idle: Honor idle flags in the built-in idle selection policy
> sched_ext: idle: Introduce the concept of preferred CPUs
> sched_ext: idle: Introduce scx_bpf_select_cpu_pref()
> selftests/sched_ext: Add test for scx_bpf_select_cpu_pref()
>
> kernel/sched/ext.c | 4 +-
> kernel/sched/ext_idle.c | 235 ++++++++++++++++++----
> kernel/sched/ext_idle.h | 3 +-
> tools/sched_ext/include/scx/common.bpf.h | 2 +
> tools/sched_ext/include/scx/compat.h | 1 +
> tools/testing/selftests/sched_ext/Makefile | 1 +
> tools/testing/selftests/sched_ext/pref_cpus.bpf.c | 95 +++++++++
> tools/testing/selftests/sched_ext/pref_cpus.c | 58 ++++++
> 8 files changed, 354 insertions(+), 45 deletions(-)
> create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.bpf.c
> create mode 100644 tools/testing/selftests/sched_ext/pref_cpus.c
>
* Re: [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref()
2025-03-06 18:18 ` [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref() Andrea Righi
@ 2025-03-07 3:15 ` Changwoo Min
2025-03-07 6:35 ` Andrea Righi
0 siblings, 1 reply; 12+ messages in thread
From: Changwoo Min @ 2025-03-07 3:15 UTC (permalink / raw)
To: Andrea Righi, Tejun Heo, David Vernet; +Cc: bpf, linux-kernel
On 25. 3. 7. 03:18, Andrea Righi wrote:
> Provide a new kfunc that can be used to apply the built-in idle CPU
> selection policy to a subset of preferred CPUs:
>
> s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> const struct cpumask *preferred_cpus,
> s32 prev_cpu, u64 wake_flags, u64 flags);
>
> This new helper is basically an extension of scx_bpf_select_cpu_dfl().
> However, when an idle CPU can't be found, it returns a negative value
> instead of @prev_cpu, aligning its behavior more closely with
> scx_bpf_pick_idle_cpu().
>
> It also accepts %SCX_PICK_IDLE_* flags, which can be used to enforce
> strict selection to the preferred CPUs (%SCX_PICK_IDLE_IN_PREF) or to
> @prev_cpu's node (%SCX_PICK_IDLE_IN_NODE), or to request only a
> full-idle SMT core (%SCX_PICK_IDLE_CORE), while applying the built-in
> selection logic.
>
> With this helper, BPF schedulers can apply the built-in idle CPU
> selection policy to a generic CPU domain, with strict or soft selection
> requirements.
>
> In the future we can also consider deprecating scx_bpf_select_cpu_dfl()
> and replace it with scx_bpf_select_cpu_pref(), as the latter provides
> the same functionality, with the addition of the preferred domain logic.
>
> Example usage
> =============
>
> Possible usage in ops.select_cpu():
>
> s32 BPF_STRUCT_OPS(foo_select_cpu, struct task_struct *p,
> s32 prev_cpu, u64 wake_flags)
> {
> const struct cpumask *dom = task_domain(p) ?: p->cpus_ptr;
> s32 cpu;
>
> /*
> * Pick an idle CPU in the task's domain. If no CPU is found,
> * extend the search outside the domain.
> */
> cpu = scx_bpf_select_cpu_pref(p, dom, prev_cpu, wake_flags, 0);
> if (cpu >= 0) {
> scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
> return cpu;
> }
>
> return prev_cpu;
> }
>
> Results
> =======
>
> Load distribution on a 4 sockets / 4 cores per socket system, simulated
> using virtme-ng, running a modified version of scx_bpfland that uses the
> new helper scx_bpf_select_cpu_pref() and 0xff00 as preferred domain:
>
> $ vng --cpu 16,sockets=4,cores=4,threads=1
>
> Starting 12 CPU hogs to fill the preferred domain:
>
> $ stress-ng -c 12
> ...
> 0[|||||||||||||||||||||||100.0%] 8[||||||||||||||||||||||||100.0%]
> 1[| 1.3%] 9[||||||||||||||||||||||||100.0%]
> 2[|||||||||||||||||||||||100.0%] 10[||||||||||||||||||||||||100.0%]
> 3[|||||||||||||||||||||||100.0%] 11[||||||||||||||||||||||||100.0%]
> 4[|||||||||||||||||||||||100.0%] 12[||||||||||||||||||||||||100.0%]
> 5[|| 2.6%] 13[||||||||||||||||||||||||100.0%]
> 6[| 0.6%] 14[||||||||||||||||||||||||100.0%]
> 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
>
> Passing %SCX_PICK_IDLE_IN_PREF to scx_bpf_select_cpu_pref() to enforce
> strict selection on the preferred CPUs (with the same workload):
>
> 0[ 0.0%] 8[||||||||||||||||||||||||100.0%]
> 1[ 0.0%] 9[||||||||||||||||||||||||100.0%]
> 2[ 0.0%] 10[||||||||||||||||||||||||100.0%]
> 3[ 0.0%] 11[||||||||||||||||||||||||100.0%]
> 4[ 0.0%] 12[||||||||||||||||||||||||100.0%]
> 5[ 0.0%] 13[||||||||||||||||||||||||100.0%]
> 6[ 0.0%] 14[||||||||||||||||||||||||100.0%]
> 7[ 0.0%] 15[||||||||||||||||||||||||100.0%]
>
> Signed-off-by: Andrea Righi <arighi@nvidia.com>
> ---
> kernel/sched/ext.c | 1 +
> kernel/sched/ext_idle.c | 60 ++++++++++++++++++++++++
> tools/sched_ext/include/scx/common.bpf.h | 2 +
> 3 files changed, 63 insertions(+)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index a28ddd7655ba8..8ee4818de908b 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -465,6 +465,7 @@ struct sched_ext_ops {
> * idle CPU tracking and the following helpers become unavailable:
> *
> * - scx_bpf_select_cpu_dfl()
> + * - scx_bpf_select_cpu_pref()
> * - scx_bpf_test_and_clear_cpu_idle()
> * - scx_bpf_pick_idle_cpu()
> *
> diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
> index 9b002e109404b..24cba7ddceec4 100644
> --- a/kernel/sched/ext_idle.c
> +++ b/kernel/sched/ext_idle.c
> @@ -907,6 +907,65 @@ __bpf_kfunc s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> return prev_cpu;
> }
>
> +/**
> + * scx_bpf_select_cpu_pref - Pick an idle CPU usable by task @p,
> + * prioritizing those in @preferred_cpus
> + * @p: task_struct to select a CPU for
> + * @preferred_cpus: cpumask of preferred CPUs
> + * @prev_cpu: CPU @p was on previously
> + * @wake_flags: %SCX_WAKE_* flags
> + * @flags: %SCX_PICK_IDLE* flags
> + *
> + * Can only be called from ops.select_cpu() if the built-in CPU selection is
> + * enabled - ops.update_idle() is missing or %SCX_OPS_KEEP_BUILTIN_IDLE is set.
> + * @p, @prev_cpu and @wake_flags match ops.select_cpu().
> + *
> + * Returns the selected idle CPU, which will be automatically awakened upon
> + * returning from ops.select_cpu() and can be used for direct dispatch, or
> + * a negative value if no idle CPU is available.
> + */
> +__bpf_kfunc s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> + const struct cpumask *preferred_cpus,
> + s32 prev_cpu, u64 wake_flags, u64 flags)
> +{
> +#ifdef CONFIG_SMP
> + struct cpumask *preferred = NULL;
> + bool is_idle = false;
> +#endif
> +
> + if (!ops_cpu_valid(prev_cpu, NULL))
> + return -EINVAL;
> +
> + if (!check_builtin_idle_enabled())
> + return -EBUSY;
> +
> + if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> + return -EPERM;
> +
> +#ifdef CONFIG_SMP
> + preempt_disable();
> +
> + /*
> + * As an optimization, do not update the local idle mask when
> + * p->cpus_ptr is passed directly in @preferred_cpus.
> + */
> + if (preferred_cpus != p->cpus_ptr) {
> + preferred = this_cpu_cpumask_var_ptr(local_idle_cpumask);
> + if (!cpumask_and(preferred, p->cpus_ptr, preferred_cpus))
> + preferred = NULL;
I think it would be better to move cpumask_and() inside
scx_select_cpu_dfl() because scx_select_cpu_dfl() assumes that
anyway. That will make the code easier to read and avoid
potential mistakes when extending scx_select_cpu_dfl() in the
future.
> + }
> + prev_cpu = scx_select_cpu_dfl(p, preferred, prev_cpu, wake_flags, flags, &is_idle);
> + if (!is_idle)
> + prev_cpu = -EBUSY;
> +
> + preempt_enable();
> +#else
> + prev_cpu = -EBUSY;
> +#endif
> +
> + return prev_cpu;
> +}
> +
> /**
> * scx_bpf_get_idle_cpumask_node - Get a referenced kptr to the
> * idle-tracking per-CPU cpumask of a target NUMA node.
> @@ -1215,6 +1274,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_idle = {
>
> BTF_KFUNCS_START(scx_kfunc_ids_select_cpu)
> BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_select_cpu_pref, KF_RCU)
> BTF_KFUNCS_END(scx_kfunc_ids_select_cpu)
>
> static const struct btf_kfunc_id_set scx_kfunc_set_select_cpu = {
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index dc4333d23189f..a33e709ec12ab 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -47,6 +47,8 @@ static inline void ___vmlinux_h_sanity_check___(void)
> }
>
> s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;
> +s32 scx_bpf_select_cpu_pref(struct task_struct *p, const struct cpumask *preferred_cpus,
> + s32 prev_cpu, u64 wake_flags, u64 flags) __ksym __weak;
> s32 scx_bpf_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags, bool *is_idle) __ksym;
> void scx_bpf_dsq_insert(struct task_struct *p, u64 dsq_id, u64 slice, u64 enq_flags) __ksym __weak;
> void scx_bpf_dsq_insert_vtime(struct task_struct *p, u64 dsq_id, u64 slice, u64 vtime, u64 enq_flags) __ksym __weak;
* Re: [PATCH 3/4] sched_ext: idle: Introduce scx_bpf_select_cpu_pref()
2025-03-07 3:15 ` Changwoo Min
@ 2025-03-07 6:35 ` Andrea Righi
0 siblings, 0 replies; 12+ messages in thread
From: Andrea Righi @ 2025-03-07 6:35 UTC (permalink / raw)
To: Changwoo Min; +Cc: Tejun Heo, David Vernet, bpf, linux-kernel
Hi Changwoo,
On Fri, Mar 07, 2025 at 12:15:04PM +0900, Changwoo Min wrote:
...
> > +__bpf_kfunc s32 scx_bpf_select_cpu_pref(struct task_struct *p,
> > + const struct cpumask *preferred_cpus,
> > + s32 prev_cpu, u64 wake_flags, u64 flags)
> > +{
> > +#ifdef CONFIG_SMP
> > + struct cpumask *preferred = NULL;
> > + bool is_idle = false;
> > +#endif
> > +
> > + if (!ops_cpu_valid(prev_cpu, NULL))
> > + return -EINVAL;
> > +
> > + if (!check_builtin_idle_enabled())
> > + return -EBUSY;
> > +
> > + if (!scx_kf_allowed(SCX_KF_SELECT_CPU))
> > + return -EPERM;
> > +
> > +#ifdef CONFIG_SMP
> > + preempt_disable();
> > +
> > + /*
> > + * As an optimization, do not update the local idle mask when
> > + * p->cpus_ptr is passed directly in @preferred_cpus.
> > + */
> > + if (preferred_cpus != p->cpus_ptr) {
> > + preferred = this_cpu_cpumask_var_ptr(local_idle_cpumask);
> > + if (!cpumask_and(preferred, p->cpus_ptr, preferred_cpus))
> > + preferred = NULL;
>
> I think it would be better to move cpumask_and() inside
> scx_select_cpu_dfl() because scx_select_cpu_dfl() assumes that
> anyway. That will make the code easier to read and avoid
> potential mistakes when extending scx_select_cpu_dfl() in the
> future.
I agree, will do this in the next version.
Thanks!
-Andrea