* [PATCH 01/17] sched_ext: Add ext_types.h for early subsystem-wide defs
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 02/17] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
` (17 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
Introduce kernel/sched/ext_types.h as the early-def header for the
sched_ext compilation unit. It is included from kernel/sched/build_policy.c
before ext_internal.h so that every later header and source file in the
unit sees its content without re-inclusion. Later patches add their types
here (struct scx_cid_topo, scx_cmask, scx_cid_shard, etc.) so the
subsystem has one place to stash types shared across the TU.
Move enum scx_consts (SCX_DSP_DFL_MAX_BATCH, SCX_WATCHDOG_MAX_TIMEOUT,
SCX_SUB_MAX_DEPTH, etc.) here as the initial content. Ops-facing
content stays in ext_internal.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/build_policy.c | 1 +
kernel/sched/ext_internal.h | 32 ---------------------------
kernel/sched/ext_types.h | 43 +++++++++++++++++++++++++++++++++++++
3 files changed, 44 insertions(+), 32 deletions(-)
create mode 100644 kernel/sched/ext_types.h
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index ffb386889218..1d92f7d7a19f 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -59,6 +59,7 @@
#ifdef CONFIG_SCHED_CLASS_EXT
# include <linux/btf_ids.h>
+# include "ext_types.h"
# include "ext_internal.h"
# include "ext_idle.h"
# include "ext.c"
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index a54903bb74b3..1b2ea6fa9fd6 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -8,38 +8,6 @@
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void)))
-enum scx_consts {
- SCX_DSP_DFL_MAX_BATCH = 32,
- SCX_DSP_MAX_LOOPS = 32,
- SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
-
- /* per-CPU chunk size for p->scx.tid allocation, see scx_alloc_tid() */
- SCX_TID_CHUNK = 1024,
-
- SCX_EXIT_BT_LEN = 64,
- SCX_EXIT_MSG_LEN = 1024,
- SCX_EXIT_DUMP_DFL_LEN = 32768,
-
- SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
-
- /*
- * Iterating all tasks may take a while. Periodically drop
- * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
- */
- SCX_TASK_ITER_BATCH = 32,
-
- SCX_BYPASS_HOST_NTH = 2,
-
- SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC,
- SCX_BYPASS_LB_DONOR_PCT = 125,
- SCX_BYPASS_LB_MIN_DELTA_DIV = 4,
- SCX_BYPASS_LB_BATCH = 256,
-
- SCX_REENQ_LOCAL_MAX_REPEAT = 256,
-
- SCX_SUB_MAX_DEPTH = 4,
-};
-
enum scx_exit_kind {
SCX_EXIT_NONE,
SCX_EXIT_DONE,
diff --git a/kernel/sched/ext_types.h b/kernel/sched/ext_types.h
new file mode 100644
index 000000000000..19299ec3920e
--- /dev/null
+++ b/kernel/sched/ext_types.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Early sched_ext type definitions.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_TYPES_H
+#define _KERNEL_SCHED_EXT_TYPES_H
+
+enum scx_consts {
+ SCX_DSP_DFL_MAX_BATCH = 32,
+ SCX_DSP_MAX_LOOPS = 32,
+ SCX_WATCHDOG_MAX_TIMEOUT = 30 * HZ,
+
+ /* per-CPU chunk size for p->scx.tid allocation, see scx_alloc_tid() */
+ SCX_TID_CHUNK = 1024,
+
+ SCX_EXIT_BT_LEN = 64,
+ SCX_EXIT_MSG_LEN = 1024,
+ SCX_EXIT_DUMP_DFL_LEN = 32768,
+
+ SCX_CPUPERF_ONE = SCHED_CAPACITY_SCALE,
+
+ /*
+ * Iterating all tasks may take a while. Periodically drop
+ * scx_tasks_lock to avoid causing e.g. CSD and RCU stalls.
+ */
+ SCX_TASK_ITER_BATCH = 32,
+
+ SCX_BYPASS_HOST_NTH = 2,
+
+ SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC,
+ SCX_BYPASS_LB_DONOR_PCT = 125,
+ SCX_BYPASS_LB_MIN_DELTA_DIV = 4,
+ SCX_BYPASS_LB_BATCH = 256,
+
+ SCX_REENQ_LOCAL_MAX_REPEAT = 256,
+
+ SCX_SUB_MAX_DEPTH = 4,
+};
+
+#endif /* _KERNEL_SCHED_EXT_TYPES_H */
--
2.54.0
* [PATCH 02/17] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
2026-04-28 20:35 ` [PATCH 01/17] sched_ext: Add ext_types.h for early subsystem-wide defs Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 03/17] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
` (16 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Rename the static ext.c helper and declare it in ext_internal.h so
ext_idle.c and the upcoming cid code can call it directly instead of
relying on build_policy.c textual inclusion.
Pure rename and visibility change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 22 +++++++++++-----------
kernel/sched/ext_idle.c | 6 +++---
kernel/sched/ext_internal.h | 2 ++
3 files changed, 16 insertions(+), 14 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 11893f00be06..980231c547ec 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1062,7 +1062,7 @@ static inline bool __cpu_valid(s32 cpu)
}
/**
- * ops_cpu_valid - Verify a cpu number, to be used on ops input args
+ * scx_cpu_valid - Verify a cpu number, to be used on ops input args
* @sch: scx_sched to abort on error
* @cpu: cpu number which came from a BPF ops
* @where: extra information reported on error
@@ -1071,7 +1071,7 @@ static inline bool __cpu_valid(s32 cpu)
* Verify that it is in range and one of the possible cpus. If invalid, trigger
* an ops error.
*/
-static bool ops_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
+bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
{
if (__cpu_valid(cpu)) {
return true;
@@ -1686,7 +1686,7 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
- if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
+ if (!scx_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
return find_global_dsq(sch, tcpu);
return &cpu_rq(cpu)->scx.local_dsq;
@@ -3269,7 +3269,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
this_rq()->scx.in_select_cpu = false;
p->scx.selected_cpu = cpu;
*ddsp_taskp = NULL;
- if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()"))
+ if (scx_cpu_valid(sch, cpu, "from ops.select_cpu()"))
return cpu;
else
return prev_cpu;
@@ -8791,7 +8791,7 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
struct rq *this_rq;
unsigned long irq_flags;
- if (!ops_cpu_valid(sch, cpu, NULL))
+ if (!scx_cpu_valid(sch, cpu, NULL))
return;
local_irq_save(irq_flags);
@@ -8888,7 +8888,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id, const struct bpf_prog_aux *aux
} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
- if (ops_cpu_valid(sch, cpu, NULL)) {
+ if (scx_cpu_valid(sch, cpu, NULL)) {
ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr);
goto out;
}
@@ -9277,7 +9277,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux)
guard(rcu)();
sch = scx_prog_sched(aux);
- if (likely(sch) && ops_cpu_valid(sch, cpu, NULL))
+ if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
return arch_scale_cpu_capacity(cpu);
else
return SCX_CPUPERF_ONE;
@@ -9305,7 +9305,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux)
guard(rcu)();
sch = scx_prog_sched(aux);
- if (likely(sch) && ops_cpu_valid(sch, cpu, NULL))
+ if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
return arch_scale_freq_capacity(cpu);
else
return SCX_CPUPERF_ONE;
@@ -9341,7 +9341,7 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_au
return;
}
- if (ops_cpu_valid(sch, cpu, NULL)) {
+ if (scx_cpu_valid(sch, cpu, NULL)) {
struct rq *rq = cpu_rq(cpu), *locked_rq = scx_locked_rq();
struct rq_flags rf;
@@ -9454,7 +9454,7 @@ __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu, const struct bpf_prog_aux *aux)
if (unlikely(!sch))
return NULL;
- if (!ops_cpu_valid(sch, cpu, NULL))
+ if (!scx_cpu_valid(sch, cpu, NULL))
return NULL;
if (!sch->warned_deprecated_rq) {
@@ -9511,7 +9511,7 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
if (unlikely(!sch))
return NULL;
- if (!ops_cpu_valid(sch, cpu, NULL))
+ if (!scx_cpu_valid(sch, cpu, NULL))
return NULL;
return rcu_dereference(cpu_rq(cpu)->curr);
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index f0f4d9500997..860c4634f60e 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -916,7 +916,7 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
bool we_locked = false;
s32 cpu;
- if (!ops_cpu_valid(sch, prev_cpu, NULL))
+ if (!scx_cpu_valid(sch, prev_cpu, NULL))
return -EINVAL;
if (!check_builtin_idle_enabled(sch))
@@ -989,7 +989,7 @@ __bpf_kfunc s32 scx_bpf_cpu_node(s32 cpu, const struct bpf_prog_aux *aux)
guard(rcu)();
sch = scx_prog_sched(aux);
- if (unlikely(!sch) || !ops_cpu_valid(sch, cpu, NULL))
+ if (unlikely(!sch) || !scx_cpu_valid(sch, cpu, NULL))
return NUMA_NO_NODE;
return cpu_to_node(cpu);
}
@@ -1271,7 +1271,7 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu, const struct bpf_prog_
if (!check_builtin_idle_enabled(sch))
return false;
- if (!ops_cpu_valid(sch, cpu, NULL))
+ if (!scx_cpu_valid(sch, cpu, NULL))
return false;
return scx_idle_test_and_clear_cpu(cpu);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 1b2ea6fa9fd6..f59cd58b8175 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1352,6 +1352,8 @@ DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
+bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
+
/*
* Return the rq currently locked from an scx callback, or NULL if no rq is
* locked.
--
2.54.0
* [PATCH 03/17] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
2026-04-28 20:35 ` [PATCH 01/17] sched_ext: Add ext_types.h for early subsystem-wide defs Tejun Heo
2026-04-28 20:35 ` [PATCH 02/17] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 04/17] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
` (15 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Things shared across multiple .c files belong in a header. scx_exit() /
scx_error() (and their scx_vexit() / scx_verror() siblings) are already
called from ext_idle.c and the upcoming ext_cid.c, and it was only
build_policy.c's textual inclusion of ext.c that made the references
resolve. Move the whole family to ext_internal.h.
Pure visibility change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 13 ++++---------
kernel/sched/ext_internal.h | 8 ++++++++
2 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 980231c547ec..8d41b4e2cce6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -231,12 +231,10 @@ static void run_deferred(struct rq *rq);
static bool task_dead_and_done(struct task_struct *p);
static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
-static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
- s64 exit_code, const char *fmt, va_list args);
-static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
- enum scx_exit_kind kind, s64 exit_code,
- const char *fmt, ...)
+__printf(4, 5) bool scx_exit(struct scx_sched *sch,
+ enum scx_exit_kind kind, s64 exit_code,
+ const char *fmt, ...)
{
va_list args;
bool ret;
@@ -248,9 +246,6 @@ static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
return ret;
}
-#define scx_error(sch, fmt, args...) scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
-#define scx_verror(sch, fmt, args) scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
-
#define SCX_HAS_OP(sch, op) test_bit(SCX_OP_IDX(op), (sch)->has_op)
static long jiffies_delta_msecs(unsigned long at, unsigned long now)
@@ -6432,7 +6427,7 @@ static void scx_disable_irq_workfn(struct irq_work *irq_work)
kthread_queue_work(sch->helper, &sch->disable_work);
}
-static bool scx_vexit(struct scx_sched *sch,
+bool scx_vexit(struct scx_sched *sch,
enum scx_exit_kind kind, s64 exit_code,
const char *fmt, va_list args)
{
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index f59cd58b8175..d4960df23da4 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1354,6 +1354,14 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
+bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind, s64 exit_code,
+ const char *fmt, va_list args);
+__printf(4, 5) bool scx_exit(struct scx_sched *sch, enum scx_exit_kind kind,
+ s64 exit_code, const char *fmt, ...);
+
+#define scx_verror(sch, fmt, args) scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
+#define scx_error(sch, fmt, args...) scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
+
/*
* Return the rq currently locked from an scx callback, or NULL if no rq is
* locked.
--
2.54.0
* [PATCH 04/17] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu()
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (2 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 03/17] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 05/17] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
` (14 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Callers that already know the cpu is valid shouldn't have to pay for a
redundant check. scx_kick_cpu() is called from the in-kernel balance loop
break-out path with the current cpu (trivially valid) and from
scx_bpf_kick_cpu() with a BPF-supplied cpu that does need validation. Move
the check out of scx_kick_cpu() into scx_bpf_kick_cpu() so the backend is
reusable by callers that have already validated.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8d41b4e2cce6..e9cf9d8f4626 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8786,9 +8786,6 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
struct rq *this_rq;
unsigned long irq_flags;
- if (!scx_cpu_valid(sch, cpu, NULL))
- return;
-
local_irq_save(irq_flags);
this_rq = this_rq();
@@ -8851,7 +8848,7 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
guard(rcu)();
sch = scx_prog_sched(aux);
- if (likely(sch))
+ if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
scx_kick_cpu(sch, cpu, flags);
}
--
2.54.0
* [PATCH 05/17] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (3 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 04/17] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 06/17] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
` (13 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
cpu_acquire and cpu_release are deprecated and slated for removal. Move
their declarations to the end of struct sched_ext_ops so an upcoming
cid-form struct (sched_ext_ops_cid) can omit them entirely without
disturbing the offsets of the shared fields.
Switch the two SCX_HAS_OP() callers for these ops to direct field checks
since the relocated ops sit outside the SCX_OPI_END range covered by the
has_op bitmap.
scx_kf_allow_flags[] auto-sizes to the highest used SCX_OP_IDX, so
SCX_OP_IDX(cpu_release) moving to a higher index just enlarges the
sparse array; the lookup logic is unchanged.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 4 +--
kernel/sched/ext_internal.h | 54 ++++++++++++++++++++++---------------
2 files changed, 34 insertions(+), 24 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index e9cf9d8f4626..b197da2b960d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2823,7 +2823,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
* core. This callback complements ->cpu_release(), which is
* emitted in switch_class().
*/
- if (SCX_HAS_OP(sch, cpu_acquire))
+ if (sch->ops.cpu_acquire)
SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL);
rq->scx.cpu_released = false;
}
@@ -2969,7 +2969,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
* next time that balance_one() is invoked.
*/
if (!rq->scx.cpu_released) {
- if (SCX_HAS_OP(sch, cpu_release)) {
+ if (sch->ops.cpu_release) {
struct scx_cpu_release_args args = {
.reason = preempt_reason_from_class(next_class),
.task = next,
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index d4960df23da4..919d4aa08656 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -523,28 +523,6 @@ struct sched_ext_ops {
*/
void (*update_idle)(s32 cpu, bool idle);
- /**
- * @cpu_acquire: A CPU is becoming available to the BPF scheduler
- * @cpu: The CPU being acquired by the BPF scheduler.
- * @args: Acquire arguments, see the struct definition.
- *
- * A CPU that was previously released from the BPF scheduler is now once
- * again under its control.
- */
- void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
-
- /**
- * @cpu_release: A CPU is taken away from the BPF scheduler
- * @cpu: The CPU being released by the BPF scheduler.
- * @args: Release arguments, see the struct definition.
- *
- * The specified CPU is no longer under the control of the BPF
- * scheduler. This could be because it was preempted by a higher
- * priority sched_class, though there may be other reasons as well. The
- * caller should consult @args->reason to determine the cause.
- */
- void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
-
/**
* @init_task: Initialize a task to run in a BPF scheduler
* @p: task to initialize for BPF scheduling
@@ -835,6 +813,38 @@ struct sched_ext_ops {
/* internal use only, must be NULL */
void __rcu *priv;
+
+ /*
+ * Deprecated callbacks. Kept at the end of the struct so the cid-form
+ * struct (sched_ext_ops_cid) can omit them without affecting the
+ * shared field offsets. Use SCX_ENQ_IMMED instead. Sitting past
+ * SCX_OPI_END means has_op doesn't cover them, so SCX_HAS_OP() cannot
+ * be used; callers must test sch->ops.cpu_acquire / cpu_release
+ * directly.
+ */
+
+ /**
+ * @cpu_acquire: A CPU is becoming available to the BPF scheduler
+ * @cpu: The CPU being acquired by the BPF scheduler.
+ * @args: Acquire arguments, see the struct definition.
+ *
+ * A CPU that was previously released from the BPF scheduler is now once
+ * again under its control. Deprecated; use SCX_ENQ_IMMED instead.
+ */
+ void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
+
+ /**
+ * @cpu_release: A CPU is taken away from the BPF scheduler
+ * @cpu: The CPU being released by the BPF scheduler.
+ * @args: Release arguments, see the struct definition.
+ *
+ * The specified CPU is no longer under the control of the BPF
+ * scheduler. This could be because it was preempted by a higher
+ * priority sched_class, though there may be other reasons as well. The
+ * caller should consult @args->reason to determine the cause.
+ * Deprecated; use SCX_ENQ_IMMED instead.
+ */
+ void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
};
enum scx_opi {
--
2.54.0
* [PATCH 06/17] sched_ext: Make scx_enable() take scx_enable_cmd
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (4 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 05/17] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 07/17] sched_ext: Add topological CPU IDs (cids) Tejun Heo
` (12 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Pass struct scx_enable_cmd to scx_enable() rather than unpacking @ops
at every call site and re-packing into a fresh cmd inside. bpf_scx_reg()
now builds the cmd on its stack and hands it in; scx_enable() just
wires up the kthread work and waits.
Relocate struct scx_enable_cmd above scx_alloc_and_add_sched() so
upcoming patches that also want the cmd can see it.
No behavior change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index b197da2b960d..f9a1f217bc47 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6507,6 +6507,19 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
return pnode;
}
+/*
+ * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
+ * starvation. During the READY -> ENABLED task switching loop, the calling
+ * thread's sched_class gets switched from fair to ext. As fair has higher
+ * priority than ext, the calling thread can be indefinitely starved under
+ * fair-class saturation, leading to a system hang.
+ */
+struct scx_enable_cmd {
+ struct kthread_work work;
+ struct sched_ext_ops *ops;
+ int ret;
+};
+
/*
* Allocate and initialize a new scx_sched. @cgrp's reference is always
* consumed whether the function succeeds or fails.
@@ -6749,19 +6762,6 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
return 0;
}
-/*
- * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
- * starvation. During the READY -> ENABLED task switching loop, the calling
- * thread's sched_class gets switched from fair to ext. As fair has higher
- * priority than ext, the calling thread can be indefinitely starved under
- * fair-class saturation, leading to a system hang.
- */
-struct scx_enable_cmd {
- struct kthread_work work;
- struct sched_ext_ops *ops;
- int ret;
-};
-
static void scx_root_enable_workfn(struct kthread_work *work)
{
struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
@@ -7346,11 +7346,10 @@ static s32 __init scx_cgroup_lifetime_notifier_init(void)
core_initcall(scx_cgroup_lifetime_notifier_init);
#endif /* CONFIG_EXT_SUB_SCHED */
-static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
+static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
{
static struct kthread_worker *helper;
static DEFINE_MUTEX(helper_mutex);
- struct scx_enable_cmd cmd;
if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
cpu_possible_mask)) {
@@ -7374,16 +7373,15 @@ static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
}
#ifdef CONFIG_EXT_SUB_SCHED
- if (ops->sub_cgroup_id > 1)
- kthread_init_work(&cmd.work, scx_sub_enable_workfn);
+ if (cmd->ops->sub_cgroup_id > 1)
+ kthread_init_work(&cmd->work, scx_sub_enable_workfn);
else
#endif /* CONFIG_EXT_SUB_SCHED */
- kthread_init_work(&cmd.work, scx_root_enable_workfn);
- cmd.ops = ops;
+ kthread_init_work(&cmd->work, scx_root_enable_workfn);
- kthread_queue_work(READ_ONCE(helper), &cmd.work);
- kthread_flush_work(&cmd.work);
- return cmd.ret;
+ kthread_queue_work(READ_ONCE(helper), &cmd->work);
+ kthread_flush_work(&cmd->work);
+ return cmd->ret;
}
@@ -7555,7 +7553,9 @@ static int bpf_scx_check_member(const struct btf_type *t,
static int bpf_scx_reg(void *kdata, struct bpf_link *link)
{
- return scx_enable(kdata, link);
+ struct scx_enable_cmd cmd = { .ops = kdata };
+
+ return scx_enable(&cmd, link);
}
static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
--
2.54.0
* [PATCH 07/17] sched_ext: Add topological CPU IDs (cids)
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (5 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 06/17] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
` (11 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Raw cpu numbers are clumsy for sharding and cross-sched communication,
especially from BPF. The space is sparse, numerical closeness doesn't
track topological closeness (x86 hyperthreading often scatters SMT
siblings), and a range of cpu ids doesn't describe anything meaningful.
Sub-sched support makes this acute: cpu allocation, revocation, and
state constantly flow across sub-scheds. Passing whole cpumasks scales
poorly (every op scans 4K bits) and cpumasks are awkward in BPF.
cids assign every cpu a dense, topology-ordered id. CPUs sharing a core,
LLC, or NUMA node occupy contiguous cid ranges, so a topology unit
becomes a (start, length) slice. Communication passes slices; BPF can
process a u64 word of cids at a time.
Build the mapping once at root enable by walking online cpus node -> LLC
-> core. Possible-but-not-online cpus are appended at the tail of the cid
space with no-topo cids.
Expose kfuncs to map cpu <-> cid in either direction and to query each
cid's topology metadata.
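
For illustration, a minimal BPF-side sketch (not part of this patch) of
consuming the mapping. It assumes the usual tools/sched_ext scaffolding
(common.bpf.h, BPF_STRUCT_OPS); the callback name and the bpf_printk() are
placeholders only:

  /*
   * Hypothetical ops.running() snippet: translate the current cpu to its
   * cid and look up the per-cid topology. The mapping is static for the
   * lifetime of the loaded scheduler, so the results may be cached.
   */
  void BPF_STRUCT_OPS(example_running, struct task_struct *p)
  {
          struct scx_cid_topo topo;
          s32 cpu = bpf_get_smp_processor_id();
          s32 cid = scx_bpf_cpu_to_cid(cpu);

          if (cid < 0)
                  return;

          scx_bpf_cid_topo(cid, &topo);

          /*
           * CPUs sharing this cpu's LLC occupy the contiguous cid range
           * starting at topo.llc_cid, and topo.llc_idx can serve directly
           * as a per-LLC shard index. All fields are -1 for cids in the
           * no-topo tail section.
           */
          bpf_printk("cpu%d -> cid%d llc_idx=%d node_idx=%d back to cpu%d",
                     cpu, cid, topo.llc_idx, topo.node_idx,
                     scx_bpf_cid_to_cpu(cid));
  }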
v2: Use kzalloc_objs()/kmalloc_objs() for the three allocs in
scx_cid_arrays_alloc() (Cheng-Yang Chou).
v3: scx_cid_init() failure path now drops cpus_read_lock();
BUILD_BUG_ON tightened to match BPF cmask helpers' NR_CPUS<=8192.
(Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/build_policy.c | 2 +
kernel/sched/ext.c | 18 ++
kernel/sched/ext_cid.c | 301 +++++++++++++++++++++++
kernel/sched/ext_cid.h | 129 ++++++++++
kernel/sched/ext_types.h | 23 ++
tools/sched_ext/include/scx/common.bpf.h | 3 +
6 files changed, 476 insertions(+)
create mode 100644 kernel/sched/ext_cid.c
create mode 100644 kernel/sched/ext_cid.h
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index 1d92f7d7a19f..5e76c9177d54 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -61,8 +61,10 @@
# include <linux/btf_ids.h>
# include "ext_types.h"
# include "ext_internal.h"
+# include "ext_cid.h"
# include "ext_idle.h"
# include "ext.c"
+# include "ext_cid.c"
# include "ext_idle.c"
#endif
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index f9a1f217bc47..2b531256c763 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6820,6 +6820,18 @@ static void scx_root_enable_workfn(struct kthread_work *work)
*/
cpus_read_lock();
+ /*
+ * Build the cid mapping before publishing scx_root. The cid kfuncs
+ * dereference the cid arrays unconditionally once scx_prog_sched()
+ * returns non-NULL; the rcu_assign_pointer() below pairs with their
+ * rcu_dereference() to make the populated arrays visible.
+ */
+ ret = scx_cid_init(sch);
+ if (ret) {
+ cpus_read_unlock();
+ goto err_disable;
+ }
+
/*
* Make the scheduler instance visible. Must be inside cpus_read_lock().
* See handle_hotplug().
@@ -9888,6 +9900,12 @@ static int __init scx_init(void)
return ret;
}
+ ret = scx_cid_kfunc_init();
+ if (ret) {
+ pr_err("sched_ext: Failed to register cid kfuncs (%d)\n", ret);
+ return ret;
+ }
+
ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
if (ret) {
pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
new file mode 100644
index 000000000000..5b73900edc87
--- /dev/null
+++ b/kernel/sched/ext_cid.c
@@ -0,0 +1,301 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/cacheinfo.h>
+
+/*
+ * cid tables.
+ *
+ * Pointers are published once on first enable and never revoked. The default
+ * mapping is populated before ops.init() runs; scx_bpf_cid_override() commits
+ * before it returns. As long as the BPF scheduler only uses the tables from
+ * those points onward, it sees a consistent view.
+ */
+s16 *scx_cid_to_cpu_tbl;
+s16 *scx_cpu_to_cid_tbl;
+struct scx_cid_topo *scx_cid_topo;
+
+#define SCX_CID_TOPO_NEG (struct scx_cid_topo) { \
+ .core_cid = -1, .core_idx = -1, .llc_cid = -1, .llc_idx = -1, \
+ .node_cid = -1, .node_idx = -1, \
+}
+
+/*
+ * Return @cpu's LLC shared_cpu_map. If cacheinfo isn't populated (offline or
+ * !present), record @cpu in @fallbacks and return its node mask instead - the
+ * worst that can happen is that the cpu's LLC becomes coarser than reality.
+ */
+static const struct cpumask *cpu_llc_mask(int cpu, struct cpumask *fallbacks)
+{
+ struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+
+ if (!ci || !ci->info_list || !ci->num_leaves) {
+ cpumask_set_cpu(cpu, fallbacks);
+ return cpumask_of_node(cpu_to_node(cpu));
+ }
+ return &ci->info_list[ci->num_leaves - 1].shared_cpu_map;
+}
+
+/* Allocate the cid tables once on first enable; never freed. */
+static s32 scx_cid_arrays_alloc(void)
+{
+ u32 npossible = num_possible_cpus();
+ s16 *cid_to_cpu, *cpu_to_cid;
+ struct scx_cid_topo *cid_topo;
+
+ if (scx_cid_to_cpu_tbl)
+ return 0;
+
+ cid_to_cpu = kzalloc_objs(*scx_cid_to_cpu_tbl, npossible, GFP_KERNEL);
+ cpu_to_cid = kzalloc_objs(*scx_cpu_to_cid_tbl, nr_cpu_ids, GFP_KERNEL);
+ cid_topo = kmalloc_objs(*scx_cid_topo, npossible, GFP_KERNEL);
+
+ if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+ kfree(cid_to_cpu);
+ kfree(cpu_to_cid);
+ kfree(cid_topo);
+ return -ENOMEM;
+ }
+
+ WRITE_ONCE(scx_cid_to_cpu_tbl, cid_to_cpu);
+ WRITE_ONCE(scx_cpu_to_cid_tbl, cpu_to_cid);
+ WRITE_ONCE(scx_cid_topo, cid_topo);
+ return 0;
+}
+
+/**
+ * scx_cid_init - build the cid mapping
+ * @sch: the scx_sched being initialized; used as the scx_error() target
+ *
+ * See "Topological CPU IDs" in ext_cid.h for the model. Walk online cpus by
+ * intersection at each level (parent_scratch & this_level_mask), which keeps
+ * containment correct by construction and naturally splits a physical LLC
+ * straddling two NUMA nodes into two LLC units. The caller must hold
+ * cpus_read_lock.
+ */
+s32 scx_cid_init(struct scx_sched *sch)
+{
+ cpumask_var_t to_walk __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ cpumask_var_t node_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ cpumask_var_t llc_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ cpumask_var_t core_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ cpumask_var_t llc_fallback __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ cpumask_var_t online_no_topo __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ u32 next_cid = 0;
+ s32 next_node_idx = 0, next_llc_idx = 0, next_core_idx = 0;
+ s32 cpu, ret;
+
+ /* CMASK_MAX_WORDS in cid.bpf.h covers NR_CPUS up to 8192 */
+ BUILD_BUG_ON(NR_CPUS > 8192);
+
+ lockdep_assert_cpus_held();
+
+ ret = scx_cid_arrays_alloc();
+ if (ret)
+ return ret;
+
+ if (!zalloc_cpumask_var(&to_walk, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&node_scratch, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&llc_scratch, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&core_scratch, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&llc_fallback, GFP_KERNEL) ||
+ !zalloc_cpumask_var(&online_no_topo, GFP_KERNEL))
+ return -ENOMEM;
+
+ /* -1 sentinels for sparse-possible cpu id holes (0 is a valid cid) */
+ for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+ scx_cpu_to_cid_tbl[cpu] = -1;
+
+ cpumask_copy(to_walk, cpu_online_mask);
+
+ while (!cpumask_empty(to_walk)) {
+ s32 next_cpu = cpumask_first(to_walk);
+ s32 nid = cpu_to_node(next_cpu);
+ s32 node_cid = next_cid;
+ s32 node_idx;
+
+ /*
+ * No NUMA info: skip and let the tail loop assign a no-topo
+ * cid. cpumask_of_node(-1) is undefined.
+ */
+ if (nid < 0) {
+ cpumask_clear_cpu(next_cpu, to_walk);
+ continue;
+ }
+
+ node_idx = next_node_idx++;
+
+ /* node_scratch = to_walk & this node */
+ cpumask_and(node_scratch, to_walk, cpumask_of_node(nid));
+ if (WARN_ON_ONCE(!cpumask_test_cpu(next_cpu, node_scratch)))
+ return -EINVAL;
+
+ while (!cpumask_empty(node_scratch)) {
+ s32 ncpu = cpumask_first(node_scratch);
+ const struct cpumask *llc_mask = cpu_llc_mask(ncpu, llc_fallback);
+ s32 llc_cid = next_cid;
+ s32 llc_idx = next_llc_idx++;
+
+ /* llc_scratch = node_scratch & this llc */
+ cpumask_and(llc_scratch, node_scratch, llc_mask);
+ if (WARN_ON_ONCE(!cpumask_test_cpu(ncpu, llc_scratch)))
+ return -EINVAL;
+
+ while (!cpumask_empty(llc_scratch)) {
+ s32 lcpu = cpumask_first(llc_scratch);
+ const struct cpumask *sib = topology_sibling_cpumask(lcpu);
+ s32 core_cid = next_cid;
+ s32 core_idx = next_core_idx++;
+ s32 ccpu;
+
+ /* core_scratch = llc_scratch & this core */
+ cpumask_and(core_scratch, llc_scratch, sib);
+ if (WARN_ON_ONCE(!cpumask_test_cpu(lcpu, core_scratch)))
+ return -EINVAL;
+
+ for_each_cpu(ccpu, core_scratch) {
+ s32 cid = next_cid++;
+
+ scx_cid_to_cpu_tbl[cid] = ccpu;
+ scx_cpu_to_cid_tbl[ccpu] = cid;
+ scx_cid_topo[cid] = (struct scx_cid_topo){
+ .core_cid = core_cid,
+ .core_idx = core_idx,
+ .llc_cid = llc_cid,
+ .llc_idx = llc_idx,
+ .node_cid = node_cid,
+ .node_idx = node_idx,
+ };
+
+ cpumask_clear_cpu(ccpu, llc_scratch);
+ cpumask_clear_cpu(ccpu, node_scratch);
+ cpumask_clear_cpu(ccpu, to_walk);
+ }
+ }
+ }
+ }
+
+ /*
+ * No-topo section: any possible cpu without a cid - normally just the
+ * not-online ones. Collect any currently-online cpus that land here in
+ * @online_no_topo so we can warn about them at the end.
+ */
+ for_each_cpu(cpu, cpu_possible_mask) {
+ s32 cid;
+
+ if (__scx_cpu_to_cid(cpu) != -1)
+ continue;
+ if (cpu_online(cpu))
+ cpumask_set_cpu(cpu, online_no_topo);
+
+ cid = next_cid++;
+ scx_cid_to_cpu_tbl[cid] = cpu;
+ scx_cpu_to_cid_tbl[cpu] = cid;
+ scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+ }
+
+ if (!cpumask_empty(llc_fallback))
+ pr_warn("scx_cid: cpus without cacheinfo, using node mask as llc: %*pbl\n",
+ cpumask_pr_args(llc_fallback));
+ if (!cpumask_empty(online_no_topo))
+ pr_warn("scx_cid: online cpus with no usable topology: %*pbl\n",
+ cpumask_pr_args(online_no_topo));
+
+ return 0;
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_cid_to_cpu - Return the raw CPU id for @cid
+ * @cid: cid to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the raw CPU id for @cid. Trigger scx_error() and return -EINVAL if
+ * @cid is invalid. The cid<->cpu mapping is static for the lifetime of the
+ * loaded scheduler, so the BPF side can cache the result to avoid repeated
+ * kfunc invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cid_to_cpu(s32 cid, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return -EINVAL;
+ return scx_cid_to_cpu(sch, cid);
+}
+
+/**
+ * scx_bpf_cpu_to_cid - Return the cid for @cpu
+ * @cpu: cpu to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the cid for @cpu. Trigger scx_error() and return -EINVAL if @cpu is
+ * invalid. The cid<->cpu mapping is static for the lifetime of the loaded
+ * scheduler, so the BPF side can cache the result to avoid repeated kfunc
+ * invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cpu_to_cid(s32 cpu, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return -EINVAL;
+ return scx_cpu_to_cid(sch, cpu);
+}
+
+/**
+ * scx_bpf_cid_topo - Copy out per-cid topology info
+ * @cid: cid to look up
+ * @out__uninit: where to copy the topology info; fully written by this call
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Fill @out__uninit with the topology info for @cid. Trigger scx_error() if
+ * @cid is out of range. If @cid is valid but in the no-topo section, all fields
+ * are set to -1.
+ */
+__bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
+ const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch) || !cid_valid(sch, cid)) {
+ *out__uninit = SCX_CID_TOPO_NEG;
+ return;
+ }
+
+ *out__uninit = READ_ONCE(scx_cid_topo)[cid];
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_cid)
+BTF_ID_FLAGS(func, scx_bpf_cid_to_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_cid, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cid_topo, KF_IMPLICIT_ARGS)
+BTF_KFUNCS_END(scx_kfunc_ids_cid)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_cid,
+};
+
+int scx_cid_kfunc_init(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
+ register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
+ register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
+}
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
new file mode 100644
index 000000000000..1dbe8262ccdd
--- /dev/null
+++ b/kernel/sched/ext_cid.h
@@ -0,0 +1,129 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Topological CPU IDs (cids)
+ * --------------------------
+ *
+ * Raw cpu numbers are clumsy for sharding work and communication across
+ * topology units, especially from BPF: the space can be sparse, numerical
+ * closeness doesn't imply topological closeness (x86 hyperthreading often puts
+ * SMT siblings far apart), and a range of cpu ids doesn't mean anything.
+ * Sub-scheds make this acute - cpu allocation, revocation and other state are
+ * constantly communicated across sub-scheds, and passing whole cpumasks scales
+ * poorly with cpu count. cpumasks are also awkward in BPF: a variable-length
+ * kernel type sized for the maximum NR_CPUS (4k), with verbose helper sequences
+ * for every op.
+ *
+ * cids give every cpu a dense, topology-ordered id. CPUs sharing a core, LLC or
+ * NUMA node get contiguous cid ranges, so a topology unit becomes a (start,
+ * length) slice of cid space. Communication can pass a slice instead of a
+ * cpumask, and BPF code can process, for example, a u64 word's worth of cids at
+ * a time.
+ *
+ * The mapping is built once at root scheduler enable time by walking the
+ * topology of online cpus only. Going by online cpus is out of necessity:
+ * depending on the arch, topology info isn't reliably available for offline
+ * cpus. The expected usage model is restarting the scheduler on hotplug events
+ * so the mapping is rebuilt against the new online set. A scheduler that wants
+ * to handle hotplug without a restart can provide its own cid and shard mapping
+ * through the override interface.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_CID_H
+#define _KERNEL_SCHED_EXT_CID_H
+
+struct scx_sched;
+
+/*
+ * Cid space (total is always num_possible_cpus()) is laid out with
+ * topology-annotated cids first, then no-topo cids at the tail. The
+ * topology-annotated block covers the cpus that were online when scx_cid_init()
+ * ran and remains valid even after those cpus go offline. The tail block covers
+ * possible-but-not-online cpus and carries all-(-1) topo info (see
+ * scx_cid_topo); callers detect it via the -1 sentinels.
+ *
+ * See the comment above the table definitions in ext_cid.c for the
+ * memory-ordering and visibility contract.
+ */
+extern s16 *scx_cid_to_cpu_tbl;
+extern s16 *scx_cpu_to_cid_tbl;
+extern struct scx_cid_topo *scx_cid_topo;
+
+s32 scx_cid_init(struct scx_sched *sch);
+int scx_cid_kfunc_init(void);
+
+/**
+ * cid_valid - Verify a cid value, to be used on ops input args
+ * @sch: scx_sched to abort on error
+ * @cid: cid which came from a BPF ops
+ *
+ * Return true if @cid is in [0, num_possible_cpus()). On failure, trigger
+ * scx_error() and return false.
+ */
+static inline bool cid_valid(struct scx_sched *sch, s32 cid)
+{
+ if (likely(cid >= 0 && cid < num_possible_cpus()))
+ return true;
+ scx_error(sch, "invalid cid %d", cid);
+ return false;
+}
+
+/**
+ * __scx_cid_to_cpu - Unchecked cid->cpu table lookup
+ * @cid: cid to look up. Must be in [0, num_possible_cpus()).
+ *
+ * Intended for callsites that have already validated @cid and that hold a
+ * non-NULL @sch from scx_prog_sched() - a live sched implies the table has
+ * been allocated, so no NULL check is needed here.
+ */
+static inline s32 __scx_cid_to_cpu(s32 cid)
+{
+ /* READ_ONCE pairs with WRITE_ONCE in scx_cid_arrays_alloc() */
+ return READ_ONCE(scx_cid_to_cpu_tbl)[cid];
+}
+
+/**
+ * __scx_cpu_to_cid - Unchecked cpu->cid table lookup
+ * @cpu: cpu to look up. Must be a valid possible cpu id.
+ *
+ * Same usage constraints as __scx_cid_to_cpu().
+ */
+static inline s32 __scx_cpu_to_cid(s32 cpu)
+{
+ return READ_ONCE(scx_cpu_to_cid_tbl)[cpu];
+}
+
+/**
+ * scx_cid_to_cpu - Translate @cid to its cpu
+ * @sch: scx_sched for error reporting
+ * @cid: cid to look up
+ *
+ * Return the cpu for @cid or a negative errno on failure. Invalid cid triggers
+ * scx_error() on @sch. The cid arrays are allocated on first scheduler enable
+ * and never freed, so the returned cpu is stable for the lifetime of the loaded
+ * scheduler.
+ */
+static inline s32 scx_cid_to_cpu(struct scx_sched *sch, s32 cid)
+{
+ if (!cid_valid(sch, cid))
+ return -EINVAL;
+ return __scx_cid_to_cpu(cid);
+}
+
+/**
+ * scx_cpu_to_cid - Translate @cpu to its cid
+ * @sch: scx_sched for error reporting
+ * @cpu: cpu to look up
+ *
+ * Return the cid for @cpu or a negative errno on failure. Invalid cpu triggers
+ * scx_error() on @sch. Same lifetime guarantee as scx_cid_to_cpu().
+ */
+static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
+{
+ if (!scx_cpu_valid(sch, cpu, NULL))
+ return -EINVAL;
+ return __scx_cpu_to_cid(cpu);
+}
+
+#endif /* _KERNEL_SCHED_EXT_CID_H */
diff --git a/kernel/sched/ext_types.h b/kernel/sched/ext_types.h
index 19299ec3920e..be4d3565ae8d 100644
--- a/kernel/sched/ext_types.h
+++ b/kernel/sched/ext_types.h
@@ -40,4 +40,27 @@ enum scx_consts {
SCX_SUB_MAX_DEPTH = 4,
};
+/*
+ * Per-cid topology info. For each topology level (core, LLC, node), records
+ * the first cid in the unit and its global index. Global indices are
+ * consecutive integers assigned in cid-walk order, so e.g. core_idx ranges
+ * over [0, nr_cores_at_init) with no gaps. No-topo cids have all fields set
+ * to -1.
+ *
+ * @core_cid: first cid of this cid's core (smt-sibling group)
+ * @core_idx: global index of that core, in [0, nr_cores_at_init)
+ * @llc_cid: first cid of this cid's LLC
+ * @llc_idx: global index of that LLC, in [0, nr_llcs_at_init)
+ * @node_cid: first cid of this cid's NUMA node
+ * @node_idx: global index of that node, in [0, nr_nodes_at_init)
+ */
+struct scx_cid_topo {
+ s32 core_cid;
+ s32 core_idx;
+ s32 llc_cid;
+ s32 llc_idx;
+ s32 node_cid;
+ s32 node_idx;
+};
+
#endif /* _KERNEL_SCHED_EXT_TYPES_H */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 67b4b179b422..18f823d424cc 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -102,6 +102,9 @@ struct task_struct *scx_bpf_cpu_curr(s32 cpu) __ksym __weak;
struct task_struct *scx_bpf_tid_to_task(u64 tid) __ksym __weak;
u64 scx_bpf_now(void) __ksym __weak;
void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __weak;
+s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
+s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
+void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
/*
* Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
--
2.54.0
* [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (6 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 07/17] sched_ext: Add topological CPU IDs (cids) Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-29 14:07 ` Andrea Righi
2026-04-28 20:35 ` [PATCH 09/17] tools/sched_ext: Add struct_size() helpers to common.bpf.h Tejun Heo
` (10 subsequent siblings)
18 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
The auto-probed cid mapping reflects the kernel's view of topology
(node -> LLC -> core), but a BPF scheduler may want a different layout -
to align cid slices with its own partitioning, or to work around how the
kernel reports a particular machine.
Add scx_bpf_cid_override(), callable from ops.init() of the root
scheduler. It validates the caller-supplied cpu->cid array and replaces
the existing mapping in place; topo info is invalidated. A compat.bpf.h wrapper
silently no-ops on kernels that lack the kfunc.
A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the
kfunc to ops.init() at verifier load time.
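
As an illustration, a sketch (not part of this patch) of a root scheduler
installing a trivial identity mapping from ops.init(). MAX_CPUS, the
loader-filled nr_cpu_ids constant and the assumption that possible CPU ids
are densely numbered are all specific to this example; a real scheduler
would encode its own layout:

  #define MAX_CPUS	8192

  /* filled in by the userspace loader with the kernel's nr_cpu_ids */
  const volatile u32 nr_cpu_ids = 1;

  static s32 cpu_to_cid_map[MAX_CPUS];

  s32 BPF_STRUCT_OPS_SLEEPABLE(example_init)
  {
          u32 cpu;

          /* identity map: valid only if possible cpu ids are dense */
          bpf_for(cpu, 0, nr_cpu_ids) {
                  if (cpu >= MAX_CPUS)
                          break;
                  cpu_to_cid_map[cpu] = cpu;
          }

          /* the compat.bpf.h wrapper no-ops if the kernel lacks the kfunc */
          scx_bpf_cid_override(cpu_to_cid_map, nr_cpu_ids * sizeof(s32));
          return 0;
  }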
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 16 +++--
kernel/sched/ext_cid.c | 75 +++++++++++++++++++++++-
kernel/sched/ext_cid.h | 1 +
tools/sched_ext/include/scx/compat.bpf.h | 12 ++++
4 files changed, 97 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 2b531256c763..6f0b30fa970f 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9755,10 +9755,11 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = {
*/
enum scx_kf_allow_flags {
SCX_KF_ALLOW_UNLOCKED = 1 << 0,
- SCX_KF_ALLOW_CPU_RELEASE = 1 << 1,
- SCX_KF_ALLOW_DISPATCH = 1 << 2,
- SCX_KF_ALLOW_ENQUEUE = 1 << 3,
- SCX_KF_ALLOW_SELECT_CPU = 1 << 4,
+ SCX_KF_ALLOW_INIT = 1 << 1,
+ SCX_KF_ALLOW_CPU_RELEASE = 1 << 2,
+ SCX_KF_ALLOW_DISPATCH = 1 << 3,
+ SCX_KF_ALLOW_ENQUEUE = 1 << 4,
+ SCX_KF_ALLOW_SELECT_CPU = 1 << 5,
};
/*
@@ -9786,7 +9787,7 @@ static const u32 scx_kf_allow_flags[] = {
[SCX_OP_IDX(sub_detach)] = SCX_KF_ALLOW_UNLOCKED,
[SCX_OP_IDX(cpu_online)] = SCX_KF_ALLOW_UNLOCKED,
[SCX_OP_IDX(cpu_offline)] = SCX_KF_ALLOW_UNLOCKED,
- [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED,
+ [SCX_OP_IDX(init)] = SCX_KF_ALLOW_UNLOCKED | SCX_KF_ALLOW_INIT,
[SCX_OP_IDX(exit)] = SCX_KF_ALLOW_UNLOCKED,
};
@@ -9801,6 +9802,7 @@ static const u32 scx_kf_allow_flags[] = {
int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
{
bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id);
+ bool in_init = btf_id_set8_contains(&scx_kfunc_ids_init, kfunc_id);
bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id);
bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id);
bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id);
@@ -9810,7 +9812,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
u32 moff, flags;
/* Not an SCX kfunc - allow. */
- if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch ||
+ if (!(in_unlocked || in_init || in_select_cpu || in_enqueue || in_dispatch ||
in_cpu_release || in_idle || in_any))
return 0;
@@ -9846,6 +9848,8 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked)
return 0;
+ if ((flags & SCX_KF_ALLOW_INIT) && in_init)
+ return 0;
if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release)
return 0;
if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch)
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
index 5b73900edc87..607937d9e4d1 100644
--- a/kernel/sched/ext_cid.c
+++ b/kernel/sched/ext_cid.c
@@ -210,6 +210,68 @@ s32 scx_cid_init(struct scx_sched *sch)
__bpf_kfunc_start_defs();
+/**
+ * scx_bpf_cid_override - Install an explicit cpu->cid mapping
+ * @cpu_to_cid: array of nr_cpu_ids s32 entries (cid for each cpu)
+ * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * May only be called from ops.init() of the root scheduler. Replace the
+ * topology-probed cid mapping with the caller-provided one. Each possible cpu
+ * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
+ * On invalid input, trigger scx_error() to abort the scheduler.
+ */
+__bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+ const struct bpf_prog_aux *aux)
+{
+ cpumask_var_t seen __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+ struct scx_sched *sch;
+ bool alloced;
+ s32 cpu, cid;
+
+ /* GFP_KERNEL alloc must happen before the rcu read section */
+ alloced = zalloc_cpumask_var(&seen, GFP_KERNEL);
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return;
+
+ if (!alloced) {
+ scx_error(sch, "scx_bpf_cid_override: failed to allocate cpumask");
+ return;
+ }
+
+ if (scx_parent(sch)) {
+ scx_error(sch, "scx_bpf_cid_override() only allowed from root sched");
+ return;
+ }
+
+ if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
+ scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
+ nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
+ return;
+ }
+
+ for_each_possible_cpu(cpu) {
+ s32 c = cpu_to_cid[cpu];
+
+ if (!cid_valid(sch, c))
+ return;
+ if (cpumask_test_and_set_cpu(c, seen)) {
+ scx_error(sch, "cid %d assigned to multiple cpus", c);
+ return;
+ }
+ scx_cpu_to_cid_tbl[cpu] = c;
+ scx_cid_to_cpu_tbl[c] = cpu;
+ }
+
+ /* Invalidate stale topo info - the override carries no topology. */
+ for (cid = 0; cid < num_possible_cpus(); cid++)
+ scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+}
+
/**
* scx_bpf_cid_to_cpu - Return the raw CPU id for @cid
* @cid: cid to look up
@@ -282,6 +344,16 @@ __bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
__bpf_kfunc_end_defs();
+BTF_KFUNCS_START(scx_kfunc_ids_init)
+BTF_ID_FLAGS(func, scx_bpf_cid_override, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
+BTF_KFUNCS_END(scx_kfunc_ids_init)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_init = {
+ .owner = THIS_MODULE,
+ .set = &scx_kfunc_ids_init,
+ .filter = scx_kfunc_context_filter,
+};
+
BTF_KFUNCS_START(scx_kfunc_ids_cid)
BTF_ID_FLAGS(func, scx_bpf_cid_to_cpu, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpu_to_cid, KF_IMPLICIT_ARGS)
@@ -295,7 +367,8 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
int scx_cid_kfunc_init(void)
{
- return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init) ?:
+ register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
}
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index 1dbe8262ccdd..52edb66b53fd 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -49,6 +49,7 @@ struct scx_sched;
extern s16 *scx_cid_to_cpu_tbl;
extern s16 *scx_cpu_to_cid_tbl;
extern struct scx_cid_topo *scx_cid_topo;
+extern struct btf_id_set8 scx_kfunc_ids_init;
s32 scx_cid_init(struct scx_sched *sch);
int scx_cid_kfunc_init(void);
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 2808003eef04..6b9d054c3e4f 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -121,6 +121,18 @@ static inline bool scx_bpf_sub_dispatch(u64 cgroup_id)
return false;
}
+/*
+ * v7.2: scx_bpf_cid_override() for explicit cpu->cid mapping. Ignore if
+ * missing.
+ */
+void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz) __ksym __weak;
+
+static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
+{
+ if (bpf_ksym_exists(scx_bpf_cid_override___compat))
+ return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz);
+}
+
/**
* __COMPAT_is_enq_cpu_selected - Test if SCX_ENQ_CPU_SELECTED is on
* in a compatible way. We will preserve this __COMPAT helper until v6.16.
--
2.54.0
* Re: [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc
2026-04-28 20:35 ` [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
@ 2026-04-29 14:07 ` Andrea Righi
2026-04-29 17:06 ` Tejun Heo
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2026-04-29 14:07 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel, Cheng-Yang Chou
Hi Tejun,
On Tue, Apr 28, 2026 at 10:35:36AM -1000, Tejun Heo wrote:
> The auto-probed cid mapping reflects the kernel's view of topology
> (node -> LLC -> core), but a BPF scheduler may want a different layout -
> to align cid slices with its own partitioning, or to work around how the
> kernel reports a particular machine.
>
> Add scx_bpf_cid_override(), callable from ops.init() of the root
> scheduler. It validates the caller-supplied cpu->cid array and replaces
> the in-place mapping; topo info is invalidated. A compat.bpf.h wrapper
> silently no-ops on kernels that lack the kfunc.
>
> A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the
> kfunc to ops.init() at verifier load time.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
...
> +/**
> + * scx_bpf_cid_override - Install an explicit cpu->cid mapping
> + * @cpu_to_cid: array of nr_cpu_ids s32 entries (cid for each cpu)
> + * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
> + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
> + *
> + * May only be called from ops.init() of the root scheduler. Replace the
> + * topology-probed cid mapping with the caller-provided one. Each possible cpu
> + * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
> + * On invalid input, trigger scx_error() to abort the scheduler.
> + */
> +__bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
> + const struct bpf_prog_aux *aux)
> +{
> + cpumask_var_t seen __free(free_cpumask_var) = CPUMASK_VAR_NULL;
> + struct scx_sched *sch;
> + bool alloced;
> + s32 cpu, cid;
> +
> + /* GFP_KERNEL alloc must happen before the rcu read section */
> + alloced = zalloc_cpumask_var(&seen, GFP_KERNEL);
> +
> + guard(rcu)();
> +
> + sch = scx_prog_sched(aux);
> + if (unlikely(!sch))
> + return;
> +
> + if (!alloced) {
> + scx_error(sch, "scx_bpf_cid_override: failed to allocate cpumask");
> + return;
> + }
> +
> + if (scx_parent(sch)) {
> + scx_error(sch, "scx_bpf_cid_override() only allowed from root sched");
> + return;
> + }
> +
> + if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
> + scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
> + nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
> + return;
> + }
> +
> + for_each_possible_cpu(cpu) {
> + s32 c = cpu_to_cid[cpu];
> +
> + if (!cid_valid(sch, c))
> + return;
> + if (cpumask_test_and_set_cpu(c, seen)) {
> + scx_error(sch, "cid %d assigned to multiple cpus", c);
> + return;
> + }
> + scx_cpu_to_cid_tbl[cpu] = c;
> + scx_cid_to_cpu_tbl[c] = cpu;
> + }
> +
> + /* Invalidate stale topo info - the override carries no topology. */
> + for (cid = 0; cid < num_possible_cpus(); cid++)
> + scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
Considering that the topology info is wiped when scx_bpf_cid_override() is
used, should we error out (e.g. by setting a flag or similar) if a scheduler
also tries to use scx_bpf_cid_topo()?
Thanks,
-Andrea
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc
2026-04-29 14:07 ` Andrea Righi
@ 2026-04-29 17:06 ` Tejun Heo
2026-04-29 17:20 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2026-04-29 17:06 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel, Cheng-Yang Chou
Hello,
I don't think we need to gate it. The override clears scx_cid_topo[]
to SCX_CID_TOPO_NEG, so subsequent scx_bpf_cid_topo() lookups return
the well-defined "no topo" sentinel. The scheduler that overrode the
mapping has already opted out of the auto-probed topology.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc
2026-04-29 17:06 ` Tejun Heo
@ 2026-04-29 17:20 ` Andrea Righi
0 siblings, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2026-04-29 17:20 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel, Cheng-Yang Chou
On Wed, Apr 29, 2026 at 07:06:46AM -1000, Tejun Heo wrote:
> Hello,
>
> I don't think we need to gate it. The override clears scx_cid_topo[]
> to SCX_CID_TOPO_NEG, so subsequent scx_bpf_cid_topo() lookups return
> the well-defined "no topo" sentinel. The scheduler that overrode the
> mapping has already opted out of the auto-probed topology.
Ok, makes sense. In the end, if a scheduler overrides the topology and then
uses scx_bpf_cid_topo(), it's probably reasonable to return "no topology".
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 09/17] tools/sched_ext: Add struct_size() helpers to common.bpf.h
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (7 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 08/17] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
` (9 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Add flex_array_size(), struct_size() and struct_size_t() to
scx/common.bpf.h so BPF schedulers can size flex-array-containing
structs the same way kernel code does. These are abbreviated forms of
the <linux/overflow.h> macros.
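For example (struct and count are illustrative), sizing a flex-array struct:

	struct slice_stats {
		u32	nr_slots;
		u64	slots[];
	};

	/* offsetof(struct slice_stats, slots) + 8 * sizeof(u64) */
	u64 bytes = struct_size_t(struct slice_stats, slots, 8);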
v3: Use offsetof() instead of sizeof() in struct_size() to match kernel
semantics (no inflation from trailing struct padding). (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
tools/sched_ext/include/scx/common.bpf.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 18f823d424cc..087ae4f79c60 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -1043,6 +1043,16 @@ static inline u64 scx_clock_irq(u32 cpu)
return irqt ? BPF_CORE_READ(irqt, total) : 0;
}
+/* Abbreviated forms of <linux/overflow.h>'s struct_size() family. */
+#define flex_array_size(p, member, count) \
+ ((count) * sizeof(*(p)->member))
+
+#define struct_size(p, member, count) \
+ (offsetof(typeof(*(p)), member) + flex_array_size(p, member, count))
+
+#define struct_size_t(type, member, count) \
+ struct_size((type *)NULL, member, count)
+
#include "compat.bpf.h"
#include "enums.bpf.h"
--
2.54.0
^ permalink raw reply related	[flat|nested] 30+ messages in thread
* [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (8 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 09/17] tools/sched_ext: Add struct_size() helpers to common.bpf.h Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-29 12:47 ` Changwoo Min
2026-04-28 20:35 ` [PATCH 11/17] sched_ext: Add cid-form kfunc wrappers alongside cpu-form Tejun Heo
` (8 subsequent siblings)
18 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
most of its bits for a small window and is awkward in BPF.
scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
cmasks therefore address bits[] against the same global windows, so
cross-cmask word ops reduce to
dest->bits[i] OP= operand->bits[i - delta]
with no bit-shifting, at the cost of up to one extra storage word for
head misalignment. This alignment guarantee is the reason binary ops
can stay word-level; every mutating helper preserves it.
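For example (illustrative bases), a dest cmask at base 70 and an operand
cmask at base 6 both see cid 100 through the global window [64, 128):

	dest:    bits[100/64 - 70/64] = bits[0]
	operand: bits[100/64 -  6/64] = bits[1]

so the word op over that window is dest->bits[0] OP= operand->bits[1], with
cid 100 at bit position 100 & 63 = 36 in both words.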
Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/
cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and
adds the extra helpers that basic idle-cpu selection needs.
No callers yet.
v2: Narrow to helpers that will be used in the planned changes;
set/bit/find/zero ops will be added as usage develops.
v3: cmask_copy_from_kernel: validate src->base == 0 via probe-read;
bit-level nr_bits check instead of round-up word count. (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext_cid.h | 25 +
kernel/sched/ext_types.h | 38 ++
tools/sched_ext/include/scx/cid.bpf.h | 667 +++++++++++++++++++++++
tools/sched_ext/include/scx/common.bpf.h | 1 +
4 files changed, 731 insertions(+)
create mode 100644 tools/sched_ext/include/scx/cid.bpf.h
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index 52edb66b53fd..c3c429d2c8e2 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -127,4 +127,29 @@ static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
return __scx_cpu_to_cid(cpu);
}
+static inline bool __scx_cmask_contains(const struct scx_cmask *m, u32 cid)
+{
+ return likely(cid >= m->base && cid < m->base + m->nr_bits);
+}
+
+/* Word in bits[] covering @cid. @cid must satisfy __scx_cmask_contains(). */
+static inline u64 *__scx_cmask_word(const struct scx_cmask *m, u32 cid)
+{
+ return (u64 *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static inline void scx_cmask_init(struct scx_cmask *m, u32 base, u32 nr_bits)
+{
+ m->base = base;
+ m->nr_bits = nr_bits;
+ memset(m->bits, 0, SCX_CMASK_NR_WORDS(nr_bits) * sizeof(u64));
+}
+
+static inline void __scx_cmask_set(struct scx_cmask *m, u32 cid)
+{
+ if (!__scx_cmask_contains(m, cid))
+ return;
+ *__scx_cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
#endif /* _KERNEL_SCHED_EXT_CID_H */
diff --git a/kernel/sched/ext_types.h b/kernel/sched/ext_types.h
index be4d3565ae8d..ebb8cdf90612 100644
--- a/kernel/sched/ext_types.h
+++ b/kernel/sched/ext_types.h
@@ -63,4 +63,42 @@ struct scx_cid_topo {
s32 node_idx;
};
+/*
+ * cmask: variable-length, base-windowed bitmap over cid space
+ * -----------------------------------------------------------
+ *
+ * A cmask covers the cid range [base, base + nr_bits). bits[] is aligned to the
+ * global 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64), so the
+ * first (base & 63) bits of bits[0] are head padding and any tail past base +
+ * nr_bits is tail padding. Both must stay zero for the lifetime of the mask;
+ * all mutating helpers preserve that invariant.
+ *
+ * Grid alignment means two cmasks always address bits[] against the same global
+ * 64-cid windows, so cross-cmask word ops (AND, OR, ...) reduce to
+ *
+ * dst->bits[i] OP= src->bits[i - delta]
+ *
+ * with no bit-shifting, regardless of how the two bases relate mod 64.
+ */
+struct scx_cmask {
+ u32 base;
+ u32 nr_bits;
+ DECLARE_FLEX_ARRAY(u64, bits);
+};
+
+/*
+ * Number of u64 words of bits[] storage that covers @nr_bits regardless of base
+ * alignment. The +1 absorbs up to 63 bits of head padding when base is not
+ * 64-aligned - always allocating one extra word beats branching on base or
+ * splitting the compute.
+ */
+#define SCX_CMASK_NR_WORDS(nr_bits) (((nr_bits) + 63) / 64 + 1)
+
+/*
+ * Define an on-stack cmask for up to @cap_bits. @name is a struct scx_cmask *
+ * aliasing zero-initialized storage; call scx_cmask_init() to set base/nr_bits.
+ */
+#define SCX_CMASK_DEFINE(name, cap_bits) \
+ DEFINE_RAW_FLEX(struct scx_cmask, name, bits, SCX_CMASK_NR_WORDS(cap_bits))
+
#endif /* _KERNEL_SCHED_EXT_TYPES_H */
diff --git a/tools/sched_ext/include/scx/cid.bpf.h b/tools/sched_ext/include/scx/cid.bpf.h
new file mode 100644
index 000000000000..960108708eed
--- /dev/null
+++ b/tools/sched_ext/include/scx/cid.bpf.h
@@ -0,0 +1,667 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF-side helpers for cids and cmasks. See kernel/sched/ext_cid.h for the
+ * authoritative layout and semantics. The BPF-side helpers use the cmask_*
+ * naming (no scx_ prefix); cmask is the SCX bitmap type so the prefix is
+ * redundant in BPF code. Atomics use __sync_val_compare_and_swap and every
+ * helper is inline (no .c counterpart).
+ *
+ * Included by scx/common.bpf.h; don't include directly.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef __SCX_CID_BPF_H
+#define __SCX_CID_BPF_H
+
+#include "bpf_arena_common.bpf.h"
+
+#ifndef BIT_U64
+#define BIT_U64(nr) (1ULL << (nr))
+#endif
+#ifndef GENMASK_U64
+#define GENMASK_U64(h, l) ((~0ULL << (l)) & (~0ULL >> (63 - (h))))
+#endif
+
+/*
+ * Storage cap for bounded loops over bits[]. Sized to cover NR_CPUS=8192 with
+ * one extra word for head-misalignment. Increase if deployment targets larger
+ * NR_CPUS.
+ */
+#ifndef CMASK_MAX_WORDS
+#define CMASK_MAX_WORDS 129
+#endif
+
+#define CMASK_NR_WORDS(nr_bits) (((nr_bits) + 63) / 64 + 1)
+
+static __always_inline bool __cmask_contains(const struct scx_cmask __arena *m, u32 cid)
+{
+ return cid >= m->base && cid < m->base + m->nr_bits;
+}
+
+static __always_inline u64 __arena *__cmask_word(const struct scx_cmask __arena *m, u32 cid)
+{
+ return (u64 __arena *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static __always_inline void cmask_init(struct scx_cmask __arena *m, u32 base, u32 nr_bits)
+{
+ u32 nr_words = CMASK_NR_WORDS(nr_bits), i;
+
+ m->base = base;
+ m->nr_bits = nr_bits;
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ if (i >= nr_words)
+ break;
+ m->bits[i] = 0;
+ }
+}
+
+static __always_inline bool cmask_test(const struct scx_cmask __arena *m, u32 cid)
+{
+ if (!__cmask_contains(m, cid))
+ return false;
+ return *__cmask_word(m, cid) & BIT_U64(cid & 63);
+}
+
+/*
+ * x86 BPF JIT rejects BPF_OR | BPF_FETCH and BPF_AND | BPF_FETCH on arena
+ * pointers (see bpf_jit_supports_insn() in arch/x86/net/bpf_jit_comp.c). Only
+ * BPF_CMPXCHG / BPF_XCHG / BPF_ADD with FETCH are allowed. Implement
+ * test_and_{set,clear} and the atomic set/clear via a cmpxchg loop.
+ *
+ * CMASK_CAS_TRIES is far above what any non-pathological contention needs.
+ * Exhausting it means the bit update was lost, which corrupts the caller's view
+ * of the bitmap, so raise scx_bpf_error() to abort the scheduler.
+ */
+#define CMASK_CAS_TRIES 1024
+
+static __always_inline void cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 __arena *w;
+ u64 bit, old, new;
+ u32 i;
+
+ if (!__cmask_contains(m, cid))
+ return;
+ w = __cmask_word(m, cid);
+ bit = BIT_U64(cid & 63);
+ bpf_for(i, 0, CMASK_CAS_TRIES) {
+ old = *w;
+ if (old & bit)
+ return;
+ new = old | bit;
+ if (__sync_val_compare_and_swap(w, old, new) == old)
+ return;
+ }
+ scx_bpf_error("cmask_set CAS exhausted at cid %u", cid);
+}
+
+static __always_inline void cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 __arena *w;
+ u64 bit, old, new;
+ u32 i;
+
+ if (!__cmask_contains(m, cid))
+ return;
+ w = __cmask_word(m, cid);
+ bit = BIT_U64(cid & 63);
+ bpf_for(i, 0, CMASK_CAS_TRIES) {
+ old = *w;
+ if (!(old & bit))
+ return;
+ new = old & ~bit;
+ if (__sync_val_compare_and_swap(w, old, new) == old)
+ return;
+ }
+ scx_bpf_error("cmask_clear CAS exhausted at cid %u", cid);
+}
+
+static __always_inline bool cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 __arena *w;
+ u64 bit, old, new;
+ u32 i;
+
+ if (!__cmask_contains(m, cid))
+ return false;
+ w = __cmask_word(m, cid);
+ bit = BIT_U64(cid & 63);
+ bpf_for(i, 0, CMASK_CAS_TRIES) {
+ old = *w;
+ if (old & bit)
+ return true;
+ new = old | bit;
+ if (__sync_val_compare_and_swap(w, old, new) == old)
+ return false;
+ }
+ scx_bpf_error("cmask_test_and_set CAS exhausted at cid %u", cid);
+ return false;
+}
+
+static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 __arena *w;
+ u64 bit, old, new;
+ u32 i;
+
+ if (!__cmask_contains(m, cid))
+ return false;
+ w = __cmask_word(m, cid);
+ bit = BIT_U64(cid & 63);
+ bpf_for(i, 0, CMASK_CAS_TRIES) {
+ old = *w;
+ if (!(old & bit))
+ return false;
+ new = old & ~bit;
+ if (__sync_val_compare_and_swap(w, old, new) == old)
+ return true;
+ }
+ scx_bpf_error("cmask_test_and_clear CAS exhausted at cid %u", cid);
+ return false;
+}
+
+static __always_inline void __cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+ if (!__cmask_contains(m, cid))
+ return;
+ *__cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
+static __always_inline void __cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+ if (!__cmask_contains(m, cid))
+ return;
+ *__cmask_word(m, cid) &= ~BIT_U64(cid & 63);
+}
+
+static __always_inline bool __cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 bit = BIT_U64(cid & 63);
+ u64 __arena *w;
+ u64 prev;
+
+ if (!__cmask_contains(m, cid))
+ return false;
+ w = __cmask_word(m, cid);
+ prev = *w & bit;
+ *w |= bit;
+ return prev;
+}
+
+static __always_inline bool __cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+ u64 bit = BIT_U64(cid & 63);
+ u64 __arena *w;
+ u64 prev;
+
+ if (!__cmask_contains(m, cid))
+ return false;
+ w = __cmask_word(m, cid);
+ prev = *w & bit;
+ *w &= ~bit;
+ return prev;
+}
+
+static __always_inline void cmask_zero(struct scx_cmask __arena *m)
+{
+ u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ if (i >= nr_words)
+ break;
+ m->bits[i] = 0;
+ }
+}
+
+/*
+ * BPF_-prefixed to avoid colliding with the kernel's anonymous CMASK_OP_*
+ * enum in ext_cid.c, which is exported via BTF and reachable through
+ * vmlinux.h.
+ */
+enum {
+ BPF_CMASK_OP_AND,
+ BPF_CMASK_OP_OR,
+ BPF_CMASK_OP_COPY,
+ BPF_CMASK_OP_ANDNOT,
+};
+
+static __always_inline void cmask_op_word(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src,
+ u32 di, u32 si, u64 mask, int op)
+{
+ u64 dv = dst->bits[di];
+ u64 sv = src->bits[si];
+ u64 rv;
+
+ if (op == BPF_CMASK_OP_AND)
+ rv = dv & sv;
+ else if (op == BPF_CMASK_OP_OR)
+ rv = dv | sv;
+ else if (op == BPF_CMASK_OP_ANDNOT)
+ rv = dv & ~sv;
+ else
+ rv = sv;
+
+ dst->bits[di] = (dv & ~mask) | (rv & mask);
+}
+
+static __always_inline void cmask_op(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src, int op)
+{
+ u32 d_end = dst->base + dst->nr_bits;
+ u32 s_end = src->base + src->nr_bits;
+ u32 lo = dst->base > src->base ? dst->base : src->base;
+ u32 hi = d_end < s_end ? d_end : s_end;
+ u32 d_base = dst->base / 64;
+ u32 s_base = src->base / 64;
+ u32 lo_word, hi_word, i;
+ u64 head_mask, tail_mask;
+
+ if (lo >= hi)
+ return;
+
+ lo_word = lo / 64;
+ hi_word = (hi - 1) / 64;
+ head_mask = GENMASK_U64(63, lo & 63);
+ tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ u32 w = lo_word + i;
+ u64 m;
+
+ if (w > hi_word)
+ break;
+
+ m = GENMASK_U64(63, 0);
+ if (w == lo_word)
+ m &= head_mask;
+ if (w == hi_word)
+ m &= tail_mask;
+
+ cmask_op_word(dst, src, w - d_base, w - s_base, m, op);
+ }
+}
+
+/*
+ * cmask_and/or/copy only modify @dst bits that lie in the intersection of
+ * [@dst->base, @dst->base + @dst->nr_bits) and [@src->base,
+ * @src->base + @src->nr_bits). Bits in @dst outside that window
+ * keep their prior values - in particular, cmask_copy() does NOT zero @dst
+ * bits that lie outside @src's range.
+ */
+static __always_inline void cmask_and(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src)
+{
+ cmask_op(dst, src, BPF_CMASK_OP_AND);
+}
+
+static __always_inline void cmask_or(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src)
+{
+ cmask_op(dst, src, BPF_CMASK_OP_OR);
+}
+
+static __always_inline void cmask_copy(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src)
+{
+ cmask_op(dst, src, BPF_CMASK_OP_COPY);
+}
+
+static __always_inline void cmask_andnot(struct scx_cmask __arena *dst,
+ const struct scx_cmask __arena *src)
+{
+ cmask_op(dst, src, BPF_CMASK_OP_ANDNOT);
+}
+
+/*
+ * True iff @a and @b have identical bits over their (assumed equal) range.
+ * Callers are expected to pass same-shape cmasks; differing shapes always
+ * compare unequal.
+ */
+static __always_inline bool cmask_equal(const struct scx_cmask __arena *a,
+ const struct scx_cmask __arena *b)
+{
+ u32 nr_words, i;
+
+ if (a->base != b->base || a->nr_bits != b->nr_bits)
+ return false;
+ nr_words = CMASK_NR_WORDS(a->nr_bits);
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ if (i >= nr_words)
+ break;
+ if (a->bits[i] != b->bits[i])
+ return false;
+ }
+ return true;
+}
+
+/*
+ * True iff every bit set in @a is also set in @b over the intersection of
+ * their ranges. Bits of @a outside @b's range fail the test.
+ */
+static __always_inline bool cmask_subset(const struct scx_cmask __arena *a,
+ const struct scx_cmask __arena *b)
+{
+ u32 a_end = a->base + a->nr_bits;
+ u32 b_end = b->base + b->nr_bits;
+ u32 a_wbase = a->base / 64;
+ u32 b_wbase = b->base / 64;
+ u32 nr_words, i;
+
+ /* any bit of @a outside @b's range is a subset violation */
+ if (a->base < b->base || a_end > b_end)
+ return false;
+
+ nr_words = CMASK_NR_WORDS(a->nr_bits);
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ u32 wi_b;
+
+ if (i >= nr_words)
+ break;
+ wi_b = a_wbase + i - b_wbase;
+ if (a->bits[i] & ~b->bits[wi_b])
+ return false;
+ }
+ return true;
+}
+
+/**
+ * cmask_next_set - find the first set bit at or after @cid
+ * @m: cmask to search
+ * @cid: starting cid (clamped to @m->base if below)
+ *
+ * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
+ * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
+ * termination condition used by cmask_for_each()).
+ */
+static __always_inline u32 cmask_next_set(const struct scx_cmask __arena *m, u32 cid)
+{
+ u32 end = m->base + m->nr_bits;
+ u32 base = m->base / 64;
+ u32 last_wi = (end - 1) / 64 - base;
+ u32 start_wi, start_bit, i;
+
+ if (cid < m->base)
+ cid = m->base;
+ if (cid >= end)
+ return end;
+
+ start_wi = cid / 64 - base;
+ start_bit = cid & 63;
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ u32 wi = start_wi + i;
+ u64 word;
+ u32 found;
+
+ if (wi > last_wi)
+ break;
+
+ word = m->bits[wi];
+ if (i == 0)
+ word &= GENMASK_U64(63, start_bit);
+ if (!word)
+ continue;
+
+ found = (base + wi) * 64 + __builtin_ctzll(word);
+ if (found >= end)
+ return end;
+ return found;
+ }
+ return end;
+}
+
+static __always_inline u32 cmask_first_set(const struct scx_cmask __arena *m)
+{
+ return cmask_next_set(m, m->base);
+}
+
+#define cmask_for_each(cid, m) \
+ for ((cid) = cmask_first_set(m); \
+ (cid) < (m)->base + (m)->nr_bits; \
+ (cid) = cmask_next_set((m), (cid) + 1))
+
+/*
+ * Population count over [base, base + nr_bits). Padding bits in the head/tail
+ * words are guaranteed zero by the mutating helpers, so a flat popcount over
+ * all words is correct.
+ */
+static __always_inline u32 cmask_weight(const struct scx_cmask __arena *m)
+{
+ u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+ u32 count = 0;
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ if (i >= nr_words)
+ break;
+ count += __builtin_popcountll(m->bits[i]);
+ }
+ return count;
+}
+
+/*
+ * True if @a and @b share any set bit. Walk only the intersection of their
+ * ranges, matching the semantics of cmask_and().
+ */
+static __always_inline bool cmask_intersects(const struct scx_cmask __arena *a,
+ const struct scx_cmask __arena *b)
+{
+ u32 a_end = a->base + a->nr_bits;
+ u32 b_end = b->base + b->nr_bits;
+ u32 lo = a->base > b->base ? a->base : b->base;
+ u32 hi = a_end < b_end ? a_end : b_end;
+ u32 a_base = a->base / 64;
+ u32 b_base = b->base / 64;
+ u32 lo_word, hi_word, i;
+ u64 head_mask, tail_mask;
+
+ if (lo >= hi)
+ return false;
+
+ lo_word = lo / 64;
+ hi_word = (hi - 1) / 64;
+ head_mask = GENMASK_U64(63, lo & 63);
+ tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ u32 w = lo_word + i;
+ u64 mask, av, bv;
+
+ if (w > hi_word)
+ break;
+
+ mask = GENMASK_U64(63, 0);
+ if (w == lo_word)
+ mask &= head_mask;
+ if (w == hi_word)
+ mask &= tail_mask;
+
+ av = a->bits[w - a_base] & mask;
+ bv = b->bits[w - b_base] & mask;
+ if (av & bv)
+ return true;
+ }
+ return false;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, bounded by the
+ * intersection of the two ranges. Return a->base + a->nr_bits if none found.
+ *
+ * Building block for cmask_next_and_set_wrap(). Callers that want a bounded
+ * scan without wrap call this directly.
+ */
+static __always_inline u32 cmask_next_and_set(const struct scx_cmask __arena *a,
+ const struct scx_cmask __arena *b,
+ u32 start)
+{
+ u32 a_end = a->base + a->nr_bits;
+ u32 b_end = b->base + b->nr_bits;
+ u32 a_wbase = a->base / 64;
+ u32 b_wbase = b->base / 64;
+ u32 lo = a->base > b->base ? a->base : b->base;
+ u32 hi = a_end < b_end ? a_end : b_end;
+ u32 last_wi, start_wi, start_bit, i;
+
+ if (lo >= hi)
+ return a_end;
+ if (start < lo)
+ start = lo;
+ if (start >= hi)
+ return a_end;
+
+ last_wi = (hi - 1) / 64;
+ start_wi = start / 64;
+ start_bit = start & 63;
+
+ bpf_for(i, 0, CMASK_MAX_WORDS) {
+ u32 abs_wi = start_wi + i;
+ u64 word;
+ u32 found;
+
+ if (abs_wi > last_wi)
+ break;
+
+ word = a->bits[abs_wi - a_wbase] & b->bits[abs_wi - b_wbase];
+ if (i == 0)
+ word &= GENMASK_U64(63, start_bit);
+ if (!word)
+ continue;
+
+ found = abs_wi * 64 + __builtin_ctzll(word);
+ if (found >= hi)
+ return a_end;
+ return found;
+ }
+ return a_end;
+}
+
+/*
+ * Find the next set cid in @m at or after @start, wrapping to @m->base if no
+ * set bit is found in [start, m->base + m->nr_bits). Return m->base +
+ * m->nr_bits if @m is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_set_wrap(const struct scx_cmask __arena *m,
+ u32 start)
+{
+ u32 end = m->base + m->nr_bits;
+ u32 found;
+
+ found = cmask_next_set(m, start);
+ if (found < end || start <= m->base)
+ return found;
+
+ found = cmask_next_set(m, m->base);
+ return found < start ? found : end;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, wrapping to
+ * @a->base if none found in the forward half. Return a->base + a->nr_bits
+ * if the intersection is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_and_set_wrap(const struct scx_cmask __arena *a,
+ const struct scx_cmask __arena *b,
+ u32 start)
+{
+ u32 a_end = a->base + a->nr_bits;
+ u32 found;
+
+ found = cmask_next_and_set(a, b, start);
+ if (found < a_end || start <= a->base)
+ return found;
+
+ found = cmask_next_and_set(a, b, a->base);
+ return found < start ? found : a_end;
+}
+
+/**
+ * cmask_from_cpumask - translate a kernel cpumask to a cid-space cmask
+ * @m: cmask to fill. Zeroed first; only bits within [@m->base, @m->base +
+ * @m->nr_bits) are updated - cpus mapping to cids outside that range
+ * are ignored.
+ * @cpumask: kernel cpumask to translate
+ *
+ * For each cpu in @cpumask, set the cpu's cid in @m. Caller must ensure
+ * @cpumask stays stable across the call (e.g. RCU read lock for
+ * task->cpus_ptr).
+ */
+static __always_inline void cmask_from_cpumask(struct scx_cmask __arena *m,
+ const struct cpumask *cpumask)
+{
+ u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+ s32 cpu;
+
+ cmask_zero(m);
+ bpf_for(cpu, 0, nr_cpu_ids) {
+ s32 cid;
+
+ if (!bpf_cpumask_test_cpu(cpu, cpumask))
+ continue;
+ cid = scx_bpf_cpu_to_cid(cpu);
+ if (cid >= 0)
+ __cmask_set(m, cid);
+ }
+}
+
+/**
+ * cmask_copy_from_kernel - probe-read a kernel cmask into an arena cmask
+ * @dst: arena cmask to fill; must have @dst->base == 0 and be sized for @src.
+ * @src: kernel-memory cmask (e.g. ops.set_cmask() arg); @src->base must be 0.
+ *
+ * Word-for-word copy; @src and @dst must share base 0 alignment. Triggers
+ * scx_bpf_error() on probe failure or precondition violation.
+ */
+static __always_inline void cmask_copy_from_kernel(struct scx_cmask __arena *dst,
+ const struct scx_cmask *src)
+{
+ u32 base = 0, nr_bits = 0, nr_words, wi;
+
+ if (dst->base != 0) {
+ scx_bpf_error("cmask_copy_from_kernel requires dst->base == 0");
+ return;
+ }
+
+ if (bpf_probe_read_kernel(&base, sizeof(base), &src->base)) {
+ scx_bpf_error("probe-read cmask->base failed");
+ return;
+ }
+ if (base != 0) {
+ scx_bpf_error("cmask_copy_from_kernel requires src->base == 0");
+ return;
+ }
+
+ if (bpf_probe_read_kernel(&nr_bits, sizeof(nr_bits), &src->nr_bits)) {
+ scx_bpf_error("probe-read cmask->nr_bits failed");
+ return;
+ }
+
+ if (nr_bits > dst->nr_bits) {
+ scx_bpf_error("src cmask nr_bits=%u exceeds dst nr_bits=%u",
+ nr_bits, dst->nr_bits);
+ return;
+ }
+
+ nr_words = CMASK_NR_WORDS(nr_bits);
+ cmask_zero(dst);
+ bpf_for(wi, 0, CMASK_MAX_WORDS) {
+ u64 word = 0;
+ if (wi >= nr_words)
+ break;
+ if (bpf_probe_read_kernel(&word, sizeof(u64), &src->bits[wi])) {
+ scx_bpf_error("probe-read cmask->bits[%u] failed", wi);
+ return;
+ }
+ dst->bits[wi] = word;
+ }
+}
+
+#endif /* __SCX_CID_BPF_H */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 087ae4f79c60..ff57a7acdbeb 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -1055,5 +1055,6 @@ static inline u64 scx_clock_irq(u32 cpu)
#include "compat.bpf.h"
#include "enums.bpf.h"
+#include "cid.bpf.h"
#endif /* __SCX_COMMON_BPF_H */
--
2.54.0
^ permalink raw reply related	[flat|nested] 30+ messages in thread
* Re: [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space
2026-04-28 20:35 ` [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
@ 2026-04-29 12:47 ` Changwoo Min
2026-04-29 17:16 ` Tejun Heo
0 siblings, 1 reply; 30+ messages in thread
From: Changwoo Min @ 2026-04-29 12:47 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Andrea Righi
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Cheng-Yang Chou
On 4/29/26 5:35 AM, Tejun Heo wrote:
> +/*
> + * x86 BPF JIT rejects BPF_OR | BPF_FETCH and BPF_AND | BPF_FETCH on arena
> + * pointers (see bpf_jit_supports_insn() in arch/x86/net/bpf_jit_comp.c). Only
> + * BPF_CMPXCHG / BPF_XCHG / BPF_ADD with FETCH are allowed. Implement
> + * test_and_{set,clear} and the atomic set/clear via a cmpxchg loop.
> + *
> + * CMASK_CAS_TRIES is far above what any non-pathological contention needs.
> + * Exhausting it means the bit update was lost, which corrupts the caller's view
> + * of the bitmap, so raise scx_bpf_error() to abort the scheduler.
> + */
> +#define CMASK_CAS_TRIES 1024
> +
> +static __always_inline void cmask_set(struct scx_cmask __arena *m, u32 cid)
> +{
> + u64 __arena *w;
> + u64 bit, old, new;
> + u32 i;
> +
> + if (!__cmask_contains(m, cid))
> + return;
> + w = __cmask_word(m, cid);
> + bit = BIT_U64(cid & 63);
> + bpf_for(i, 0, CMASK_CAS_TRIES) {
> + old = *w;
> + if (old & bit)
> + return;
> + new = old | bit;
> + if (__sync_val_compare_and_swap(w, old, new) == old)
> + return;
> + }
> + scx_bpf_error("cmask_set CAS exhausted at cid %u", cid);
> +}
> +
> +static __always_inline void cmask_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> + u64 __arena *w;
> + u64 bit, old, new;
> + u32 i;
> +
> + if (!__cmask_contains(m, cid))
> + return;
> + w = __cmask_word(m, cid);
> + bit = BIT_U64(cid & 63);
> + bpf_for(i, 0, CMASK_CAS_TRIES) {
> + old = *w;
> + if (!(old & bit))
> + return;
> + new = old & ~bit;
> + if (__sync_val_compare_and_swap(w, old, new) == old)
> + return;
> + }
> + scx_bpf_error("cmask_clear CAS exhausted at cid %u", cid);
> +}
> +
> +static __always_inline bool cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
> +{
> + u64 __arena *w;
> + u64 bit, old, new;
> + u32 i;
> +
> + if (!__cmask_contains(m, cid))
> + return false;
> + w = __cmask_word(m, cid);
> + bit = BIT_U64(cid & 63);
> + bpf_for(i, 0, CMASK_CAS_TRIES) {
> + old = *w;
> + if (old & bit)
> + return true;
> + new = old | bit;
> + if (__sync_val_compare_and_swap(w, old, new) == old)
> + return false;
> + }
> + scx_bpf_error("cmask_test_and_set CAS exhausted at cid %u", cid);
> + return false;
> +}
> +
> +static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> + u64 __arena *w;
> + u64 bit, old, new;
> + u32 i;
> +
> + if (!__cmask_contains(m, cid))
> + return false;
> + w = __cmask_word(m, cid);
> + bit = BIT_U64(cid & 63);
> + bpf_for(i, 0, CMASK_CAS_TRIES) {
> + old = *w;
> + if (!(old & bit))
> + return false;
> + new = old & ~bit;
> + if (__sync_val_compare_and_swap(w, old, new) == old)
> + return true;
> + }
> + scx_bpf_error("cmask_test_and_clear CAS exhausted at cid %u", cid);
> + return false;
> +}
Exiting a BPF scheduler when the CAS retries are exhausted seems too brutal,
even though it should be extremely rare. What about adding a kfunc for the
slow path that runs only when the CAS retries fail? For example,
scx_bpf_cmask_test_and_clear() would do the same thing as
cmask_test_and_clear() but could never fail, and cmask_test_and_clear()
would call it only after exhausting the CAS retries. This way we keep the
fast path in the BPF implementation while ensuring the operation always
succeeds.
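Roughly what I have in mind (the kfunc is hypothetical, name and signature
are just for illustration):

	static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
	{
		u64 __arena *w;
		u64 bit, old, new;
		u32 i;

		if (!__cmask_contains(m, cid))
			return false;
		w = __cmask_word(m, cid);
		bit = BIT_U64(cid & 63);
		bpf_for(i, 0, CMASK_CAS_TRIES) {
			old = *w;
			if (!(old & bit))
				return false;
			new = old & ~bit;
			if (__sync_val_compare_and_swap(w, old, new) == old)
				return true;
		}
		/* CAS retries exhausted - fall back to the kernel-side kfunc that can't fail */
		return scx_bpf_cmask_test_and_clear(m, cid);
	}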
> +/**
> + * cmask_next_set - find the first set bit at or after @cid
> + * @m: cmask to search
> + * @cid: starting cid (clamped to @m->base if below)
> + *
> + * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
> + * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
> + * termination condition used by cmask_for_each()).
> + */
> +static __always_inline u32 cmask_next_set(const struct scx_cmask __arena *m, u32 cid)
> +{
> + u32 end = m->base + m->nr_bits;
> + u32 base = m->base / 64;
> + u32 last_wi = (end - 1) / 64 - base;
> + u32 start_wi, start_bit, i;
> +
> + if (cid < m->base)
> + cid = m->base;
> + if (cid >= end)
> + return end;
> +
> + start_wi = cid / 64 - base;
> + start_bit = cid & 63;
> +
> + bpf_for(i, 0, CMASK_MAX_WORDS) {
> + u32 wi = start_wi + i;
> + u64 word;
> + u32 found;
> +
> + if (wi > last_wi)
> + break;
> +
> + word = m->bits[wi];
> + if (i == 0)
> + word &= GENMASK_U64(63, start_bit);
> + if (!word)
> + continue;
> +
> + found = (base + wi) * 64 + __builtin_ctzll(word);
Some compiler versions (e.g., clang-18 or older) don’t support
__builtin_ctzll(). To handle this gracefully, there is already a
wrapper, ctzll(), in common.bpf.h. So, I suggest using ctzll()
for compatibility.
Reviewed-by: Changwoo Min <changwoo@igalia.com>
^ permalink raw reply	[flat|nested] 30+ messages in thread
* Re: [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space
2026-04-29 12:47 ` Changwoo Min
@ 2026-04-29 17:16 ` Tejun Heo
0 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-29 17:16 UTC (permalink / raw)
To: Changwoo Min
Cc: David Vernet, Andrea Righi, sched-ext, Emil Tsalapatis,
linux-kernel, Cheng-Yang Chou
Hello,
I'd rather raise the loop count than punt to a kernel slow-path. On
multi-socket Sapphire Rapids, banging on a shared cacheline with
kernel atomics hard enough can stall the machine to the point of
hard lockups. We don't know what BPF will do with these helpers and
they can pretty easily trigger similar conditions, so giving them
the ability to loop indefinitely in the kernel is exactly what we
want to avoid - we want to fail hard when this happens.
Bumping CMASK_CAS_TRIES to 1<<23 in v4 so abort fires only after
seconds of real spinning. As a follow-up, I want to add a kfunc to
let the BPF loops bail immediately when sch->aborting is set, so the
abort path doesn't keep banging the cacheline while the kernel is
tearing the scheduler down.
Switching to ctzll() too, thanks.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH 11/17] sched_ext: Add cid-form kfunc wrappers alongside cpu-form
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (9 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 10/17] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 12/17] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type Tejun Heo
` (7 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without full cid coverage a
scheduler has to mix cid and cpu forms, which is a subtle-bug factory.
Close the gap with a cid-native interface.
Pair every cpu-form kfunc that takes a cpu id with a cid-form
equivalent (kick, task placement, cpuperf query/set, per-cpu current
task, nr-cpu-ids). Add two cid-natives with no cpu-form sibling:
scx_bpf_this_cid() (cid of the running cpu, scx equivalent of
bpf_get_smp_processor_id) and scx_bpf_nr_online_cids().
scx_bpf_cpu_rq is deprecated; no cid-form counterpart. NUMA node info
is reachable via scx_bpf_cid_topo() on the BPF side.
Each cid-form wrapper is a thin cid -> cpu translation that delegates
to the cpu path, registered in the same context sets so usage
constraints match.
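BPF-side usage then reads naturally in cid space; e.g. (illustrative, error
handling elided):

	s32 cid = scx_bpf_this_cid();
	u32 nr_online = scx_bpf_nr_online_cids();

	/* standard hotplug model: [0, nr_online) is the online range */
	scx_bpf_kick_cid((cid + 1) % nr_online, SCX_KICK_IDLE);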
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 184 +++++++++++++++++++++++
tools/sched_ext/include/scx/common.bpf.h | 9 ++
2 files changed, 193 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 6f0b30fa970f..3f77e83044a1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8864,6 +8864,28 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
scx_kick_cpu(sch, cpu, flags);
}
+/**
+ * scx_bpf_kick_cid - Trigger reschedule on the CPU mapped to @cid
+ * @cid: cid to kick
+ * @flags: %SCX_KICK_* flags
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_kick_cpu().
+ */
+__bpf_kfunc void scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ s32 cpu;
+
+ guard(rcu)();
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return;
+ cpu = scx_cid_to_cpu(sch, cid);
+ if (cpu >= 0)
+ scx_kick_cpu(sch, cpu, flags);
+}
+
/**
* scx_bpf_dsq_nr_queued - Return the number of queued tasks
* @dsq_id: id of the DSQ
@@ -9287,6 +9309,29 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux)
return SCX_CPUPERF_ONE;
}
+/**
+ * scx_bpf_cidperf_cap - Query the maximum relative capacity of the CPU at @cid
+ * @cid: cid of the CPU to query
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_cap().
+ */
+__bpf_kfunc u32 scx_bpf_cidperf_cap(s32 cid, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ s32 cpu;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return SCX_CPUPERF_ONE;
+ cpu = scx_cid_to_cpu(sch, cid);
+ if (cpu < 0)
+ return SCX_CPUPERF_ONE;
+ return arch_scale_cpu_capacity(cpu);
+}
+
/**
* scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
* @cpu: CPU of interest
@@ -9315,6 +9360,29 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux)
return SCX_CPUPERF_ONE;
}
+/**
+ * scx_bpf_cidperf_cur - Query the current performance of the CPU at @cid
+ * @cid: cid of the CPU to query
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cidperf_cur(s32 cid, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ s32 cpu;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return SCX_CPUPERF_ONE;
+ cpu = scx_cid_to_cpu(sch, cid);
+ if (cpu < 0)
+ return SCX_CPUPERF_ONE;
+ return arch_scale_freq_capacity(cpu);
+}
+
/**
* scx_bpf_cpuperf_set - Set the relative performance target of a CPU
* @cpu: CPU of interest
@@ -9375,6 +9443,31 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_au
}
}
+/**
+ * scx_bpf_cidperf_set - Set the performance target of the CPU at @cid
+ * @cid: cid of the CPU to target
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_set().
+ */
+__bpf_kfunc void scx_bpf_cidperf_set(s32 cid, u32 perf,
+ const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ s32 cpu;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return;
+ cpu = scx_cid_to_cpu(sch, cid);
+ if (cpu < 0)
+ return;
+ scx_bpf_cpuperf_set(cpu, perf, aux);
+}
+
/**
* scx_bpf_nr_node_ids - Return the number of possible node IDs
*
@@ -9395,6 +9488,47 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
return nr_cpu_ids;
}
+/**
+ * scx_bpf_nr_cids - Return the size of the cid space
+ *
+ * Equals num_possible_cpus(). All valid cids are in [0, return value).
+ */
+__bpf_kfunc u32 scx_bpf_nr_cids(void)
+{
+ return num_possible_cpus();
+}
+
+/**
+ * scx_bpf_nr_online_cids - Return current count of online CPUs in cid space
+ *
+ * Return num_online_cpus(). The standard model restarts the scheduler on
+ * hotplug, which lets schedulers treat [0, nr_online_cids) as the online
+ * range. Schedulers that prefer to handle hotplug without a restart should
+ * install a custom mapping via scx_bpf_cid_override() and track onlining
+ * through the ops.cid_online / ops.cid_offline callbacks.
+ */
+__bpf_kfunc u32 scx_bpf_nr_online_cids(void)
+{
+ return num_online_cpus();
+}
+
+/**
+ * scx_bpf_this_cid - Return the cid of the CPU this program is running on
+ *
+ * cid-addressed equivalent of bpf_get_smp_processor_id() for scx programs.
+ * The current cpu is trivially valid, so this is just a table lookup. Return
+ * -EINVAL if called from a non-SCX program before any scheduler has ever
+ * been enabled (the cid table is still unallocated at that point).
+ */
+__bpf_kfunc s32 scx_bpf_this_cid(void)
+{
+ s16 *tbl = READ_ONCE(scx_cpu_to_cid_tbl);
+
+ if (!tbl)
+ return -EINVAL;
+ return tbl[raw_smp_processor_id()];
+}
+
/**
* scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
*/
@@ -9443,6 +9577,23 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p)
return task_cpu(p);
}
+/**
+ * scx_bpf_task_cid - cid a task is currently associated with
+ * @p: task of interest
+ *
+ * cid-addressed equivalent of scx_bpf_task_cpu(). task_cpu(p) is always a
+ * valid cpu, so this is just a table lookup. Return -EINVAL if called from
+ * a non-SCX program before any scheduler has ever been enabled.
+ */
+__bpf_kfunc s32 scx_bpf_task_cid(const struct task_struct *p)
+{
+ s16 *tbl = READ_ONCE(scx_cpu_to_cid_tbl);
+
+ if (!tbl)
+ return -EINVAL;
+ return tbl[task_cpu(p)];
+}
+
/**
* scx_bpf_cpu_rq - Fetch the rq of a CPU
* @cpu: CPU of the rq
@@ -9521,6 +9672,30 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
return rcu_dereference(cpu_rq(cpu)->curr);
}
+/**
+ * scx_bpf_cid_curr - Return the curr task on the CPU at @cid
+ * @cid: cid of interest
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpu_curr(). Callers must hold RCU
+ * read lock (KF_RCU).
+ */
+__bpf_kfunc struct task_struct *scx_bpf_cid_curr(s32 cid, const struct bpf_prog_aux *aux)
+{
+ struct scx_sched *sch;
+ s32 cpu;
+
+ guard(rcu)();
+
+ sch = scx_prog_sched(aux);
+ if (unlikely(!sch))
+ return NULL;
+ cpu = scx_cid_to_cpu(sch, cid);
+ if (cpu < 0)
+ return NULL;
+ return rcu_dereference(cpu_rq(cpu)->curr);
+}
+
/**
* scx_bpf_tid_to_task - Look up a task by its scx tid
* @tid: task ID previously read from p->scx.tid
@@ -9708,6 +9883,7 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU);
BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU);
BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_kick_cid, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_destroy_dsq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL)
@@ -9722,16 +9898,24 @@ BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_cap, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_cur, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_set, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_nr_node_ids)
BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
+BTF_ID_FLAGS(func, scx_bpf_nr_cids)
+BTF_ID_FLAGS(func, scx_bpf_nr_online_cids)
+BTF_ID_FLAGS(func, scx_bpf_this_cid)
BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_cid, KF_RCU)
BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, scx_bpf_cid_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
BTF_ID_FLAGS(func, scx_bpf_tid_to_task, KF_RET_NULL | KF_RCU_PROTECTED)
BTF_ID_FLAGS(func, scx_bpf_now)
BTF_ID_FLAGS(func, scx_bpf_events)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index ff57a7acdbeb..5f715d69cde6 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -105,6 +105,15 @@ void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __
s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
+void scx_bpf_kick_cid(s32 cid, u64 flags) __ksym __weak;
+s32 scx_bpf_task_cid(const struct task_struct *p) __ksym __weak;
+s32 scx_bpf_this_cid(void) __ksym __weak;
+struct task_struct *scx_bpf_cid_curr(s32 cid) __ksym __weak;
+u32 scx_bpf_nr_cids(void) __ksym __weak;
+u32 scx_bpf_nr_online_cids(void) __ksym __weak;
+u32 scx_bpf_cidperf_cap(s32 cid) __ksym __weak;
+u32 scx_bpf_cidperf_cur(s32 cid) __ksym __weak;
+void scx_bpf_cidperf_set(s32 cid, u32 perf) __ksym __weak;
/*
* Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
--
2.54.0
^ permalink raw reply related	[flat|nested] 30+ messages in thread
* [PATCH 12/17] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (10 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 11/17] sched_ext: Add cid-form kfunc wrappers alongside cpu-form Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 13/17] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers Tejun Heo
` (6 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without a full cid interface,
schedulers end up mixing forms - a subtle-bug factory.
Add sched_ext_ops_cid, which mirrors sched_ext_ops with cid/cmask
replacing cpu/cpumask in the topology-carrying callbacks.
cpu_acquire/cpu_release are deprecated and absent; a prior patch
moved them past @priv so the cid-form can omit them without
disturbing shared-field offsets.
The two structs share byte-identical layout up to @priv, so the
existing bpf_scx init/check hooks, has_op bitmap, and
scx_kf_allow_flags[] are offset-indexed and apply to both.
BUILD_BUG_ON in scx_init() pins the shared-field and renamed-callback
offsets so any future drift trips at boot.
The kernel<->BPF boundary translates between cpu and cid:
- A static key, enabled on cid-form sched load, gates the translation
so cpu-form schedulers pay nothing.
- dispatch, update_idle, cpu_online/offline and dump_cpu translate
the cpu arg at the callsite.
- select_cpu also translates the returned cid back to a cpu.
- set_cpumask is wrapped to synthesize a cmask in a per-cpu scratch
before calling the cid-form callback.
All scheds in a hierarchy share one form. The static key drives the
hot-path branch.
v2: Use struct_size() for the set_cmask_scratch percpu alloc. Move
cid-shard fields and assertions into the later cid-shard patch.
v3: Drop `static` on scx_set_cmask_scratch; add extern in ext_internal.h.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 281 +++++++++++++++++++++--
kernel/sched/ext_cid.c | 37 ++-
kernel/sched/ext_cid.h | 9 +
kernel/sched/ext_idle.c | 2 +-
kernel/sched/ext_internal.h | 109 ++++++++-
tools/sched_ext/include/scx/compat.bpf.h | 12 +
6 files changed, 429 insertions(+), 21 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 3f77e83044a1..8e3c60affc0b 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -513,6 +513,33 @@ do { \
update_locked_rq(__prev_locked_rq); \
} while (0)
+/*
+ * Flipped on enable per sch->is_cid_type. Declared in ext_internal.h so
+ * subsystem inlines can read it.
+ */
+DEFINE_STATIC_KEY_FALSE(__scx_is_cid_type);
+
+/*
+ * scx_cpu_arg() wraps a cpu arg being handed to an SCX op. For cid-form
+ * schedulers it resolves to the matching cid; for cpu-form it passes @cpu
+ * through. scx_cpu_ret() is the inverse for a cpu/cid returned from an op
+ * (currently only ops.select_cpu); it validates the BPF-supplied cid and
+ * triggers scx_error() on @sch if invalid.
+ */
+static s32 scx_cpu_arg(s32 cpu)
+{
+ if (scx_is_cid_type())
+ return __scx_cpu_to_cid(cpu);
+ return cpu;
+}
+
+static s32 scx_cpu_ret(struct scx_sched *sch, s32 cpu_or_cid)
+{
+ if (cpu_or_cid < 0 || !scx_is_cid_type())
+ return cpu_or_cid;
+ return scx_cid_to_cpu(sch, cpu_or_cid);
+}
+
#define SCX_CALL_OP_RET(sch, op, locked_rq, args...) \
({ \
struct rq *__prev_locked_rq; \
@@ -574,6 +601,41 @@ do { \
__ret; \
})
+/**
+ * scx_call_op_set_cpumask - invoke ops.set_cpumask / ops_cid.set_cmask for @task
+ * @sch: scx_sched being invoked
+ * @rq: rq to update as the currently-locked rq, or NULL
+ * @task: task whose affinity is changing
+ * @cpumask: new cpumask
+ *
+ * For cid-form schedulers, translate @cpumask to a cmask via the per-cpu
+ * scratch in ext_cid.c and dispatch through the ops_cid union view. Caller
+ * must hold @rq's rq lock so this_cpu_ptr is stable across the call.
+ */
+static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
+ struct task_struct *task,
+ const struct cpumask *cpumask)
+{
+ WARN_ON_ONCE(current->scx.kf_tasks[0]);
+ current->scx.kf_tasks[0] = task;
+ if (rq)
+ update_locked_rq(rq);
+
+ if (scx_is_cid_type()) {
+ struct scx_cmask *cmask = this_cpu_ptr(scx_set_cmask_scratch);
+
+ lockdep_assert_irqs_disabled();
+ scx_cpumask_to_cmask(cpumask, cmask);
+ sch->ops_cid.set_cmask(task, cmask);
+ } else {
+ sch->ops.set_cpumask(task, cpumask);
+ }
+
+ if (rq)
+ update_locked_rq(NULL);
+ current->scx.kf_tasks[0] = NULL;
+}
+
/* see SCX_CALL_OP_TASK() */
static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch,
struct task_struct *p)
@@ -1679,7 +1741,7 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
return &rq->scx.local_dsq;
if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ s32 cpu = scx_cpu_ret(sch, dsq_id & SCX_DSQ_LOCAL_CPU_MASK);
if (!scx_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
return find_global_dsq(sch, tcpu);
@@ -2761,11 +2823,13 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
dspc->nr_tasks = 0;
if (nested) {
- SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL);
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
} else {
/* stash @prev so that nested invocations can access it */
rq->scx.sub_dispatch_prev = prev;
- SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL);
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
rq->scx.sub_dispatch_prev = NULL;
}
@@ -3260,7 +3324,9 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
*ddsp_taskp = p;
this_rq()->scx.in_select_cpu = true;
- cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags);
+ cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p,
+ scx_cpu_arg(prev_cpu), wake_flags);
+ cpu = scx_cpu_ret(sch, cpu);
this_rq()->scx.in_select_cpu = false;
p->scx.selected_cpu = cpu;
*ddsp_taskp = NULL;
@@ -3310,7 +3376,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
* designation pointless. Cast it away when calling the operation.
*/
if (SCX_HAS_OP(sch, set_cpumask))
- SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr);
+ scx_call_op_set_cpumask(sch, task_rq(p), p, (struct cpumask *)p->cpus_ptr);
}
static void handle_hotplug(struct rq *rq, bool online)
@@ -3332,9 +3398,9 @@ static void handle_hotplug(struct rq *rq, bool online)
scx_idle_update_selcpu_topology(&sch->ops);
if (online && SCX_HAS_OP(sch, cpu_online))
- SCX_CALL_OP(sch, cpu_online, NULL, cpu);
+ SCX_CALL_OP(sch, cpu_online, NULL, scx_cpu_arg(cpu));
else if (!online && SCX_HAS_OP(sch, cpu_offline))
- SCX_CALL_OP(sch, cpu_offline, NULL, cpu);
+ SCX_CALL_OP(sch, cpu_offline, NULL, scx_cpu_arg(cpu));
else
scx_exit(sch, SCX_EXIT_UNREG_KERN,
SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
@@ -3919,7 +3985,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
* different scheduler class. Keep the BPF scheduler up-to-date.
*/
if (SCX_HAS_OP(sch, set_cpumask))
- SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr);
+ scx_call_op_set_cpumask(sch, rq, p, (struct cpumask *)p->cpus_ptr);
}
static void switched_from_scx(struct rq *rq, struct task_struct *p)
@@ -5945,6 +6011,7 @@ static void scx_root_disable(struct scx_sched *sch)
/* no task is on scx, turn off all the switches and flush in-progress calls */
static_branch_disable(&__scx_enabled);
+ static_branch_disable(&__scx_is_cid_type);
if (sch->ops.flags & SCX_OPS_TID_TO_TASK)
static_branch_disable(&__scx_tid_to_task_enabled);
bitmap_zero(sch->has_op, SCX_OPI_END);
@@ -6360,7 +6427,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
used = seq_buf_used(&ns);
if (SCX_HAS_OP(sch, dump_cpu)) {
ops_dump_init(&ns, " ");
- SCX_CALL_OP(sch, dump_cpu, rq, &dctx, cpu, idle);
+ SCX_CALL_OP(sch, dump_cpu, rq, &dctx, scx_cpu_arg(cpu), idle);
ops_dump_exit();
}
@@ -6516,7 +6583,11 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
*/
struct scx_enable_cmd {
struct kthread_work work;
- struct sched_ext_ops *ops;
+ union {
+ struct sched_ext_ops *ops;
+ struct sched_ext_ops_cid *ops_cid;
+ };
+ bool is_cid_type;
int ret;
};
@@ -6524,10 +6595,11 @@ struct scx_enable_cmd {
* Allocate and initialize a new scx_sched. @cgrp's reference is always
* consumed whether the function succeeds or fails.
*/
-static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
+static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
struct cgroup *cgrp,
struct scx_sched *parent)
{
+ struct sched_ext_ops *ops = cmd->ops;
struct scx_sched *sch;
s32 level = parent ? parent->level + 1 : 0;
s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids;
@@ -6619,7 +6691,18 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
ret = -ENOMEM;
goto err_free_lb_cpumask;
}
- sch->ops = *ops;
+ /*
+ * Copy ops through the right union view. For cid-form the source is
+ * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/
+ * cpu_release; those stay zero from kzalloc.
+ */
+ if (cmd->is_cid_type) {
+ sch->ops_cid = *cmd->ops_cid;
+ sch->is_cid_type = true;
+ } else {
+ sch->ops = *cmd->ops;
+ }
+
rcu_assign_pointer(ops->priv, sch);
sch->kobj.kset = scx_kset;
@@ -6756,7 +6839,12 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
return -EINVAL;
}
- if (ops->cpu_acquire || ops->cpu_release)
+ /*
+ * cid-form's struct is shorter and doesn't include the cpu_acquire /
+ * cpu_release tail; reading those fields off a cid-form @ops would
+ * run past the BPF allocation. Skip for cid-form.
+ */
+ if (!sch->is_cid_type && (ops->cpu_acquire || ops->cpu_release))
pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
return 0;
@@ -6792,12 +6880,15 @@ static void scx_root_enable_workfn(struct kthread_work *work)
#ifdef CONFIG_EXT_SUB_SCHED
cgroup_get(cgrp);
#endif
- sch = scx_alloc_and_add_sched(ops, cgrp, NULL);
+ sch = scx_alloc_and_add_sched(cmd, cgrp, NULL);
if (IS_ERR(sch)) {
ret = PTR_ERR(sch);
goto err_free_tid_hash;
}
+ if (sch->is_cid_type)
+ static_branch_enable(&__scx_is_cid_type);
+
/*
* Transition to ENABLING and clear exit info to arm the disable path.
* Failure triggers full disabling from here on.
@@ -7119,7 +7210,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
raw_spin_unlock_irq(&scx_sched_lock);
/* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
- sch = scx_alloc_and_add_sched(ops, cgrp, parent);
+ sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
kobject_put(&parent->kobj);
if (IS_ERR(sch)) {
ret = PTR_ERR(sch);
@@ -7570,6 +7661,13 @@ static int bpf_scx_reg(void *kdata, struct bpf_link *link)
return scx_enable(&cmd, link);
}
+static int bpf_scx_reg_cid(void *kdata, struct bpf_link *link)
+{
+ struct scx_enable_cmd cmd = { .ops_cid = kdata, .is_cid_type = true };
+
+ return scx_enable(&cmd, link);
+}
+
static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
{
struct sched_ext_ops *ops = kdata;
@@ -7701,6 +7799,73 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
.cfi_stubs = &__bpf_ops_sched_ext_ops
};
+/*
+ * cid-form cfi stubs. Stubs whose signatures match the cpu-form (param types
+ * identical, only param names differ across structs) are reused; only
+ * set_cmask needs a fresh stub since the second argument type differs.
+ */
+static void sched_ext_ops_cid__set_cmask(struct task_struct *p,
+ const struct scx_cmask *cmask) {}
+
+static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
+ .select_cid = sched_ext_ops__select_cpu,
+ .enqueue = sched_ext_ops__enqueue,
+ .dequeue = sched_ext_ops__dequeue,
+ .dispatch = sched_ext_ops__dispatch,
+ .tick = sched_ext_ops__tick,
+ .runnable = sched_ext_ops__runnable,
+ .running = sched_ext_ops__running,
+ .stopping = sched_ext_ops__stopping,
+ .quiescent = sched_ext_ops__quiescent,
+ .yield = sched_ext_ops__yield,
+ .core_sched_before = sched_ext_ops__core_sched_before,
+ .set_weight = sched_ext_ops__set_weight,
+ .set_cmask = sched_ext_ops_cid__set_cmask,
+ .update_idle = sched_ext_ops__update_idle,
+ .init_task = sched_ext_ops__init_task,
+ .exit_task = sched_ext_ops__exit_task,
+ .enable = sched_ext_ops__enable,
+ .disable = sched_ext_ops__disable,
+#ifdef CONFIG_EXT_GROUP_SCHED
+ .cgroup_init = sched_ext_ops__cgroup_init,
+ .cgroup_exit = sched_ext_ops__cgroup_exit,
+ .cgroup_prep_move = sched_ext_ops__cgroup_prep_move,
+ .cgroup_move = sched_ext_ops__cgroup_move,
+ .cgroup_cancel_move = sched_ext_ops__cgroup_cancel_move,
+ .cgroup_set_weight = sched_ext_ops__cgroup_set_weight,
+ .cgroup_set_bandwidth = sched_ext_ops__cgroup_set_bandwidth,
+ .cgroup_set_idle = sched_ext_ops__cgroup_set_idle,
+#endif
+ .sub_attach = sched_ext_ops__sub_attach,
+ .sub_detach = sched_ext_ops__sub_detach,
+ .cid_online = sched_ext_ops__cpu_online,
+ .cid_offline = sched_ext_ops__cpu_offline,
+ .init = sched_ext_ops__init,
+ .exit = sched_ext_ops__exit,
+ .dump = sched_ext_ops__dump,
+ .dump_cid = sched_ext_ops__dump_cpu,
+ .dump_task = sched_ext_ops__dump_task,
+};
+
+/*
+ * The cid-form struct_ops shares all bpf_struct_ops hooks with the cpu form.
+ * init_member, check_member, reg, unreg, etc. process kdata as the byte block
+ * verified to match by the BUILD_BUG_ON checks in scx_init().
+ */
+static struct bpf_struct_ops bpf_sched_ext_ops_cid = {
+ .verifier_ops = &bpf_scx_verifier_ops,
+ .reg = bpf_scx_reg_cid,
+ .unreg = bpf_scx_unreg,
+ .check_member = bpf_scx_check_member,
+ .init_member = bpf_scx_init_member,
+ .init = bpf_scx_init,
+ .update = bpf_scx_update,
+ .validate = bpf_scx_validate,
+ .name = "sched_ext_ops_cid",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &__bpf_ops_sched_ext_ops_cid
+};
+
/********************************************************************************
* System integration and init.
@@ -8912,7 +9077,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id, const struct bpf_prog_aux *aux
ret = READ_ONCE(this_rq()->scx.local_dsq.nr);
goto out;
} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
- s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+ s32 cpu = scx_cpu_ret(sch, dsq_id & SCX_DSQ_LOCAL_CPU_MASK);
if (scx_cpu_valid(sch, cpu, NULL)) {
ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr);
@@ -10019,8 +10184,15 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
/*
* Non-SCX struct_ops: SCX kfuncs are not permitted.
- */
- if (prog->aux->st_ops != &bpf_sched_ext_ops)
+ *
+ * Both bpf_sched_ext_ops (cpu-form) and bpf_sched_ext_ops_cid
+ * (cid-form) are valid SCX struct_ops. Member offsets match between
+ * the two (verified by BUILD_BUG_ON in scx_init()), so the shared
+ * scx_kf_allow_flags[] table indexed by SCX_MOFF_IDX(moff) applies to
+ * both.
+ */
+ if (prog->aux->st_ops != &bpf_sched_ext_ops &&
+ prog->aux->st_ops != &bpf_sched_ext_ops_cid)
return -EACCES;
/* SCX struct_ops: check the per-op allow list. */
@@ -10050,6 +10222,73 @@ static int __init scx_init(void)
{
int ret;
+ /*
+ * sched_ext_ops_cid mirrors sched_ext_ops up to and including @priv.
+ * Both bpf_scx_init_member() and bpf_scx_check_member() use offsets
+ * from struct sched_ext_ops; sched_ext_ops_cid relies on those offsets
+ * matching for the shared fields. Catch any drift at boot.
+ */
+#define CID_OFFSET_MATCH(cpu_field, cid_field) \
+ BUILD_BUG_ON(offsetof(struct sched_ext_ops, cpu_field) != \
+ offsetof(struct sched_ext_ops_cid, cid_field))
+ /* data fields used by bpf_scx_init_member() */
+ CID_OFFSET_MATCH(dispatch_max_batch, dispatch_max_batch);
+ CID_OFFSET_MATCH(flags, flags);
+ CID_OFFSET_MATCH(name, name);
+ CID_OFFSET_MATCH(timeout_ms, timeout_ms);
+ CID_OFFSET_MATCH(exit_dump_len, exit_dump_len);
+ CID_OFFSET_MATCH(hotplug_seq, hotplug_seq);
+ CID_OFFSET_MATCH(sub_cgroup_id, sub_cgroup_id);
+ /* shared callbacks: the union view requires byte-for-byte offset match */
+ CID_OFFSET_MATCH(enqueue, enqueue);
+ CID_OFFSET_MATCH(dequeue, dequeue);
+ CID_OFFSET_MATCH(dispatch, dispatch);
+ CID_OFFSET_MATCH(tick, tick);
+ CID_OFFSET_MATCH(runnable, runnable);
+ CID_OFFSET_MATCH(running, running);
+ CID_OFFSET_MATCH(stopping, stopping);
+ CID_OFFSET_MATCH(quiescent, quiescent);
+ CID_OFFSET_MATCH(yield, yield);
+ CID_OFFSET_MATCH(core_sched_before, core_sched_before);
+ CID_OFFSET_MATCH(set_weight, set_weight);
+ CID_OFFSET_MATCH(update_idle, update_idle);
+ CID_OFFSET_MATCH(init_task, init_task);
+ CID_OFFSET_MATCH(exit_task, exit_task);
+ CID_OFFSET_MATCH(enable, enable);
+ CID_OFFSET_MATCH(disable, disable);
+ CID_OFFSET_MATCH(dump, dump);
+ CID_OFFSET_MATCH(dump_task, dump_task);
+ CID_OFFSET_MATCH(sub_attach, sub_attach);
+ CID_OFFSET_MATCH(sub_detach, sub_detach);
+ CID_OFFSET_MATCH(init, init);
+ CID_OFFSET_MATCH(exit, exit);
+#ifdef CONFIG_EXT_GROUP_SCHED
+ CID_OFFSET_MATCH(cgroup_init, cgroup_init);
+ CID_OFFSET_MATCH(cgroup_exit, cgroup_exit);
+ CID_OFFSET_MATCH(cgroup_prep_move, cgroup_prep_move);
+ CID_OFFSET_MATCH(cgroup_move, cgroup_move);
+ CID_OFFSET_MATCH(cgroup_cancel_move, cgroup_cancel_move);
+ CID_OFFSET_MATCH(cgroup_set_weight, cgroup_set_weight);
+ CID_OFFSET_MATCH(cgroup_set_bandwidth, cgroup_set_bandwidth);
+ CID_OFFSET_MATCH(cgroup_set_idle, cgroup_set_idle);
+#endif
+ /* renamed callbacks must occupy the same slot as their cpu-form sibling */
+ CID_OFFSET_MATCH(select_cpu, select_cid);
+ CID_OFFSET_MATCH(set_cpumask, set_cmask);
+ CID_OFFSET_MATCH(cpu_online, cid_online);
+ CID_OFFSET_MATCH(cpu_offline, cid_offline);
+ CID_OFFSET_MATCH(dump_cpu, dump_cid);
+ /* @priv tail must align since both share the same data block */
+ CID_OFFSET_MATCH(priv, priv);
+ /*
+ * cid-form must end exactly at @priv - validate_ops() skips
+ * cpu_acquire/cpu_release for cid-form because reading those fields
+ * past the BPF allocation would be UB.
+ */
+ BUILD_BUG_ON(sizeof(struct sched_ext_ops_cid) !=
+ offsetofend(struct sched_ext_ops, priv));
+#undef CID_OFFSET_MATCH
+
/*
* kfunc registration can't be done from init_sched_ext_class() as
* register_btf_kfunc_id_set() needs most of the system to be up.
@@ -10100,6 +10339,12 @@ static int __init scx_init(void)
return ret;
}
+ ret = register_bpf_struct_ops(&bpf_sched_ext_ops_cid, sched_ext_ops_cid);
+ if (ret) {
+ pr_err("sched_ext: Failed to register cid struct_ops (%d)\n", ret);
+ return ret;
+ }
+
ret = register_pm_notifier(&scx_pm_notifier);
if (ret) {
pr_err("sched_ext: Failed to register PM notifier (%d)\n", ret);
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
index 607937d9e4d1..bdd8ef8eae3d 100644
--- a/kernel/sched/ext_cid.c
+++ b/kernel/sched/ext_cid.c
@@ -7,6 +7,14 @@
*/
#include <linux/cacheinfo.h>
+/*
+ * Per-cpu scratch cmask used by scx_call_op_set_cpumask() to synthesize a
+ * cmask from a cpumask. Allocated alongside the cid arrays on first enable
+ * and never freed. Sized to the full cid space. Caller holds rq lock so
+ * this_cpu_ptr is safe.
+ */
+struct scx_cmask __percpu *scx_set_cmask_scratch;
+
/*
* cid tables.
*
@@ -46,6 +54,7 @@ static s32 scx_cid_arrays_alloc(void)
u32 npossible = num_possible_cpus();
s16 *cid_to_cpu, *cpu_to_cid;
struct scx_cid_topo *cid_topo;
+ struct scx_cmask __percpu *set_cmask_scratch;
if (scx_cid_to_cpu_tbl)
return 0;
@@ -53,17 +62,22 @@ static s32 scx_cid_arrays_alloc(void)
cid_to_cpu = kzalloc_objs(*scx_cid_to_cpu_tbl, npossible, GFP_KERNEL);
cpu_to_cid = kzalloc_objs(*scx_cpu_to_cid_tbl, nr_cpu_ids, GFP_KERNEL);
cid_topo = kmalloc_objs(*scx_cid_topo, npossible, GFP_KERNEL);
+ set_cmask_scratch = __alloc_percpu(struct_size(set_cmask_scratch, bits,
+ SCX_CMASK_NR_WORDS(npossible)),
+ sizeof(u64));
- if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+ if (!cid_to_cpu || !cpu_to_cid || !cid_topo || !set_cmask_scratch) {
kfree(cid_to_cpu);
kfree(cpu_to_cid);
kfree(cid_topo);
+ free_percpu(set_cmask_scratch);
return -ENOMEM;
}
WRITE_ONCE(scx_cid_to_cpu_tbl, cid_to_cpu);
WRITE_ONCE(scx_cpu_to_cid_tbl, cpu_to_cid);
WRITE_ONCE(scx_cid_topo, cid_topo);
+ WRITE_ONCE(scx_set_cmask_scratch, set_cmask_scratch);
return 0;
}
@@ -208,6 +222,27 @@ s32 scx_cid_init(struct scx_sched *sch)
return 0;
}
+/**
+ * scx_cpumask_to_cmask - Translate a kernel cpumask into a cmask
+ * @src: source cpumask
+ * @dst: cmask to write
+ *
+ * Initialize @dst to cover the full cid space [0, num_possible_cpus()) and
+ * set the bit for each cid whose cpu is in @src.
+ */
+void scx_cpumask_to_cmask(const struct cpumask *src, struct scx_cmask *dst)
+{
+ s32 cpu;
+
+ scx_cmask_init(dst, 0, num_possible_cpus());
+ for_each_cpu(cpu, src) {
+ s32 cid = __scx_cpu_to_cid(cpu);
+
+ if (cid >= 0)
+ __scx_cmask_set(dst, cid);
+ }
+}
+
__bpf_kfunc_start_defs();
/**
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index c3c429d2c8e2..f41d48afb7d1 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -53,6 +53,7 @@ extern struct btf_id_set8 scx_kfunc_ids_init;
s32 scx_cid_init(struct scx_sched *sch);
int scx_cid_kfunc_init(void);
+void scx_cpumask_to_cmask(const struct cpumask *src, struct scx_cmask *dst);
/**
* cid_valid - Verify a cid value, to be used on ops input args
@@ -127,6 +128,14 @@ static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
return __scx_cpu_to_cid(cpu);
}
+/**
+ * scx_is_cid_type - Test whether the active scheduler hierarchy is cid-form
+ */
+static inline bool scx_is_cid_type(void)
+{
+ return static_branch_unlikely(&__scx_is_cid_type);
+}
+
static inline bool __scx_cmask_contains(const struct scx_cmask *m, u32 cid)
{
return likely(cid >= m->base && cid < m->base + m->nr_bits);
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 860c4634f60e..41785f65bbb2 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -788,7 +788,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
*/
if (SCX_HAS_OP(sch, update_idle) && do_notify &&
!scx_bypassing(sch, cpu_of(rq)))
- SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle);
+ SCX_CALL_OP(sch, update_idle, rq, scx_cpu_arg(cpu_of(rq)), idle);
}
static void reset_idle_masks(struct sched_ext_ops *ops)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 919d4aa08656..e5f52986d317 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -847,6 +847,93 @@ struct sched_ext_ops {
void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
};
+/**
+ * struct sched_ext_ops_cid - cid-form alternative to struct sched_ext_ops
+ *
+ * Mirrors struct sched_ext_ops with cpu/cpumask substituted with cid/cmask
+ * where applicable. Layout up to and including @priv matches sched_ext_ops
+ * byte-for-byte (verified by BUILD_BUG_ON checks at scx_init() time) so
+ * shared field offsets work for both struct types in bpf_scx_init_member()
+ * and bpf_scx_check_member(). The deprecated cpu_acquire/cpu_release
+ * callbacks at the tail of sched_ext_ops are omitted here entirely.
+ *
+ * Differences from sched_ext_ops:
+ * - select_cpu -> select_cid (returns cid)
+ * - dispatch -> dispatch (cpu arg is now cid)
+ * - update_idle -> update_idle (cpu arg is now cid)
+ * - set_cpumask -> set_cmask (cmask instead of cpumask)
+ * - cpu_online -> cid_online
+ * - cpu_offline -> cid_offline
+ * - dump_cpu -> dump_cid
+ * - cpu_acquire/cpu_release -> not present (deprecated in sched_ext_ops)
+ *
+ * BPF schedulers using this type cannot call cpu-form scx_bpf_* kfuncs;
+ * use the cid-form variants instead. Enforced at BPF verifier time via
+ * scx_kfunc_context_filter() branching on prog->aux->st_ops.
+ *
+ * See sched_ext_ops for callback documentation.
+ */
+struct sched_ext_ops_cid {
+ s32 (*select_cid)(struct task_struct *p, s32 prev_cid, u64 wake_flags);
+ void (*enqueue)(struct task_struct *p, u64 enq_flags);
+ void (*dequeue)(struct task_struct *p, u64 deq_flags);
+ void (*dispatch)(s32 cid, struct task_struct *prev);
+ void (*tick)(struct task_struct *p);
+ void (*runnable)(struct task_struct *p, u64 enq_flags);
+ void (*running)(struct task_struct *p);
+ void (*stopping)(struct task_struct *p, bool runnable);
+ void (*quiescent)(struct task_struct *p, u64 deq_flags);
+ bool (*yield)(struct task_struct *from, struct task_struct *to);
+ bool (*core_sched_before)(struct task_struct *a,
+ struct task_struct *b);
+ void (*set_weight)(struct task_struct *p, u32 weight);
+ void (*set_cmask)(struct task_struct *p,
+ const struct scx_cmask *cmask);
+ void (*update_idle)(s32 cid, bool idle);
+ s32 (*init_task)(struct task_struct *p,
+ struct scx_init_task_args *args);
+ void (*exit_task)(struct task_struct *p,
+ struct scx_exit_task_args *args);
+ void (*enable)(struct task_struct *p);
+ void (*disable)(struct task_struct *p);
+ void (*dump)(struct scx_dump_ctx *ctx);
+ void (*dump_cid)(struct scx_dump_ctx *ctx, s32 cid, bool idle);
+ void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p);
+#ifdef CONFIG_EXT_GROUP_SCHED
+ s32 (*cgroup_init)(struct cgroup *cgrp,
+ struct scx_cgroup_init_args *args);
+ void (*cgroup_exit)(struct cgroup *cgrp);
+ s32 (*cgroup_prep_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+ void (*cgroup_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+ void (*cgroup_cancel_move)(struct task_struct *p,
+ struct cgroup *from, struct cgroup *to);
+ void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
+ void (*cgroup_set_bandwidth)(struct cgroup *cgrp,
+ u64 period_us, u64 quota_us, u64 burst_us);
+ void (*cgroup_set_idle)(struct cgroup *cgrp, bool idle);
+#endif /* CONFIG_EXT_GROUP_SCHED */
+ s32 (*sub_attach)(struct scx_sub_attach_args *args);
+ void (*sub_detach)(struct scx_sub_detach_args *args);
+ void (*cid_online)(s32 cid);
+ void (*cid_offline)(s32 cid);
+ s32 (*init)(void);
+ void (*exit)(struct scx_exit_info *info);
+
+ /* Data fields - must match sched_ext_ops layout exactly */
+ u32 dispatch_max_batch;
+ u64 flags;
+ u32 timeout_ms;
+ u32 exit_dump_len;
+ u64 hotplug_seq;
+ u64 sub_cgroup_id;
+ char name[SCX_OPS_NAME_LEN];
+
+ /* internal use only, must be NULL */
+ void __rcu *priv;
+};
+
enum scx_opi {
SCX_OPI_BEGIN = 0,
SCX_OPI_NORMAL_BEGIN = 0,
@@ -1003,7 +1090,18 @@ struct scx_sched_pnode {
};
struct scx_sched {
- struct sched_ext_ops ops;
+ /*
+ * cpu-form and cid-form ops share field offsets up to .priv (verified
+ * by BUILD_BUG_ON in scx_init()). The anonymous union lets the kernel
+ * access either view of the same storage without function-pointer
+ * casts: use .ops for cpu-form and shared fields, .ops_cid for the
+ * cid-renamed callbacks (set_cmask, select_cid, cid_online, ...).
+ */
+ union {
+ struct sched_ext_ops ops;
+ struct sched_ext_ops_cid ops_cid;
+ };
+ bool is_cid_type; /* true if registered via bpf_sched_ext_ops_cid */
DECLARE_BITMAP(has_op, SCX_OPI_END);
/*
@@ -1360,6 +1458,15 @@ enum scx_ops_state {
extern struct scx_sched __rcu *scx_root;
DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
+extern struct scx_cmask __percpu *scx_set_cmask_scratch;
+
+/*
+ * True when the currently loaded scheduler hierarchy is cid-form. All scheds
+ * in a hierarchy share one form, so this single key tells callsites which
+ * view to use without per-sch dereferences. Use scx_is_cid_type() to test.
+ */
+DECLARE_STATIC_KEY_FALSE(__scx_is_cid_type);
+
int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 6b9d054c3e4f..87f15f296234 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -446,4 +446,16 @@ static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags)
__VA_ARGS__, \
};
+/*
+ * Define a cid-form sched_ext_ops. Programs targeting this struct_ops type
+ * use cid-form callback signatures (select_cid, set_cmask, cid_online/offline,
+ * dispatch with cid arg, etc.) and may only call the cid-form scx_bpf_*
+ * kfuncs (kick_cid, task_cid, this_cid, ...).
+ */
+#define SCX_OPS_CID_DEFINE(__name, ...) \
+ SEC(".struct_ops.link") \
+ struct sched_ext_ops_cid __name = { \
+ __VA_ARGS__, \
+ };
+
#endif /* __SCX_COMPAT_BPF_H */
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread

* [PATCH 13/17] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (11 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 12/17] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 14/17] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline Tejun Heo
` (5 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou, Zhao Mengmeng
cid and cpu are both small s32s, trivially confused when a cid-form
scheduler calls a cpu-keyed kfunc. Reject cid-form programs that
reference any kfunc in the new scx_kfunc_ids_cpu_only at verifier load
time.
The reverse direction is intentionally permissive: cpu-form schedulers
can freely call cid-form kfuncs to ease a gradual cpumask -> cid
migration.
The check sits in scx_kfunc_context_filter() right after the SCX
struct_ops gate and before the any/idle allow and per-op allow-list
checks, so it catches cpu-only kfuncs regardless of which set they
belong to (any, idle, or select_cpu).
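As a rough illustration (not part of the patch; qmap-style names such as
SHARED_DSQ are borrowed for the sketch), the failure mode being closed off
looks like:

  /*
   * Sketch of a cid-form enqueue callback that leaks a cid into a
   * cpu-keyed kfunc. Before this patch it would load and kick whichever
   * CPU happens to share the number; now the verifier rejects the
   * program with -EACCES and scx_bpf_kick_cid() must be used instead.
   */
  void BPF_STRUCT_OPS(sketch_enqueue, struct task_struct *p, u64 enq_flags)
  {
  	s32 cid = scx_bpf_task_cid(p);

  	scx_bpf_dsq_insert(p, SHARED_DSQ, SCX_SLICE_DFL, enq_flags);
  	scx_bpf_kick_cpu(cid, SCX_KICK_IDLE);	/* cpu-only kfunc: now -EACCES */
  }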
v2: Sync per-entry kfunc flags with their primary declarations (Zhao).
pahole intersects flags across BTF_ID_FLAGS() occurrences, so
omitting them drops the flags globally.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
Reviewed-by: Zhao Mengmeng <zhaomengmeng@kylinos.cn>
---
kernel/sched/ext.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8e3c60affc0b..d8f8fca5ded9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -10095,6 +10095,47 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = {
.filter = scx_kfunc_context_filter,
};
+/*
+ * cpu-form kfuncs that are forbidden from cid-form schedulers
+ * (bpf_sched_ext_ops_cid). Programs targeting the cid struct_ops type must
+ * use the cid-form alternative (cid/cmask kfuncs).
+ *
+ * Membership overlaps with scx_kfunc_ids_{any,idle,select_cpu}; the filter
+ * tests this set independently and rejects matches before the per-op
+ * allow-list check runs.
+ *
+ * pahole/resolve_btfids scans every BTF_ID_FLAGS() at build time and
+ * intersects flags across duplicate entries, so each entry must carry the
+ * same flags as the kfunc's primary declaration; otherwise the flags get
+ * dropped globally.
+ */
+BTF_KFUNCS_START(scx_kfunc_ids_cpu_only)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, scx_bpf_cpu_node, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_IMPLICIT_ARGS | KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_IMPLICIT_ARGS | KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node, KF_IMPLICIT_ARGS | KF_ACQUIRE)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node, KF_IMPLICIT_ARGS | KF_RCU)
+BTF_KFUNCS_END(scx_kfunc_ids_cpu_only)
+
/*
* Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc
* group; an op may permit zero or more groups, with the union expressed in
@@ -10158,6 +10199,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id);
bool in_idle = btf_id_set8_contains(&scx_kfunc_ids_idle, kfunc_id);
bool in_any = btf_id_set8_contains(&scx_kfunc_ids_any, kfunc_id);
+ bool in_cpu_only = btf_id_set8_contains(&scx_kfunc_ids_cpu_only, kfunc_id);
u32 moff, flags;
/* Not an SCX kfunc - allow. */
@@ -10195,6 +10237,15 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
prog->aux->st_ops != &bpf_sched_ext_ops_cid)
return -EACCES;
+ /*
+ * cid-form schedulers must use cid/cmask kfuncs. cid and cpu are both
+ * small s32s and trivially confused, so cpu-only kfuncs are rejected at
+ * load time. The reverse (cpu-form calling cid-form kfuncs) is
+ * intentionally permissive to ease gradual cpumask -> cid migration.
+ */
+ if (prog->aux->st_ops == &bpf_sched_ext_ops_cid && in_cpu_only)
+ return -EACCES;
+
/* SCX struct_ops: check the per-op allow list. */
if (in_any || in_idle)
return 0;
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread

* [PATCH 14/17] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (12 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 13/17] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 15/17] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick Tejun Heo
` (4 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
The cid mapping is built from the online cpu set at scheduler enable
and stays valid for that set; routine hotplug invalidates it. The
default cid behavior is to restart the scheduler so the mapping gets
rebuilt against the new online set, and that requires not implementing
cpu_online / cpu_offline (which suppress the kernel's ACT_RESTART).
Drop the two ops along with their print_cpus() helper - the cluster
view was only useful as a hotplug demo and is meaningless over the
dense cid space the scheduler will move to. Wire main() to handle the
ACT_RESTART exit by reopening the skel and reattaching, matching the
pattern in scx_simple / scx_central / scx_flatcg etc. Reset optind so
getopt re-parses argv into the fresh skel rodata each iteration.
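Condensed from the resulting main() (option parsing and error handling
elided; a sketch rather than the verbatim code), the loop is roughly:

  restart:
  	optind = 1;		/* getopt() re-parses argv into the fresh rodata */
  	skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
  	/* ... parse options into skel->rodata ... */
  	SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
  	link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap);

  	/* ... stats loop until exit or signal ... */

  	bpf_link__destroy(link);
  	ecode = UEI_REPORT(skel, uei);
  	scx_qmap__destroy(skel);
  	if (UEI_ECODE_RESTART(ecode))	/* hotplug exits report ACT_RESTART */
  		goto restart;
  	return 0;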
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 62 ----------------------------------
tools/sched_ext/scx_qmap.c | 13 +++----
2 files changed, 7 insertions(+), 68 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index ba4879031dac..78a1dd118c7e 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -843,63 +843,6 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
cgrp->kn->id, period_us, quota_us, burst_us);
}
-/*
- * Print out the online and possible CPU map using bpf_printk() as a
- * demonstration of using the cpumask kfuncs and ops.cpu_on/offline().
- */
-static void print_cpus(void)
-{
- const struct cpumask *possible, *online;
- s32 cpu;
- char buf[128] = "", *p;
- int idx;
-
- possible = scx_bpf_get_possible_cpumask();
- online = scx_bpf_get_online_cpumask();
-
- idx = 0;
- bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) {
- if (!(p = MEMBER_VPTR(buf, [idx++])))
- break;
- if (bpf_cpumask_test_cpu(cpu, online))
- *p++ = 'O';
- else if (bpf_cpumask_test_cpu(cpu, possible))
- *p++ = 'X';
- else
- *p++ = ' ';
-
- if ((cpu & 7) == 7) {
- if (!(p = MEMBER_VPTR(buf, [idx++])))
- break;
- *p++ = '|';
- }
- }
- buf[sizeof(buf) - 1] = '\0';
-
- scx_bpf_put_cpumask(online);
- scx_bpf_put_cpumask(possible);
-
- bpf_printk("CPUS: |%s", buf);
-}
-
-void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
-{
- if (print_msgs) {
- bpf_printk("CPU %d coming online", cpu);
- /* @cpu is already online at this point */
- print_cpus();
- }
-}
-
-void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
-{
- if (print_msgs) {
- bpf_printk("CPU %d going offline", cpu);
- /* @cpu is still online at this point */
- print_cpus();
- }
-}
-
struct monitor_timer {
struct bpf_timer timer;
};
@@ -1078,9 +1021,6 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
slab[i].next_free = (i + 1 < max_tasks) ? &slab[i + 1] : NULL;
qa.task_free_head = &slab[0];
- if (print_msgs && !sub_cgroup_id)
- print_cpus();
-
ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
if (ret) {
scx_bpf_error("failed to create DSQ %d (%d)", SHARED_DSQ, ret);
@@ -1174,8 +1114,6 @@ SCX_OPS_DEFINE(qmap_ops,
.cgroup_set_bandwidth = (void *)qmap_cgroup_set_bandwidth,
.sub_attach = (void *)qmap_sub_attach,
.sub_detach = (void *)qmap_sub_detach,
- .cpu_online = (void *)qmap_cpu_online,
- .cpu_offline = (void *)qmap_cpu_offline,
.init = (void *)qmap_init,
.exit = (void *)qmap_exit,
.timeout_ms = 5000U,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 725c4880058d..99408b1bb1ec 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -67,12 +67,14 @@ int main(int argc, char **argv)
struct bpf_link *link;
struct qmap_arena *qa;
__u32 test_error_cnt = 0;
+ __u64 ecode;
int opt;
libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
signal(SIGTERM, sigint_handler);
-
+restart:
+ optind = 1;
skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
@@ -184,11 +186,10 @@ int main(int argc, char **argv)
}
bpf_link__destroy(link);
- UEI_REPORT(skel, uei);
+ ecode = UEI_REPORT(skel, uei);
scx_qmap__destroy(skel);
- /*
- * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart
- * on CPU hotplug events.
- */
+
+ if (UEI_ECODE_RESTART(ecode))
+ goto restart;
return 0;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread

* [PATCH 15/17] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (13 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 14/17] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-28 20:35 ` [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
` (3 subsequent siblings)
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Switch qmap's idle-cpu picker from scx_bpf_pick_idle_cpu() to a
BPF-side bitmap scan, still under cpu-form struct_ops. qa_idle_cids
tracks idle cids (updated in update_idle / cpu_offline) and each
task's taskc->cpus_allowed tracks its allowed cids (built in
set_cpumask / init_task); select_cpu / enqueue scan the intersection
for an idle cid. Callbacks translate cpu <-> cid on entry;
the cid port in the following patch drops those translations.
The scan is bare-bones - no core preference or other topology-aware
picks like the in-kernel picker - but qmap is a demo and this is
enough to exercise the plumbing.
v3: qmap_init() refuses to load when nr_cids exceeds SCX_QMAP_MAX_CPUS;
task_ctx's flex array would otherwise overflow into the next slab
entry. (Sashiko)
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 137 +++++++++++++++++++++++++++++----
1 file changed, 121 insertions(+), 16 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 78a1dd118c7e..88ef3936937d 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -72,6 +72,13 @@ struct {
struct qmap_arena __arena qa;
+/*
+ * Global idle-cid tracking, maintained via update_idle / cpu_offline and
+ * scanned by the direct-dispatch path. Allocated in qmap_init() from one
+ * arena page, sized to the full cid space.
+ */
+struct scx_cmask __arena *qa_idle_cids;
+
/* Per-queue locks. Each in its own .data section as bpf_res_spin_lock requires. */
__hidden struct bpf_res_spin_lock qa_q_lock0 SEC(".data.qa_q_lock0");
__hidden struct bpf_res_spin_lock qa_q_lock1 SEC(".data.qa_q_lock1");
@@ -132,8 +139,18 @@ struct task_ctx {
bool force_local; /* Dispatch directly to local_dsq */
bool highpri;
u64 core_sched_seq;
+ struct scx_cmask cpus_allowed; /* per-task affinity in cid space */
};
+/*
+ * Slab stride for task_ctx. cpus_allowed's flex array bits[] overlaps the
+ * tail bytes appended per entry; struct_size() gives the actual per-entry
+ * footprint.
+ */
+#define TASK_CTX_STRIDE \
+ struct_size_t(struct task_ctx, cpus_allowed.bits, \
+ CMASK_NR_WORDS(SCX_QMAP_MAX_CPUS))
+
/* All task_ctx pointers are arena pointers. */
typedef struct task_ctx __arena task_ctx_t;
@@ -161,20 +178,37 @@ static int qmap_spin_lock(struct bpf_res_spin_lock *lock)
return 0;
}
-static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
+/*
+ * Try prev_cpu's cid, then scan taskc->cpus_allowed AND qa_idle_cids
+ * round-robin from prev_cid + 1. Atomic claim retries on race; bounded
+ * by IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
+ */
+#define IDLE_PICK_RETRIES 16
+
+static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
+ task_ctx_t *taskc)
{
- s32 cpu;
+ u32 nr_cids = scx_bpf_nr_cids();
+ s32 prev_cid, cid;
+ u32 i;
if (!always_enq_immed && p->nr_cpus_allowed == 1)
return prev_cpu;
- if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
+ prev_cid = scx_bpf_cpu_to_cid(prev_cpu);
+ if (cmask_test_and_clear(qa_idle_cids, prev_cid))
return prev_cpu;
- cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
- if (cpu >= 0)
- return cpu;
-
+ cid = prev_cid;
+ bpf_for(i, 0, IDLE_PICK_RETRIES) {
+ cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
+ qa_idle_cids, cid + 1);
+ barrier_var(cid);
+ if (cid >= nr_cids)
+ return -1;
+ if (cmask_test_and_clear(qa_idle_cids, cid))
+ return scx_bpf_cid_to_cpu(cid);
+ }
return -1;
}
@@ -286,7 +320,7 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
return prev_cpu;
- cpu = pick_direct_dispatch_cpu(p, prev_cpu);
+ cpu = pick_direct_dispatch_cpu(p, prev_cpu, taskc);
if (cpu >= 0) {
taskc->force_local = true;
@@ -379,7 +413,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
/* if select_cpu() wasn't called, try direct dispatch */
if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
- (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
+ (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p), taskc)) >= 0) {
__sync_fetch_and_add(&qa.nr_ddsp_from_enq, 1);
scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
return;
@@ -726,6 +760,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init_task, struct task_struct *p,
taskc->force_local = false;
taskc->highpri = false;
taskc->core_sched_seq = 0;
+ cmask_init(&taskc->cpus_allowed, 0, scx_bpf_nr_cids());
+ bpf_rcu_read_lock();
+ cmask_from_cpumask(&taskc->cpus_allowed, p->cpus_ptr);
+ bpf_rcu_read_unlock();
v = bpf_task_storage_get(&task_ctx_stor, p, NULL,
BPF_LOCAL_STORAGE_GET_F_CREATE);
@@ -843,6 +881,48 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
cgrp->kn->id, period_us, quota_us, burst_us);
}
+void BPF_STRUCT_OPS(qmap_update_idle, s32 cpu, bool idle)
+{
+ s32 cid = scx_bpf_cpu_to_cid(cpu);
+
+ QMAP_TOUCH_ARENA();
+ if (cid < 0)
+ return;
+ if (idle)
+ cmask_set(qa_idle_cids, cid);
+ else
+ cmask_clear(qa_idle_cids, cid);
+}
+
+/*
+ * The cpumask received here is kernel-address memory; walk it bit by bit
+ * (bpf_cpumask_test_cpu handles the access), convert each set cpu to its
+ * cid, and populate the arena-resident taskc cmask.
+ */
+void BPF_STRUCT_OPS(qmap_set_cpumask, struct task_struct *p,
+ const struct cpumask *cpumask)
+{
+ task_ctx_t *taskc;
+ u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+ s32 cpu;
+
+ taskc = lookup_task_ctx(p);
+ if (!taskc)
+ return;
+
+ cmask_zero(&taskc->cpus_allowed);
+
+ bpf_for(cpu, 0, nr_cpu_ids) {
+ s32 cid;
+
+ if (!bpf_cpumask_test_cpu(cpu, cpumask))
+ continue;
+ cid = scx_bpf_cpu_to_cid(cpu);
+ if (cid >= 0)
+ __cmask_set(&taskc->cpus_allowed, cid);
+ }
+}
+
struct monitor_timer {
struct bpf_timer timer;
};
@@ -992,34 +1072,57 @@ static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer)
s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
- task_ctx_t *slab;
+ u8 __arena *slab;
u32 nr_pages, key = 0, i;
struct bpf_timer *timer;
s32 ret;
+ if (scx_bpf_nr_cids() > SCX_QMAP_MAX_CPUS) {
+ scx_bpf_error("nr_cids=%u exceeds SCX_QMAP_MAX_CPUS=%d",
+ scx_bpf_nr_cids(), SCX_QMAP_MAX_CPUS);
+ return -EINVAL;
+ }
+
/*
* Allocate the task_ctx slab in arena and thread the entire slab onto
- * the free list. max_tasks is set by userspace before load.
+ * the free list. max_tasks is set by userspace before load. Each entry
+ * is TASK_CTX_STRIDE bytes - task_ctx's trailing cpus_allowed flex
+ * array extends into the stride tail.
*/
if (!max_tasks) {
scx_bpf_error("max_tasks must be > 0");
return -EINVAL;
}
- nr_pages = (max_tasks * sizeof(struct task_ctx) + PAGE_SIZE - 1) / PAGE_SIZE;
+ nr_pages = (max_tasks * TASK_CTX_STRIDE + PAGE_SIZE - 1) / PAGE_SIZE;
slab = bpf_arena_alloc_pages(&arena, NULL, nr_pages, NUMA_NO_NODE, 0);
if (!slab) {
scx_bpf_error("failed to allocate task_ctx slab");
return -ENOMEM;
}
- qa.task_ctxs = slab;
+ qa.task_ctxs = (task_ctx_t *)slab;
bpf_for(i, 0, 5)
qa.fifos[i].idx = i;
- bpf_for(i, 0, max_tasks)
- slab[i].next_free = (i + 1 < max_tasks) ? &slab[i + 1] : NULL;
- qa.task_free_head = &slab[0];
+ bpf_for(i, 0, max_tasks) {
+ task_ctx_t *cur = (task_ctx_t *)(slab + i * TASK_CTX_STRIDE);
+ task_ctx_t *next = (i + 1 < max_tasks) ?
+ (task_ctx_t *)(slab + (i + 1) * TASK_CTX_STRIDE) : NULL;
+ cur->next_free = next;
+ }
+ qa.task_free_head = (task_ctx_t *)slab;
+
+ /*
+ * Allocate and initialize the idle cmask. Starts empty - update_idle
+ * fills it as cpus enter idle.
+ */
+ qa_idle_cids = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (!qa_idle_cids) {
+ scx_bpf_error("failed to allocate idle cmask");
+ return -ENOMEM;
+ }
+ cmask_init(qa_idle_cids, 0, scx_bpf_nr_cids());
ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
if (ret) {
@@ -1104,6 +1207,8 @@ SCX_OPS_DEFINE(qmap_ops,
.dispatch = (void *)qmap_dispatch,
.tick = (void *)qmap_tick,
.core_sched_before = (void *)qmap_core_sched_before,
+ .set_cpumask = (void *)qmap_set_cpumask,
+ .update_idle = (void *)qmap_update_idle,
.init_task = (void *)qmap_init_task,
.exit_task = (void *)qmap_exit_task,
.dump = (void *)qmap_dump,
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread

* [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (14 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 15/17] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-29 12:47 ` Changwoo Min
2026-04-28 20:35 ` [PATCH 17/17] sched_ext: Require cid-form struct_ops for sub-sched support Tejun Heo
` (2 subsequent siblings)
18 siblings, 1 reply; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Flip qmap's struct_ops to bpf_sched_ext_ops_cid. The kernel now passes
cids and cmasks to callbacks directly, so the per-callback cpu<->cid
translations that the prior patch added drop out and cpu_ctxs[] is
reindexed by cid. Cpu-form kfunc calls switch to their cid-form
counterparts.
The cpu-only kfuncs (idle/any pick, cpumask iteration) have no cid
substitute. Their callers already moved to cmask scans against
qa_idle_cids and taskc->cpus_allowed in the prior patch, so the kfunc
calls drop here without behavior changes.
set_cmask is wired up via cmask_copy_from_kernel() to copy the
kernel-supplied cmask into the arena-resident taskc cmask. The
cpuperf monitor now walks cids and reads the cid-form perf kfuncs
(scx_bpf_cidperf_cap/cur) instead of the cpu-form ones.
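For reference, the minimal shape of a cid-form declaration (an illustrative
skeleton, not part of qmap; SCX_OPS_CID_DEFINE comes from the compat header
added earlier in the series):

  s32 BPF_STRUCT_OPS(mini_select_cid, struct task_struct *p, s32 prev_cid,
  		   u64 wake_flags)
  {
  	/* arguments arrive in cid space; no cpu<->cid translation needed */
  	return prev_cid;
  }

  void BPF_STRUCT_OPS(mini_set_cmask, struct task_struct *p,
  		    const struct scx_cmask *cmask)
  {
  	/* the kernel hands a cmask directly; stash it wherever needed */
  }

  SCX_OPS_CID_DEFINE(mini_ops,
  	       .select_cid	= (void *)mini_select_cid,
  	       .set_cmask	= (void *)mini_set_cmask,
  	       .name		= "minimal_cid");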
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
tools/sched_ext/scx_qmap.bpf.c | 231 +++++++++++++++++----------------
tools/sched_ext/scx_qmap.c | 59 ++++++++-
tools/sched_ext/scx_qmap.h | 2 +-
3 files changed, 177 insertions(+), 115 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 88ef3936937d..f55192c7c51a 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -52,6 +52,28 @@ const volatile bool always_enq_immed;
const volatile u32 immed_stress_nth;
const volatile u32 max_tasks;
+/*
+ * Optional cid-override test harness. When cid_override_mode is non-zero,
+ * qmap_init() calls scx_bpf_cid_override() with the caller-supplied arrays
+ * to exercise the kfunc's acceptance and error paths.
+ *
+ * 0 = disabled
+ * 1 = valid reverse mapping
+ * 2 = invalid: duplicate cid assignment
+ * 3 = invalid: non-monotonic shard_start
+ */
+const volatile u32 cid_override_mode;
+const volatile u32 cid_override_nr_cpus;
+const volatile u32 cid_override_nr_shards;
+/*
+ * Arrays live in bss (writable) because scx_bpf_cid_override()'s BPF
+ * verifier signature treats its len-paired pointer as read/write - rodata
+ * fails verification with "write into map forbidden". Userspace populates
+ * them before SCX_OPS_LOAD, same as rodata, and nothing writes them after.
+ */
+s32 cid_override_cpu_to_cid[SCX_QMAP_MAX_CPUS];
+s32 cid_override_shard_start[SCX_QMAP_MAX_CPUS];
+
UEI_DEFINE(uei);
/*
@@ -179,25 +201,24 @@ static int qmap_spin_lock(struct bpf_res_spin_lock *lock)
}
/*
- * Try prev_cpu's cid, then scan taskc->cpus_allowed AND qa_idle_cids
- * round-robin from prev_cid + 1. Atomic claim retries on race; bounded
- * by IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
+ * Try prev_cid, then scan taskc->cpus_allowed AND qa_idle_cids round-robin
+ * from prev_cid + 1. Atomic claim retries on race; bounded by
+ * IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
*/
#define IDLE_PICK_RETRIES 16
-static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
+static s32 pick_direct_dispatch_cid(struct task_struct *p, s32 prev_cid,
task_ctx_t *taskc)
{
u32 nr_cids = scx_bpf_nr_cids();
- s32 prev_cid, cid;
+ s32 cid;
u32 i;
if (!always_enq_immed && p->nr_cpus_allowed == 1)
- return prev_cpu;
+ return prev_cid;
- prev_cid = scx_bpf_cpu_to_cid(prev_cpu);
if (cmask_test_and_clear(qa_idle_cids, prev_cid))
- return prev_cpu;
+ return prev_cid;
cid = prev_cid;
bpf_for(i, 0, IDLE_PICK_RETRIES) {
@@ -207,7 +228,7 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
if (cid >= nr_cids)
return -1;
if (cmask_test_and_clear(qa_idle_cids, cid))
- return scx_bpf_cid_to_cpu(cid);
+ return cid;
}
return -1;
}
@@ -308,25 +329,25 @@ static void qmap_fifo_remove(task_ctx_t *taskc)
bpf_res_spin_unlock(lock);
}
-s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
- s32 prev_cpu, u64 wake_flags)
+s32 BPF_STRUCT_OPS(qmap_select_cid, struct task_struct *p,
+ s32 prev_cid, u64 wake_flags)
{
task_ctx_t *taskc;
- s32 cpu;
+ s32 cid;
if (!(taskc = lookup_task_ctx(p)))
- return prev_cpu;
+ return prev_cid;
if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
- return prev_cpu;
+ return prev_cid;
- cpu = pick_direct_dispatch_cpu(p, prev_cpu, taskc);
+ cid = pick_direct_dispatch_cid(p, prev_cid, taskc);
- if (cpu >= 0) {
+ if (cid >= 0) {
taskc->force_local = true;
- return cpu;
+ return cid;
} else {
- return prev_cpu;
+ return prev_cid;
}
}
@@ -350,12 +371,12 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
static u32 user_cnt, kernel_cnt;
task_ctx_t *taskc;
int idx = weight_to_idx(p->scx.weight);
- s32 cpu;
+ s32 cid;
if (enq_flags & SCX_ENQ_REENQ) {
__sync_fetch_and_add(&qa.nr_reenqueued, 1);
- if (scx_bpf_task_cpu(p) == 0)
- __sync_fetch_and_add(&qa.nr_reenqueued_cpu0, 1);
+ if (scx_bpf_task_cid(p) == 0)
+ __sync_fetch_and_add(&qa.nr_reenqueued_cid0, 1);
}
if (p->flags & PF_KTHREAD) {
@@ -388,14 +409,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
if (!(++immed_stress_cnt % immed_stress_nth)) {
taskc->force_local = false;
- scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p),
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cid(p),
slice_ns, enq_flags);
return;
}
}
/*
- * If qmap_select_cpu() is telling us to or this is the last runnable
+ * If qmap_select_cid() is telling us to or this is the last runnable
* task on the CPU, enqueue locally.
*/
if (taskc->force_local) {
@@ -411,11 +432,11 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
return;
}
- /* if select_cpu() wasn't called, try direct dispatch */
+ /* if select_cid() wasn't called, try direct dispatch */
if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
- (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p), taskc)) >= 0) {
+ (cid = pick_direct_dispatch_cid(p, scx_bpf_task_cid(p), taskc)) >= 0) {
__sync_fetch_and_add(&qa.nr_ddsp_from_enq, 1);
- scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
+ scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cid, slice_ns, enq_flags);
return;
}
@@ -423,15 +444,16 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
* If the task was re-enqueued due to the CPU being preempted by a
* higher priority scheduling class, just re-enqueue the task directly
* on the global DSQ. As we want another CPU to pick it up, find and
- * kick an idle CPU.
+ * kick an idle cid.
*/
if (enq_flags & SCX_ENQ_REENQ) {
- s32 cpu;
+ s32 cid;
scx_bpf_dsq_insert(p, SHARED_DSQ, 0, enq_flags);
- cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
- if (cpu >= 0)
- scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
+ cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
+ qa_idle_cids, 0);
+ if (cid < scx_bpf_nr_cids())
+ scx_bpf_kick_cid(cid, SCX_KICK_IDLE);
return;
}
@@ -483,7 +505,8 @@ static void update_core_sched_head_seq(struct task_struct *p)
static bool dispatch_highpri(bool from_timer)
{
struct task_struct *p;
- s32 this_cpu = bpf_get_smp_processor_id();
+ s32 this_cid = scx_bpf_this_cid();
+ u32 nr_cids = scx_bpf_nr_cids();
/* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
@@ -502,21 +525,29 @@ static bool dispatch_highpri(bool from_timer)
}
/*
- * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
- * is found.
+ * Scan HIGHPRI_DSQ and dispatch until a task that can run here is
+ * found. Prefer this_cid if the task allows it; otherwise RR-scan the
+ * task's cpus_allowed starting after this_cid.
*/
bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
+ task_ctx_t *taskc;
bool dispatched = false;
- s32 cpu;
+ s32 cid;
- if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
- cpu = this_cpu;
+ if (!(taskc = lookup_task_ctx(p)))
+ return false;
+
+ if (cmask_test(&taskc->cpus_allowed, this_cid))
+ cid = this_cid;
else
- cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+ cid = cmask_next_set_wrap(&taskc->cpus_allowed,
+ this_cid + 1);
+ if (cid >= nr_cids)
+ continue;
- if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p, SCX_DSQ_LOCAL_ON | cpu,
+ if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p, SCX_DSQ_LOCAL_ON | cid,
SCX_ENQ_PREEMPT)) {
- if (cpu == this_cpu) {
+ if (cid == this_cid) {
dispatched = true;
__sync_fetch_and_add(&qa.nr_expedited_local, 1);
} else {
@@ -535,7 +566,7 @@ static bool dispatch_highpri(bool from_timer)
return false;
}
-void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
+void BPF_STRUCT_OPS(qmap_dispatch, s32 cid, struct task_struct *prev)
{
struct task_struct *p;
struct cpu_ctx __arena *cpuc;
@@ -563,7 +594,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
}
}
- cpuc = &qa.cpu_ctxs[bpf_get_smp_processor_id()];
+ cpuc = &qa.cpu_ctxs[scx_bpf_this_cid()];
for (i = 0; i < 5; i++) {
/* Advance the dispatch cursor and pick the fifo. */
@@ -628,8 +659,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
* document this class of issue -- other schedulers
* seeing similar warnings can use this as a reference.
*/
- if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
- scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
+ if (!cmask_test(&taskc->cpus_allowed, cid))
+ scx_bpf_kick_cid(scx_bpf_task_cid(p), 0);
batch--;
cpuc->dsp_cnt--;
@@ -668,7 +699,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
{
- struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[bpf_get_smp_processor_id()];
+ struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[scx_bpf_this_cid()];
int idx;
/*
@@ -680,7 +711,7 @@ void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
idx = weight_to_idx(cpuc->avg_weight);
cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
- scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+ scx_bpf_cidperf_set(scx_bpf_task_cid(p), cpuc->cpuperf_target);
}
/*
@@ -828,9 +859,9 @@ void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
}
}
-void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle)
+void BPF_STRUCT_OPS(qmap_dump_cid, struct scx_dump_ctx *dctx, s32 cid, bool idle)
{
- struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cpu];
+ struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cid];
if (suppress_dump || idle)
return;
@@ -881,46 +912,24 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
cgrp->kn->id, period_us, quota_us, burst_us);
}
-void BPF_STRUCT_OPS(qmap_update_idle, s32 cpu, bool idle)
+void BPF_STRUCT_OPS(qmap_update_idle, s32 cid, bool idle)
{
- s32 cid = scx_bpf_cpu_to_cid(cpu);
-
QMAP_TOUCH_ARENA();
- if (cid < 0)
- return;
if (idle)
cmask_set(qa_idle_cids, cid);
else
cmask_clear(qa_idle_cids, cid);
}
-/*
- * The cpumask received here is kernel-address memory; walk it bit by bit
- * (bpf_cpumask_test_cpu handles the access), convert each set cpu to its
- * cid, and populate the arena-resident taskc cmask.
- */
-void BPF_STRUCT_OPS(qmap_set_cpumask, struct task_struct *p,
- const struct cpumask *cpumask)
+void BPF_STRUCT_OPS(qmap_set_cmask, struct task_struct *p,
+ const struct scx_cmask *cmask)
{
task_ctx_t *taskc;
- u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
- s32 cpu;
taskc = lookup_task_ctx(p);
if (!taskc)
return;
-
- cmask_zero(&taskc->cpus_allowed);
-
- bpf_for(cpu, 0, nr_cpu_ids) {
- s32 cid;
-
- if (!bpf_cpumask_test_cpu(cpu, cpumask))
- continue;
- cid = scx_bpf_cpu_to_cid(cpu);
- if (cid >= 0)
- __cmask_set(&taskc->cpus_allowed, cid);
- }
+ cmask_copy_from_kernel(&taskc->cpus_allowed, cmask);
}
struct monitor_timer {
@@ -935,59 +944,49 @@ struct {
} monitor_timer SEC(".maps");
/*
- * Print out the min, avg and max performance levels of CPUs every second to
- * demonstrate the cpuperf interface.
+ * Aggregate cidperf across the first nr_online_cids cids. Post-hotplug
+ * the first-N-are-online invariant drifts, so some cap/cur values may
+ * be stale. For this demo monitor that's fine; the scheduler exits on
+ * the enable-time hotplug_seq mismatch and userspace restarts, which
+ * rebuilds the layout.
*/
static void monitor_cpuperf(void)
{
- u32 nr_cpu_ids;
+ u32 nr_online = scx_bpf_nr_online_cids();
u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
- const struct cpumask *online;
- int i, nr_online_cpus = 0;
-
- nr_cpu_ids = scx_bpf_nr_cpu_ids();
- online = scx_bpf_get_online_cpumask();
-
- bpf_for(i, 0, nr_cpu_ids) {
- struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[i];
- u32 cap, cur;
+ s32 cid;
- if (!bpf_cpumask_test_cpu(i, online))
- continue;
- nr_online_cpus++;
+ QMAP_TOUCH_ARENA();
- /* collect the capacity and current cpuperf */
- cap = scx_bpf_cpuperf_cap(i);
- cur = scx_bpf_cpuperf_cur(i);
+ bpf_for(cid, 0, nr_online) {
+ struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cid];
+ u32 cap = scx_bpf_cidperf_cap(cid);
+ u32 cur = scx_bpf_cidperf_cur(cid);
+ u32 target;
cur_min = cur < cur_min ? cur : cur_min;
cur_max = cur > cur_max ? cur : cur_max;
- /*
- * $cur is relative to $cap. Scale it down accordingly so that
- * it's in the same scale as other CPUs and $cur_sum/$cap_sum
- * makes sense.
- */
- cur_sum += cur * cap / SCX_CPUPERF_ONE;
+ cur_sum += (u64)cur * cap / SCX_CPUPERF_ONE;
cap_sum += cap;
- /* collect target */
- cur = cpuc->cpuperf_target;
- target_sum += cur;
- target_min = cur < target_min ? cur : target_min;
- target_max = cur > target_max ? cur : target_max;
+ target = cpuc->cpuperf_target;
+ target_sum += target;
+ target_min = target < target_min ? target : target_min;
+ target_max = target > target_max ? target : target_max;
}
+ if (!nr_online || !cap_sum)
+ return;
+
qa.cpuperf_min = cur_min;
qa.cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
qa.cpuperf_max = cur_max;
qa.cpuperf_target_min = target_min;
- qa.cpuperf_target_avg = target_sum / nr_online_cpus;
+ qa.cpuperf_target_avg = target_sum / nr_online;
qa.cpuperf_target_max = target_max;
-
- scx_bpf_put_cpumask(online);
}
/*
@@ -1083,6 +1082,18 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
return -EINVAL;
}
+ /*
+ * cid-override test hook. Must run before anything that reads the
+ * cid space (scx_bpf_nr_cids, cmask_init, etc.). On invalid input,
+ * the kfunc calls scx_error() which aborts the scheduler.
+ */
+ if (cid_override_mode) {
+ scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
+ cid_override_nr_cpus * sizeof(s32),
+ (const s32 *)cid_override_shard_start,
+ cid_override_nr_shards * sizeof(s32));
+ }
+
/*
* Allocate the task_ctx slab in arena and thread the entire slab onto
* the free list. max_tasks is set by userspace before load. Each entry
@@ -1199,20 +1210,20 @@ void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args)
}
}
-SCX_OPS_DEFINE(qmap_ops,
+SCX_OPS_CID_DEFINE(qmap_ops,
.flags = SCX_OPS_ENQ_EXITING | SCX_OPS_TID_TO_TASK,
- .select_cpu = (void *)qmap_select_cpu,
+ .select_cid = (void *)qmap_select_cid,
.enqueue = (void *)qmap_enqueue,
.dequeue = (void *)qmap_dequeue,
.dispatch = (void *)qmap_dispatch,
.tick = (void *)qmap_tick,
.core_sched_before = (void *)qmap_core_sched_before,
- .set_cpumask = (void *)qmap_set_cpumask,
+ .set_cmask = (void *)qmap_set_cmask,
.update_idle = (void *)qmap_update_idle,
.init_task = (void *)qmap_init_task,
.exit_task = (void *)qmap_exit_task,
.dump = (void *)qmap_dump,
- .dump_cpu = (void *)qmap_dump_cpu,
+ .dump_cid = (void *)qmap_dump_cid,
.dump_task = (void *)qmap_dump_task,
.cgroup_init = (void *)qmap_cgroup_init,
.cgroup_set_weight = (void *)qmap_cgroup_set_weight,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 99408b1bb1ec..a533542e3ca5 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -43,6 +43,7 @@ const char help_fmt[] =
" -p Switch only tasks on SCHED_EXT policy instead of all\n"
" -I Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n"
" -F COUNT IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n"
+" -C MODE cid-override test (shuffle|bad-dup|bad-mono)\n"
" -v Print libbpf debug messages\n"
" -h Display this help and exit\n";
@@ -73,6 +74,14 @@ int main(int argc, char **argv)
libbpf_set_print(libbpf_print_fn);
signal(SIGINT, sigint_handler);
signal(SIGTERM, sigint_handler);
+
+ if (libbpf_num_possible_cpus() > SCX_QMAP_MAX_CPUS) {
+ fprintf(stderr,
+ "scx_qmap: %d possible CPUs exceeds compile-time cap %d; "
+ "rebuild with larger SCX_QMAP_MAX_CPUS\n",
+ libbpf_num_possible_cpus(), SCX_QMAP_MAX_CPUS);
+ return 1;
+ }
restart:
optind = 1;
skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
@@ -80,7 +89,7 @@ int main(int argc, char **argv)
skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
skel->rodata->max_tasks = 16384;
- while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:vh")) != -1) {
+ while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:C:vh")) != -1) {
switch (opt) {
case 's':
skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -143,6 +152,48 @@ int main(int argc, char **argv)
case 'F':
skel->rodata->immed_stress_nth = strtoul(optarg, NULL, 0);
break;
+ case 'C': {
+ u32 nr_cpus = libbpf_num_possible_cpus();
+ u32 mode, i;
+ s32 shard_sz = 4;
+
+ if (!strcmp(optarg, "shuffle"))
+ mode = 1;
+ else if (!strcmp(optarg, "bad-dup"))
+ mode = 2;
+ else if (!strcmp(optarg, "bad-mono"))
+ mode = 3;
+ else {
+ fprintf(stderr, "unknown cid-override mode '%s'\n", optarg);
+ return 1;
+ }
+ skel->rodata->cid_override_mode = mode;
+ skel->rodata->cid_override_nr_cpus = nr_cpus;
+
+ /* shuffle: reversed cpu_to_cid, bad-dup: dup cid 0, bad-mono: identity */
+ for (i = 0; i < nr_cpus; i++) {
+ if (mode == 1)
+ skel->bss->cid_override_cpu_to_cid[i] = nr_cpus - 1 - i;
+ else
+ skel->bss->cid_override_cpu_to_cid[i] = i;
+ }
+ if (mode == 2 && nr_cpus >= 2)
+ skel->bss->cid_override_cpu_to_cid[1] = 0;
+
+ /* shards of shard_sz each */
+ skel->rodata->cid_override_nr_shards = (nr_cpus + shard_sz - 1) / shard_sz;
+ for (i = 0; i < skel->rodata->cid_override_nr_shards; i++)
+ skel->bss->cid_override_shard_start[i] = i * shard_sz;
+
+ if (mode == 3 && skel->rodata->cid_override_nr_shards >= 3) {
+ /* swap [1] and [2] so shard_start is not monotonically increasing */
+ s32 tmp = skel->bss->cid_override_shard_start[1];
+ skel->bss->cid_override_shard_start[1] =
+ skel->bss->cid_override_shard_start[2];
+ skel->bss->cid_override_shard_start[2] = tmp;
+ }
+ break;
+ }
case 'v':
verbose = true;
break;
@@ -162,9 +213,9 @@ int main(int argc, char **argv)
long nr_enqueued = qa->nr_enqueued;
long nr_dispatched = qa->nr_dispatched;
- printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
+ printf("stats : enq=%lu dsp=%lu delta=%ld reenq/cid0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
- qa->nr_reenqueued, qa->nr_reenqueued_cpu0,
+ qa->nr_reenqueued, qa->nr_reenqueued_cid0,
qa->nr_dequeued,
qa->nr_core_sched_execed,
qa->nr_ddsp_from_enq);
@@ -173,7 +224,7 @@ int main(int argc, char **argv)
qa->nr_expedited_remote,
qa->nr_expedited_from_timer,
qa->nr_expedited_lost);
- if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+ if (__COMPAT_has_ksym("scx_bpf_cidperf_cur"))
printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
qa->cpuperf_min,
qa->cpuperf_avg,
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 9d9af2ad90c6..d15a705d5ac5 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -45,7 +45,7 @@ struct qmap_fifo {
struct qmap_arena {
/* userspace-visible stats */
- __u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0;
+ __u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cid0;
__u64 nr_dequeued, nr_ddsp_from_enq;
__u64 nr_core_sched_execed;
__u64 nr_expedited_local, nr_expedited_remote;
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread

* Re: [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops
2026-04-28 20:35 ` [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
@ 2026-04-29 12:47 ` Changwoo Min
2026-04-29 13:53 ` Andrea Righi
0 siblings, 1 reply; 30+ messages in thread
From: Changwoo Min @ 2026-04-29 12:47 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Andrea Righi
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Cheng-Yang Chou
On 4/29/26 5:35 AM, Tejun Heo wrote:
> @@ -1083,6 +1082,18 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
> 		return -EINVAL;
> 	}
>
> +	/*
> +	 * cid-override test hook. Must run before anything that reads the
> +	 * cid space (scx_bpf_nr_cids, cmask_init, etc.). On invalid input,
> +	 * the kfunc calls scx_error() which aborts the scheduler.
> +	 */
> +	if (cid_override_mode) {
> +		scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
> +				     cid_override_nr_cpus * sizeof(s32),
> +				     (const s32 *)cid_override_shard_start,
> +				     cid_override_nr_shards * sizeof(s32));
> +	}
This causes the following compilation error due to an argument mismatch:
scx_qmap.bpf.c:1093:10: error: too many arguments to function call, expected 2, have 4
 1091 |         scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
      |         ~~~~~~~~~~~~~~~~~~~~
 1092 |                              cid_override_nr_cpus * sizeof(s32),
 1093 |                              (const s32 *)cid_override_shard_start,
      |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1094 |                              cid_override_nr_shards * sizeof(s32));
      |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/changwoo/ws-multics69/dev/linux-tj/tools/sched_ext/include/scx/compat.bpf.h:130:20: note: 'scx_bpf_cid_override' declared here
  130 | static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
      |                    ^                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The correct one should be as follows:
> +	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
> +			     cid_override_nr_cpus * sizeof(s32));
Reviewed-by: Changwoo Min <changwoo@igalia.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops
2026-04-29 12:47 ` Changwoo Min
@ 2026-04-29 13:53 ` Andrea Righi
2026-04-29 16:42 ` Tejun Heo
0 siblings, 1 reply; 30+ messages in thread
From: Andrea Righi @ 2026-04-29 13:53 UTC (permalink / raw)
To: Changwoo Min
Cc: Tejun Heo, David Vernet, sched-ext, Emil Tsalapatis, linux-kernel,
Cheng-Yang Chou
Hello,
On Wed, Apr 29, 2026 at 09:47:12PM +0900, Changwoo Min wrote:
>
> On 4/29/26 5:35 AM, Tejun Heo wrote:
> > @@ -1083,6 +1082,18 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
> > 		return -EINVAL;
> > 	}
> >
> > +	/*
> > +	 * cid-override test hook. Must run before anything that reads the
> > +	 * cid space (scx_bpf_nr_cids, cmask_init, etc.). On invalid input,
> > +	 * the kfunc calls scx_error() which aborts the scheduler.
> > +	 */
> > +	if (cid_override_mode) {
> > +		scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
> > +				     cid_override_nr_cpus * sizeof(s32),
> > +				     (const s32 *)cid_override_shard_start,
> > +				     cid_override_nr_shards * sizeof(s32));
> > +	}
>
> This causes the following compilation error due to an argument mismatch:
>
> scx_qmap.bpf.c:1093:10: error: too many arguments to function call, expected 2, have 4
>  1091 |         scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
>       |         ~~~~~~~~~~~~~~~~~~~~
>  1092 |                              cid_override_nr_cpus * sizeof(s32),
>  1093 |                              (const s32 *)cid_override_shard_start,
>       |                              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>  1094 |                              cid_override_nr_shards * sizeof(s32));
>       |                              ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> /home/changwoo/ws-multics69/dev/linux-tj/tools/sched_ext/include/scx/compat.bpf.h:130:20: note: 'scx_bpf_cid_override' declared here
>   130 | static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
>       |                    ^                    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The correct one should be as follows:
>
> > +	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
> > +			     cid_override_nr_cpus * sizeof(s32));
>
> Reviewed-by: Changwoo Min <changwoo@igalia.com>
And after fixing scx_bpf_cid_override() I'm also getting this with
`scx_qmap -C shuffle`:
0: R1=ctx() R10=fp0
; s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init) @ scx_qmap.bpf.c:1069
0: (b4) w1 = 0 ; R1=0
; u32 nr_pages, key = 0, i; @ scx_qmap.bpf.c:1072
1: (63) *(u32 *)(r10 -4) = r1 ; R1=0 R10=fp0 fp-8=0000????
; if (scx_bpf_nr_cids() > SCX_QMAP_MAX_CPUS) { @ scx_qmap.bpf.c:1076
2: (85) call scx_bpf_nr_cids#110275 ; R0=scalar()
3: (a6) if w0 < 0x401 goto pc+14 18: R10=fp0 fp-8=0000pppp
; if (cid_override_mode) { @ scx_qmap.bpf.c:1087
18: (18) r1 = 0xffffc90000322260 ; R1=map_value(map=scx_qmap.rodata,ks=4,vs=964,imm=608)
20: (61) r1 = *(u32 *)(r1 +0) ;
21: (05) goto pc+0
; scx_bpf_nr_cpu_ids() * (u32)sizeof(s32)); @ scx_qmap.bpf.c:1090
22: (85) call scx_bpf_nr_cpu_ids#110276 ; R0=scalar()
; if (bpf_ksym_exists(scx_bpf_cid_override___compat)) @ compat.bpf.h:132
23: (18) r1 = 0xffffffff81464430 ; R1=rdonly_mem(sz=0)
25: (15) if r1 == 0x0 goto pc+5 ; R1=rdonly_mem(sz=0)
; scx_bpf_nr_cpu_ids() * (u32)sizeof(s32)); @ scx_qmap.bpf.c:1090
26: (64) w0 <<= 2 ; R0=scalar(smin=0,smax=umax=umax32=0xfffffffc,smax32=0x7ffffffc,var_off=(0x0; 0xfffffffc))
; return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz); @ compat.bpf.h:133
27: (18) r1 = 0xffffc90001526000 ; R1=map_value(map=scx_qmap.bss,ks=4,vs=4128)
29: (bc) w2 = w0 ; R0=scalar(id=2,smin=0,smax=umax=umax32=0xfffffffc,smax32=0x7ffffffc,var_off=(0x0; 0xfffffffc)) R2=scalar(id=2,smin=0,smax=umax=umax32=0xfffffffc,smax32=0x7ffffffc,var_off=(0x0; 0xfffffffc))
30: (85) call scx_bpf_cid_override#110197
R2 unbounded memory access, use 'var &= const' or 'if (var < const)'
arg#0 arg#1 memory, len pair leads to invalid memory access
processed 28 insns (limit 1000000) max_states_per_insn 0 total_states 2 peak_states 2 mark_read 0
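The "memory, len pair" rejection comes from the __sz convention: the verifier pairs the pointer argument with the size argument that follows it and requires that size to be provably bounded against the backing array at the call site. A minimal sketch of the idiom (not the exact scx_qmap code; SCX_QMAP_MAX_CPUS is assumed to be the bound of the bss array, as in the scheduler):

	u32 nr = scx_bpf_nr_cpu_ids();

	/* compare first so the verifier knows nr is bounded ... */
	if (nr > SCX_QMAP_MAX_CPUS)
		return -EINVAL;

	/* ... and the byte length stays within cid_override_cpu_to_cid[] */
	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
			     nr * (u32)sizeof(s32));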
The following seems to fix everything for me.
Thanks,
-Andrea
tools/sched_ext/scx_qmap.bpf.c | 26 +++++++++++++++++---------
tools/sched_ext/scx_qmap.c | 16 ++--------------
2 files changed, 19 insertions(+), 23 deletions(-)
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index f55192c7c51aa..800a92fdb6db7 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -63,8 +63,6 @@ const volatile u32 max_tasks;
* 3 = invalid: non-monotonic shard_start
*/
const volatile u32 cid_override_mode;
-const volatile u32 cid_override_nr_cpus;
-const volatile u32 cid_override_nr_shards;
/*
* Arrays live in bss (writable) because scx_bpf_cid_override()'s BPF
* verifier signature treats its len-paired pointer as read/write - rodata
@@ -72,7 +70,6 @@ const volatile u32 cid_override_nr_shards;
* them before SCX_OPS_LOAD, same as rodata, and nothing writes them after.
*/
s32 cid_override_cpu_to_cid[SCX_QMAP_MAX_CPUS];
-s32 cid_override_shard_start[SCX_QMAP_MAX_CPUS];
UEI_DEFINE(uei);
@@ -1073,12 +1070,25 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
{
u8 __arena *slab;
u32 nr_pages, key = 0, i;
+ u32 nr_cids, nr_cpu_ids;
struct bpf_timer *timer;
s32 ret;
- if (scx_bpf_nr_cids() > SCX_QMAP_MAX_CPUS) {
+ nr_cids = scx_bpf_nr_cids();
+ nr_cpu_ids = scx_bpf_nr_cpu_ids();
+
+ /*
+ * Separate compares so the verifier tracks each upper bound; needed for
+ * scx_bpf_cid_override(ptr, nr_cpu_ids * sizeof(s32)) vs bss array size.
+ */
+ if (nr_cids > SCX_QMAP_MAX_CPUS) {
scx_bpf_error("nr_cids=%u exceeds SCX_QMAP_MAX_CPUS=%d",
- scx_bpf_nr_cids(), SCX_QMAP_MAX_CPUS);
+ nr_cids, SCX_QMAP_MAX_CPUS);
+ return -EINVAL;
+ }
+ if (nr_cpu_ids > SCX_QMAP_MAX_CPUS) {
+ scx_bpf_error("nr_cpu_ids=%u exceeds SCX_QMAP_MAX_CPUS=%d",
+ nr_cpu_ids, SCX_QMAP_MAX_CPUS);
return -EINVAL;
}
@@ -1089,9 +1099,7 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
*/
if (cid_override_mode) {
scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
- cid_override_nr_cpus * sizeof(s32),
- (const s32 *)cid_override_shard_start,
- cid_override_nr_shards * sizeof(s32));
+ nr_cpu_ids * (u32)sizeof(s32));
}
/*
@@ -1133,7 +1141,7 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
scx_bpf_error("failed to allocate idle cmask");
return -ENOMEM;
}
- cmask_init(qa_idle_cids, 0, scx_bpf_nr_cids());
+ cmask_init(qa_idle_cids, 0, nr_cids);
ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
if (ret) {
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index a533542e3ca52..f3218610b5e5c 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -155,7 +155,6 @@ int main(int argc, char **argv)
case 'C': {
u32 nr_cpus = libbpf_num_possible_cpus();
u32 mode, i;
- s32 shard_sz = 4;
if (!strcmp(optarg, "shuffle"))
mode = 1;
@@ -168,7 +167,6 @@ int main(int argc, char **argv)
return 1;
}
skel->rodata->cid_override_mode = mode;
- skel->rodata->cid_override_nr_cpus = nr_cpus;
/* shuffle: reversed cpu_to_cid, bad-dup: dup cid 0, bad-mono: identity */
for (i = 0; i < nr_cpus; i++) {
@@ -179,19 +177,9 @@ int main(int argc, char **argv)
}
if (mode == 2 && nr_cpus >= 2)
skel->bss->cid_override_cpu_to_cid[1] = 0;
+ if (mode == 3)
+ skel->bss->cid_override_cpu_to_cid[0] = (s32)nr_cpus;
- /* shards of shard_sz each */
- skel->rodata->cid_override_nr_shards = (nr_cpus + shard_sz - 1) / shard_sz;
- for (i = 0; i < skel->rodata->cid_override_nr_shards; i++)
- skel->bss->cid_override_shard_start[i] = i * shard_sz;
-
- if (mode == 3 && skel->rodata->cid_override_nr_shards >= 3) {
- /* swap [1] and [2] so shard_start is not monotonically increasing */
- s32 tmp = skel->bss->cid_override_shard_start[1];
- skel->bss->cid_override_shard_start[1] =
- skel->bss->cid_override_shard_start[2];
- skel->bss->cid_override_shard_start[2] = tmp;
- }
break;
}
case 'v':
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH 17/17] sched_ext: Require cid-form struct_ops for sub-sched support
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (15 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 16/17] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
@ 2026-04-28 20:35 ` Tejun Heo
2026-04-29 12:49 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Changwoo Min
2026-04-29 13:29 ` Andrea Righi
18 siblings, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-28 20:35 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo,
Cheng-Yang Chou
Sub-scheduler support is tied to the cid-form struct_ops: sub_attach /
sub_detach will communicate allocation via cmask, and the hierarchy assumes
all participants share a single topological cid space. A cpu-form root that
accepts sub-scheds would need cpu <-> cid translation on every cross-sched
interaction, defeating the purpose.
Enforce this at validate_ops():
- A sub-scheduler (scx_parent(sch) non-NULL) must be cid-form.
- A root that exposes sub_attach / sub_detach must be cid-form.
scx_qmap, which is currently the only scheduler demoing sub-sched support,
was converted to cid-form in the preceding patch, so this doesn't cause
breakage.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>
---
kernel/sched/ext.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index d8f8fca5ded9..018c75b7ccf1 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6847,6 +6847,23 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
if (!sch->is_cid_type && (ops->cpu_acquire || ops->cpu_release))
pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
+ /*
+ * Sub-scheduler support is tied to the cid-form struct_ops. A sub-sched
+ * attaches through a cid-form-only interface (sub_attach/sub_detach),
+ * and a root that accepts sub-scheds must expose cid-form state to
+ * them. Reject cpu-form schedulers on either side.
+ */
+ if (!sch->is_cid_type) {
+ if (scx_parent(sch)) {
+ scx_error(sch, "sub-sched requires cid-form struct_ops");
+ return -EINVAL;
+ }
+ if (ops->sub_attach || ops->sub_detach) {
+ scx_error(sch, "sub_attach/sub_detach requires cid-form struct_ops");
+ return -EINVAL;
+ }
+ }
+
return 0;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (16 preceding siblings ...)
2026-04-28 20:35 ` [PATCH 17/17] sched_ext: Require cid-form struct_ops for sub-sched support Tejun Heo
@ 2026-04-29 12:49 ` Changwoo Min
2026-04-29 13:29 ` Andrea Righi
18 siblings, 0 replies; 30+ messages in thread
From: Changwoo Min @ 2026-04-29 12:49 UTC (permalink / raw)
To: Tejun Heo, David Vernet, Andrea Righi
Cc: sched-ext, Emil Tsalapatis, linux-kernel
Thank you, Tejun, for the patchset.
I like it! I made some comments on patches 10 and 16.
Reviewed-by: Changwoo Min <changwoo@igalia.com>
On 4/29/26 5:35 AM, Tejun Heo wrote:
> Hello,
>
> v3 (all from the Sashiko AI review at
> https://sashiko.dev/#/patchset/20260424172721.3458520-1-tj%40kernel.org):
>
> - cid: drop leaked cpus_read_lock() on scx_cid_init() failure;
> BUILD_BUG_ON tightened to NR_CPUS<=8192 to match the BPF cmask
> helpers' CMASK_MAX_WORDS coverage.
> - bpf-struct-size: use offsetof() in struct_size() to match the
> kernel <linux/overflow.h> macro semantics (no inflation from
> trailing struct padding).
> - cmask: cmask_copy_from_kernel() validates src->base==0 via
> probe-read; nr_bits check is bit-level rather than rounded-up
> word-count.
> - cid-qmap-idle: qmap_init() refuses to load when scx_bpf_nr_cids()
> exceeds SCX_QMAP_MAX_CPUS; the task_ctx flex array would otherwise
> overflow into the next slab entry.
>
> v2: https://lore.kernel.org/r/20260424172721.3458520-1-tj@kernel.org
> v1: https://lore.kernel.org/r/20260421071945.3110084-1-tj@kernel.org
>
> This patchset introduces topological CPU IDs (cids) - dense,
> topology-ordered cpu identifiers - and an alternative cid-form struct_ops
> type that lets BPF schedulers operate in cid space directly.
>
> Key pieces:
>
> - cid space: scx_cid_init() walks nodes * LLCs * cores * threads and packs
> a dense cid mapping. The mapping can be overridden via
> scx_bpf_cid_override(). See "Topological CPU IDs" in ext_cid.h for the
> model.
>
> - cmask: a base-windowed bitmap over cid space. Kernel and BPF helpers with
> identical semantics. Used by scx_qmap for per-task affinity and idle-cid
> tracking; meant to be the substrate for sub-sched cid allocation.
>
> - bpf_sched_ext_ops_cid: a parallel struct_ops type whose callbacks take
> cids/cmasks instead of cpus/cpumasks. Kernel translates at the boundary
> via scx_cpu_arg() / scx_cpu_ret(); the two struct types share offsets up
> through @priv (verified by BUILD_BUG_ON) so the union view in scx_sched
> works without function-pointer casts. Sub-sched support is tied to
> cid-form: validate_ops() rejects cpu-form sub-scheds and cpu-form roots
> that expose sub_attach / sub_detach.
>
> - cid-form kfuncs: scx_bpf_kick_cid, scx_bpf_cidperf_{cap,cur,set},
> scx_bpf_cid_curr, scx_bpf_task_cid, scx_bpf_this_cid,
> scx_bpf_nr_{cids,online_cids}, scx_bpf_cid_to_cpu, scx_bpf_cpu_to_cid.
> A cid-form program may not call cpu-only kfuncs (enforced at verifier
> load via scx_kfunc_context_filter); the reverse is intentionally
> permissive to ease migration.
>
> - scx_qmap port: scx_qmap is converted to cid-form. It uses the cmask-based
> idle picker, per-task cid-space cpus_allowed, and cid-form kfuncs
> throughout. Sub-sched dispatching via scx_bpf_sub_dispatch() continues to
> work.
>
> v3 re-tested on the 16-cpu QEMU: cid-form scx_qmap under stress-ng plus
> reload cycles, hotplug auto-restart, and sub-sched (root scx_qmap +
> cgroup-scoped scx_qmap child). Clean.
>
> Based on sched_ext/for-7.2 (4939721aad2e).
>
> 0001-sched_ext-Add-ext_types.h-for-early-subsystem-wide-d.patch
> 0002-sched_ext-Rename-ops_cpu_valid-to-scx_cpu_valid-and-.patch
> 0003-sched_ext-Move-scx_exit-scx_error-and-friends-to-ext.patch
> 0004-sched_ext-Shift-scx_kick_cpu-validity-check-to-scx_b.patch
> 0005-sched_ext-Relocate-cpu_acquire-cpu_release-to-end-of.patch
> 0006-sched_ext-Make-scx_enable-take-scx_enable_cmd.patch
> 0007-sched_ext-Add-topological-CPU-IDs-cids.patch
> 0008-sched_ext-Add-scx_bpf_cid_override-kfunc.patch
> 0009-tools-sched_ext-Add-struct_size-helpers-to-common.bp.patch
> 0010-sched_ext-Add-cmask-a-base-windowed-bitmap-over-cid-.patch
> 0011-sched_ext-Add-cid-form-kfunc-wrappers-alongside-cpu-.patch
> 0012-sched_ext-Add-bpf_sched_ext_ops_cid-struct_ops-type.patch
> 0013-sched_ext-Forbid-cpu-form-kfuncs-from-cid-form-sched.patch
> 0014-tools-sched_ext-scx_qmap-Restart-on-hotplug-instead-.patch
> 0015-tools-sched_ext-scx_qmap-Add-cmask-based-idle-tracki.patch
> 0016-tools-sched_ext-scx_qmap-Port-to-cid-form-struct_ops.patch
> 0017-sched_ext-Require-cid-form-struct_ops-for-sub-sched-.patch
>
> Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-cid-v3
>
> kernel/sched/build_policy.c | 3 +
> kernel/sched/ext.c | 651 ++++++++++++++++++++++++++----
> kernel/sched/ext_cid.c | 409 +++++++++++++++++++
> kernel/sched/ext_cid.h | 164 ++++++++
> kernel/sched/ext_idle.c | 8 +-
> kernel/sched/ext_internal.h | 205 +++++++---
> kernel/sched/ext_types.h | 104 +++++
> tools/sched_ext/include/scx/cid.bpf.h | 667 +++++++++++++++++++++++++++++++
> tools/sched_ext/include/scx/common.bpf.h | 23 ++
> tools/sched_ext/include/scx/compat.bpf.h | 24 ++
> tools/sched_ext/scx_qmap.bpf.c | 346 +++++++++-------
> tools/sched_ext/scx_qmap.c | 70 +++-
> tools/sched_ext/scx_qmap.h | 2 +-
> 13 files changed, 2391 insertions(+), 285 deletions(-)
>
> Thanks.
>
> --
> tejun
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
2026-04-28 20:35 [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
` (17 preceding siblings ...)
2026-04-29 12:49 ` [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Changwoo Min
@ 2026-04-29 13:29 ` Andrea Righi
2026-04-29 14:11 ` Andrea Righi
2026-04-29 17:06 ` Tejun Heo
18 siblings, 2 replies; 30+ messages in thread
From: Andrea Righi @ 2026-04-29 13:29 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel
Hi Tejun,
On Tue, Apr 28, 2026 at 10:35:28AM -1000, Tejun Heo wrote:
> Hello,
>
> v3 (all from the Sashiko AI review at
> https://sashiko.dev/#/patchset/20260424172721.3458520-1-tj%40kernel.org):
>
> - cid: drop leaked cpus_read_lock() on scx_cid_init() failure;
> BUILD_BUG_ON tightened to NR_CPUS<=8192 to match the BPF cmask
> helpers' CMASK_MAX_WORDS coverage.
> - bpf-struct-size: use offsetof() in struct_size() to match the
> kernel <linux/overflow.h> macro semantics (no inflation from
> trailing struct padding).
> - cmask: cmask_copy_from_kernel() validates src->base==0 via
> probe-read; nr_bits check is bit-level rather than rounded-up
> word-count.
> - cid-qmap-idle: qmap_init() refuses to load when scx_bpf_nr_cids()
> exceeds SCX_QMAP_MAX_CPUS; the task_ctx flex array would otherwise
> overflow into the next slab entry.
>
> v2: https://lore.kernel.org/r/20260424172721.3458520-1-tj@kernel.org
> v1: https://lore.kernel.org/r/20260421071945.3110084-1-tj@kernel.org
>
> This patchset introduces topological CPU IDs (cids) - dense,
> topology-ordered cpu identifiers - and an alternative cid-form struct_ops
> type that lets BPF schedulers operate in cid space directly.
>
> Key pieces:
>
> - cid space: scx_cid_init() walks nodes * LLCs * cores * threads and packs
> a dense cid mapping. The mapping can be overridden via
> scx_bpf_cid_override(). See "Topological CPU IDs" in ext_cid.h for the
> model.
>
> - cmask: a base-windowed bitmap over cid space. Kernel and BPF helpers with
> identical semantics. Used by scx_qmap for per-task affinity and idle-cid
> tracking; meant to be the substrate for sub-sched cid allocation.
>
> - bpf_sched_ext_ops_cid: a parallel struct_ops type whose callbacks take
> cids/cmasks instead of cpus/cpumasks. Kernel translates at the boundary
> via scx_cpu_arg() / scx_cpu_ret(); the two struct types share offsets up
> through @priv (verified by BUILD_BUG_ON) so the union view in scx_sched
> works without function-pointer casts. Sub-sched support is tied to
> cid-form: validate_ops() rejects cpu-form sub-scheds and cpu-form roots
> that expose sub_attach / sub_detach.
>
> - cid-form kfuncs: scx_bpf_kick_cid, scx_bpf_cidperf_{cap,cur,set},
> scx_bpf_cid_curr, scx_bpf_task_cid, scx_bpf_this_cid,
> scx_bpf_nr_{cids,online_cids}, scx_bpf_cid_to_cpu, scx_bpf_cpu_to_cid.
> A cid-form program may not call cpu-only kfuncs (enforced at verifier
> load via scx_kfunc_context_filter); the reverse is intentionally
> permissive to ease migration.
So, IIUC scx schedulers attached to bpf_sched_ext_ops_cid can't use the built-in
idle CPU selection kfuncs (ext_idle.c), right?
And that also means sub-sched support => no built-in idle CPU selection. That's
a bit unfortunate... I guess we could implement a similar logic in cid/cmask
space, maybe in BPF.
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 30+ messages in thread* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
2026-04-29 13:29 ` Andrea Righi
@ 2026-04-29 14:11 ` Andrea Righi
2026-04-29 17:06 ` Tejun Heo
1 sibling, 0 replies; 30+ messages in thread
From: Andrea Righi @ 2026-04-29 14:11 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel
On Wed, Apr 29, 2026 at 03:29:43PM +0200, Andrea Righi wrote:
> Hi Tejun,
>
> On Tue, Apr 28, 2026 at 10:35:28AM -1000, Tejun Heo wrote:
> > Hello,
> >
> > v3 (all from the Sashiko AI review at
> > https://sashiko.dev/#/patchset/20260424172721.3458520-1-tj%40kernel.org):
> >
> > - cid: drop leaked cpus_read_lock() on scx_cid_init() failure;
> > BUILD_BUG_ON tightened to NR_CPUS<=8192 to match the BPF cmask
> > helpers' CMASK_MAX_WORDS coverage.
> > - bpf-struct-size: use offsetof() in struct_size() to match the
> > kernel <linux/overflow.h> macro semantics (no inflation from
> > trailing struct padding).
> > - cmask: cmask_copy_from_kernel() validates src->base==0 via
> > probe-read; nr_bits check is bit-level rather than rounded-up
> > word-count.
> > - cid-qmap-idle: qmap_init() refuses to load when scx_bpf_nr_cids()
> > exceeds SCX_QMAP_MAX_CPUS; the task_ctx flex array would otherwise
> > overflow into the next slab entry.
> >
> > v2: https://lore.kernel.org/r/20260424172721.3458520-1-tj@kernel.org
> > v1: https://lore.kernel.org/r/20260421071945.3110084-1-tj@kernel.org
> >
> > This patchset introduces topological CPU IDs (cids) - dense,
> > topology-ordered cpu identifiers - and an alternative cid-form struct_ops
> > type that lets BPF schedulers operate in cid space directly.
> >
> > Key pieces:
> >
> > - cid space: scx_cid_init() walks nodes * LLCs * cores * threads and packs
> > a dense cid mapping. The mapping can be overridden via
> > scx_bpf_cid_override(). See "Topological CPU IDs" in ext_cid.h for the
> > model.
> >
> > - cmask: a base-windowed bitmap over cid space. Kernel and BPF helpers with
> > identical semantics. Used by scx_qmap for per-task affinity and idle-cid
> > tracking; meant to be the substrate for sub-sched cid allocation.
> >
> > - bpf_sched_ext_ops_cid: a parallel struct_ops type whose callbacks take
> > cids/cmasks instead of cpus/cpumasks. Kernel translates at the boundary
> > via scx_cpu_arg() / scx_cpu_ret(); the two struct types share offsets up
> > through @priv (verified by BUILD_BUG_ON) so the union view in scx_sched
> > works without function-pointer casts. Sub-sched support is tied to
> > cid-form: validate_ops() rejects cpu-form sub-scheds and cpu-form roots
> > that expose sub_attach / sub_detach.
> >
> > - cid-form kfuncs: scx_bpf_kick_cid, scx_bpf_cidperf_{cap,cur,set},
> > scx_bpf_cid_curr, scx_bpf_task_cid, scx_bpf_this_cid,
> > scx_bpf_nr_{cids,online_cids}, scx_bpf_cid_to_cpu, scx_bpf_cpu_to_cid.
> > A cid-form program may not call cpu-only kfuncs (enforced at verifier
> > load via scx_kfunc_context_filter); the reverse is intentionally
> > permissive to ease migration.
>
> So, IIUC scx schedulers attached to bpf_sched_ext_ops_cid can't use the built-in
> idle CPU selection kfuncs (ext_idle.c), right?
>
> And that also means sub-sched support => no built-in idle CPU selection. That's
> a bit unfortunate... I guess we could implement a similar logic in cid/cmask
> space, maybe in BPF.
And apart from this and the other comments in PATCH 8/17 and PATCH 16/17,
everything else looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCHSET v3 sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
2026-04-29 13:29 ` Andrea Righi
2026-04-29 14:11 ` Andrea Righi
@ 2026-04-29 17:06 ` Tejun Heo
1 sibling, 0 replies; 30+ messages in thread
From: Tejun Heo @ 2026-04-29 17:06 UTC (permalink / raw)
To: Andrea Righi
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel
Hello,
Yeah. Expanding the in-kernel idle selector to be sub-sched-aware is
hard - it has to respect each sub-sched's cid window, partition
constraints, etc. Cmask in arena makes the same logic a lot easier
to express on the BPF side, so I think the right path is building
that infra in BPF rather than extending ext_idle.c.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 30+ messages in thread