public inbox for linux-kernel@vger.kernel.org
* [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
@ 2026-04-21  7:19 Tejun Heo
  2026-04-21  7:19 ` [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
                   ` (16 more replies)
  0 siblings, 17 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Hello,

This patchset introduces topological CPU IDs (cids) - dense,
topology-ordered cpu identifiers - and an alternative cid-form struct_ops
type that lets BPF schedulers operate in cid space directly.

Key pieces:

- cid space: scx_cid_init() walks nodes * LLCs * cores * threads and packs
  a dense cid mapping. The mapping can be overridden via
  scx_bpf_cid_override(). See "Topological CPU IDs" in ext_cid.h for the
  model.

- cmask: a base-windowed bitmap over cid space. Kernel and BPF helpers with
  identical semantics. Used by scx_qmap for per-task affinity and idle-cid
  tracking; meant to be the substrate for sub-sched cid allocation.

- bpf_sched_ext_ops_cid: a parallel struct_ops type whose callbacks take
  cids/cmasks instead of cpus/cpumasks. Kernel translates at the boundary
  via scx_cpu_arg() / scx_cpu_ret(); the two struct types share offsets up
  through @priv (verified by BUILD_BUG_ON) so the union view in scx_sched
  works without function-pointer casts. Sub-sched support is tied to
  cid-form: validate_ops() rejects cpu-form sub-scheds and cpu-form roots
  that expose sub_attach / sub_detach.

- cid-form kfuncs: scx_bpf_kick_cid, scx_bpf_cidperf_{cap,cur,set},
  scx_bpf_cid_curr, scx_bpf_task_cid, scx_bpf_this_cid,
  scx_bpf_nr_{cids,online_cids}, scx_bpf_cid_to_cpu, scx_bpf_cpu_to_cid.
  A cid-form program may not call cpu-only kfuncs (enforced at verifier
  load via scx_kfunc_context_filter); the reverse is intentionally
  permissive to ease migration.

- scx_qmap port: scx_qmap is converted to cid-form. It uses the cmask-based
  idle picker, per-task cid-space cpus_allowed, and cid-form kfuncs
  throughout. Sub-sched dispatching via scx_bpf_sub_dispatch() continues to
  work.
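As a rough userspace illustration of the base-windowed cmask idea above - a bitmap that covers a window [base, base + NBITS) of the dense cid space rather than the whole space - here is a minimal C sketch. All names and the layout are hypothetical, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of a base-windowed bitmap over a dense id space:
 * instead of spanning [0, NR_IDS), the mask covers [base, base + NBITS),
 * so a contiguous sub-range of the id space needs only a few u64 words.
 */
#define CMASK_NBITS	128
#define CMASK_WORDS	(CMASK_NBITS / 64)

struct cmask {
	int32_t base;			/* first id covered by the window */
	uint64_t words[CMASK_WORDS];
};

static void cmask_init(struct cmask *m, int32_t base)
{
	m->base = base;
	memset(m->words, 0, sizeof(m->words));
}

/* Reject ids outside the window instead of corrupting neighboring state. */
static bool cmask_set(struct cmask *m, int32_t id)
{
	int32_t off = id - m->base;

	if (off < 0 || off >= CMASK_NBITS)
		return false;
	m->words[off / 64] |= 1ULL << (off % 64);
	return true;
}

static bool cmask_test(const struct cmask *m, int32_t id)
{
	int32_t off = id - m->base;

	if (off < 0 || off >= CMASK_NBITS)
		return false;
	return m->words[off / 64] & (1ULL << (off % 64));
}
```

Because cids are dense and topology-ordered, one such window can cover an entire LLC or node, which is what makes a small fixed-size mask usable where a full 4K-bit cpumask would be needed in cpu space.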

End-to-end testing on a 16-cpu QEMU (identity mapping) and an AMD Ryzen
9 3900X (non-identity cid mapping across CCXes) validated:
- cid <-> cpu table roundtrip
- SCX_DSQ_LOCAL_ON | cid routing (pinned workers land on the right cpu)
- scx_bpf_kick_cid translation (traced entry vs scx_kick_cpu exit)
- set_cmask fidelity (bits match cpu_to_cid(task->cpus_ptr))
- idle picker engagement under light load, backoff under saturation
- sub-sched qmap (root + 3 subs across cgroups)
- cpu-form regression (scx_simple, scx_flatcg still load and schedule)
- kfunc filter denial (cid-form calling scx_bpf_task_cpu rejected at load)

The patchset depends on a pre-existing bug fix that is being submitted
separately to for-7.1-fixes:

  "tools/sched_ext: scx_qmap: Silence task_ctx lookup miss"
  https://lore.kernel.org/r/59bc5171ee5aa02746c2f576d0f1e14f@kernel.org

The scx-cid-base branch listed below already carries that fix merged
through for-7.1-fixes; when for-7.1-fixes is merged back into for-7.2
the dependency resolves naturally.

Based on sched_ext/for-7.2 (12ff49d4e1d9) + for-7.1-fixes + the above fix
(scx-cid-base at 5755954f68ad).

 0001-sched_ext-Rename-ops_cpu_valid-to-scx_cpu_valid-and-.patch
 0002-sched_ext-Move-scx_exit-scx_error-and-friends-to-ext.patch
 0003-sched_ext-Shift-scx_kick_cpu-validity-check-to-scx_b.patch
 0004-sched_ext-Relocate-cpu_acquire-cpu_release-to-end-of.patch
 0005-sched_ext-Make-scx_enable-take-scx_enable_cmd.patch
 0006-sched_ext-Add-topological-CPU-IDs-cids.patch
 0007-sched_ext-Add-scx_bpf_cid_override-kfunc.patch
 0008-tools-sched_ext-Add-struct_size-helpers-to-common.bp.patch
 0009-sched_ext-Add-cmask-a-base-windowed-bitmap-over-cid-.patch
 0010-sched_ext-Add-cid-form-kfunc-wrappers-alongside-cpu-.patch
 0011-sched_ext-Add-bpf_sched_ext_ops_cid-struct_ops-type.patch
 0012-sched_ext-Forbid-cpu-form-kfuncs-from-cid-form-sched.patch
 0013-tools-sched_ext-scx_qmap-Restart-on-hotplug-instead-.patch
 0014-tools-sched_ext-scx_qmap-Add-cmask-based-idle-tracki.patch
 0015-tools-sched_ext-scx_qmap-Port-to-cid-form-struct_ops.patch
 0016-sched_ext-Require-cid-form-struct_ops-for-sub-sched-.patch

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-cid

 kernel/sched/build_policy.c              |   1 +
 kernel/sched/ext.c                       | 635 +++++++++++++++++++++++++----
 kernel/sched/ext_cid.c                   | 554 +++++++++++++++++++++++++++
 kernel/sched/ext_cid.h                   | 327 ++++++++++++++++
 kernel/sched/ext_idle.c                  |   8 +-
 kernel/sched/ext_internal.h              | 173 +++++++--
 tools/sched_ext/include/scx/cid.bpf.h    | 595 +++++++++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |  23 ++
 tools/sched_ext/include/scx/compat.bpf.h |  24 ++
 tools/sched_ext/scx_qmap.bpf.c           | 304 ++++++-------
 tools/sched_ext/scx_qmap.c               |  25 +-
 tools/sched_ext/scx_qmap.h               |   2 +-
 12 files changed, 2419 insertions(+), 252 deletions(-)

--
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 13:31   ` Cheng-Yang Chou
  2026-04-21  7:19 ` [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Rename the static ext.c helper and declare it in ext_internal.h so
ext_idle.c and the upcoming cid code can call it directly instead of
relying on build_policy.c textual inclusion.

Pure rename and visibility change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          | 22 +++++++++++-----------
 kernel/sched/ext_idle.c     |  6 +++---
 kernel/sched/ext_internal.h |  2 ++
 3 files changed, 16 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 0a53a0dd64bf..8c7450c5ebfa 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1055,7 +1055,7 @@ static inline bool __cpu_valid(s32 cpu)
 }
 
 /**
- * ops_cpu_valid - Verify a cpu number, to be used on ops input args
+ * scx_cpu_valid - Verify a cpu number, to be used on ops input args
  * @sch: scx_sched to abort on error
  * @cpu: cpu number which came from a BPF ops
  * @where: extra information reported on error
@@ -1064,7 +1064,7 @@ static inline bool __cpu_valid(s32 cpu)
  * Verify that it is in range and one of the possible cpus. If invalid, trigger
  * an ops error.
  */
-static bool ops_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
+bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
 {
 	if (__cpu_valid(cpu)) {
 		return true;
@@ -1677,7 +1677,7 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
 	if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
 		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
 
-		if (!ops_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
+		if (!scx_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
 			return find_global_dsq(sch, tcpu);
 
 		return &cpu_rq(cpu)->scx.local_dsq;
@@ -3259,7 +3259,7 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		this_rq()->scx.in_select_cpu = false;
 		p->scx.selected_cpu = cpu;
 		*ddsp_taskp = NULL;
-		if (ops_cpu_valid(sch, cpu, "from ops.select_cpu()"))
+		if (scx_cpu_valid(sch, cpu, "from ops.select_cpu()"))
 			return cpu;
 		else
 			return prev_cpu;
@@ -8678,7 +8678,7 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 	struct rq *this_rq;
 	unsigned long irq_flags;
 
-	if (!ops_cpu_valid(sch, cpu, NULL))
+	if (!scx_cpu_valid(sch, cpu, NULL))
 		return;
 
 	local_irq_save(irq_flags);
@@ -8774,7 +8774,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
 	} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
 		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
 
-		if (ops_cpu_valid(sch, cpu, NULL)) {
+		if (scx_cpu_valid(sch, cpu, NULL)) {
 			ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr);
 			goto out;
 		}
@@ -9163,7 +9163,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux)
 	guard(rcu)();
 
 	sch = scx_prog_sched(aux);
-	if (likely(sch) && ops_cpu_valid(sch, cpu, NULL))
+	if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
 		return arch_scale_cpu_capacity(cpu);
 	else
 		return SCX_CPUPERF_ONE;
@@ -9191,7 +9191,7 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux)
 	guard(rcu)();
 
 	sch = scx_prog_sched(aux);
-	if (likely(sch) && ops_cpu_valid(sch, cpu, NULL))
+	if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
 		return arch_scale_freq_capacity(cpu);
 	else
 		return SCX_CPUPERF_ONE;
@@ -9227,7 +9227,7 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_au
 		return;
 	}
 
-	if (ops_cpu_valid(sch, cpu, NULL)) {
+	if (scx_cpu_valid(sch, cpu, NULL)) {
 		struct rq *rq = cpu_rq(cpu), *locked_rq = scx_locked_rq();
 		struct rq_flags rf;
 
@@ -9340,7 +9340,7 @@ __bpf_kfunc struct rq *scx_bpf_cpu_rq(s32 cpu, const struct bpf_prog_aux *aux)
 	if (unlikely(!sch))
 		return NULL;
 
-	if (!ops_cpu_valid(sch, cpu, NULL))
+	if (!scx_cpu_valid(sch, cpu, NULL))
 		return NULL;
 
 	if (!sch->warned_deprecated_rq) {
@@ -9397,7 +9397,7 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
 	if (unlikely(!sch))
 		return NULL;
 
-	if (!ops_cpu_valid(sch, cpu, NULL))
+	if (!scx_cpu_valid(sch, cpu, NULL))
 		return NULL;
 
 	return rcu_dereference(cpu_rq(cpu)->curr);
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index c43d62d90e40..11d11ea6ca6b 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -917,7 +917,7 @@ static s32 select_cpu_from_kfunc(struct scx_sched *sch, struct task_struct *p,
 	bool we_locked = false;
 	s32 cpu;
 
-	if (!ops_cpu_valid(sch, prev_cpu, NULL))
+	if (!scx_cpu_valid(sch, prev_cpu, NULL))
 		return -EINVAL;
 
 	if (!check_builtin_idle_enabled(sch))
@@ -975,7 +975,7 @@ __bpf_kfunc s32 scx_bpf_cpu_node(s32 cpu, const struct bpf_prog_aux *aux)
 	guard(rcu)();
 
 	sch = scx_prog_sched(aux);
-	if (unlikely(!sch) || !ops_cpu_valid(sch, cpu, NULL))
+	if (unlikely(!sch) || !scx_cpu_valid(sch, cpu, NULL))
 		return NUMA_NO_NODE;
 	return cpu_to_node(cpu);
 }
@@ -1257,7 +1257,7 @@ __bpf_kfunc bool scx_bpf_test_and_clear_cpu_idle(s32 cpu, const struct bpf_prog_
 	if (!check_builtin_idle_enabled(sch))
 		return false;
 
-	if (!ops_cpu_valid(sch, cpu, NULL))
+	if (!scx_cpu_valid(sch, cpu, NULL))
 		return false;
 
 	return scx_idle_test_and_clear_cpu(cpu);
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 4a7ffc7f55d2..1345ccc01026 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1382,6 +1382,8 @@ DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
 
 int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
 
+bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
+
 /*
  * Return the rq currently locked from an scx callback, or NULL if no rq is
  * locked.
-- 
2.53.0



* [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
  2026-04-21  7:19 ` [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 13:36   ` Cheng-Yang Chou
  2026-04-21  7:19 ` [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Things shared across multiple .c files belong in a header. scx_exit() /
scx_error() (and their scx_vexit() / scx_verror() siblings) are already
called from ext_idle.c and the upcoming ext_cid.c, and it was only
build_policy.c's textual inclusion of ext.c that made the references
resolve. Move the whole family to ext_internal.h.

Pure visibility change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          | 13 ++++---------
 kernel/sched/ext_internal.h |  8 ++++++++
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8c7450c5ebfa..5571f5995dd8 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -235,12 +235,10 @@ static void run_deferred(struct rq *rq);
 static bool task_dead_and_done(struct task_struct *p);
 static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
 static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
-static bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
-		      s64 exit_code, const char *fmt, va_list args);
 
-static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
-				    enum scx_exit_kind kind, s64 exit_code,
-				    const char *fmt, ...)
+__printf(4, 5) bool scx_exit(struct scx_sched *sch,
+			     enum scx_exit_kind kind, s64 exit_code,
+			     const char *fmt, ...)
 {
 	va_list args;
 	bool ret;
@@ -252,9 +250,6 @@ static __printf(4, 5) bool scx_exit(struct scx_sched *sch,
 	return ret;
 }
 
-#define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
-#define scx_verror(sch, fmt, args)	scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
-
 #define SCX_HAS_OP(sch, op)	test_bit(SCX_OP_IDX(op), (sch)->has_op)
 
 static long jiffies_delta_msecs(unsigned long at, unsigned long now)
@@ -6349,7 +6344,7 @@ static void scx_disable_irq_workfn(struct irq_work *irq_work)
 	kthread_queue_work(sch->helper, &sch->disable_work);
 }
 
-static bool scx_vexit(struct scx_sched *sch,
+bool scx_vexit(struct scx_sched *sch,
 		      enum scx_exit_kind kind, s64 exit_code,
 		      const char *fmt, va_list args)
 {
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 1345ccc01026..350b84876b2a 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -1384,6 +1384,14 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
 
 bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
 
+bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind, s64 exit_code,
+	       const char *fmt, va_list args);
+__printf(4, 5) bool scx_exit(struct scx_sched *sch, enum scx_exit_kind kind,
+			     s64 exit_code, const char *fmt, ...);
+
+#define scx_verror(sch, fmt, args)	scx_vexit((sch), SCX_EXIT_ERROR, 0, fmt, args)
+#define scx_error(sch, fmt, args...)	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
+
 /*
  * Return the rq currently locked from an scx callback, or NULL if no rq is
  * locked.
-- 
2.53.0



* [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu()
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
  2026-04-21  7:19 ` [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
  2026-04-21  7:19 ` [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 13:49   ` Cheng-Yang Chou
  2026-04-21  7:19 ` [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Callers that already know the cpu is valid shouldn't have to pay for a
redundant check. scx_kick_cpu() is called from the in-kernel balance loop
break-out path with the current cpu (trivially valid) and from
scx_bpf_kick_cpu() with a BPF-supplied cpu that does need validation. Move
the check out of scx_kick_cpu() into scx_bpf_kick_cpu() so the backend is
reusable by callers that have already validated.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 5571f5995dd8..9e802d73f205 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8673,9 +8673,6 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 	struct rq *this_rq;
 	unsigned long irq_flags;
 
-	if (!scx_cpu_valid(sch, cpu, NULL))
-		return;
-
 	local_irq_save(irq_flags);
 
 	this_rq = this_rq();
@@ -8738,7 +8735,7 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
 
 	guard(rcu)();
 	sch = scx_prog_sched(aux);
-	if (likely(sch))
+	if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
 		scx_kick_cpu(sch, cpu, flags);
 }
 
-- 
2.53.0



* [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (2 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 13:58   ` Cheng-Yang Chou
  2026-04-21  7:19 ` [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

cpu_acquire and cpu_release are deprecated and slated for removal. Move
their declarations to the end of struct sched_ext_ops so an upcoming
cid-form struct (sched_ext_ops_cid) can omit them entirely without
disturbing the offsets of the shared fields.

Switch the two SCX_HAS_OP() callers for these ops to direct field checks
since the relocated ops sit outside the SCX_OPI_END range covered by the
has_op bitmap.

scx_kf_allow_flags[] auto-sizes to the highest used SCX_OP_IDX, so
SCX_OP_IDX(cpu_release) moving to a higher index just enlarges the
sparse array; the lookup logic is unchanged.
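The shared-prefix layout this enables can be illustrated with a standalone C sketch (struct and field names hypothetical): two struct_ops variants that agree on every offset up to the last shared field can be viewed through a union, with the deprecated callbacks living past the prefix in only one of them.

```c
#include <stddef.h>

/*
 * Hypothetical illustration of the shared-prefix trick: keep all common
 * fields first, put variant-only (here: deprecated) callbacks after the
 * end-of-prefix marker so the other variant can simply omit them.
 */
struct ops_cpu {
	void (*dispatch)(int cpu);
	void *priv;			/* end of the shared prefix */
	void (*cpu_release)(int cpu);	/* variant-only, past the prefix */
};

struct ops_cid {
	void (*dispatch)(int cid);
	void *priv;			/* end of the shared prefix */
};

/* Compile-time offset check, as a kernel BUILD_BUG_ON would perform. */
_Static_assert(offsetof(struct ops_cpu, dispatch) ==
	       offsetof(struct ops_cid, dispatch), "prefix mismatch");
_Static_assert(offsetof(struct ops_cpu, priv) ==
	       offsetof(struct ops_cid, priv), "prefix mismatch");
```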

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c          |  4 +--
 kernel/sched/ext_internal.h | 54 ++++++++++++++++++++++---------------
 2 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 9e802d73f205..74e4271e44e9 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -2813,7 +2813,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 		 * core. This callback complements ->cpu_release(), which is
 		 * emitted in switch_class().
 		 */
-		if (SCX_HAS_OP(sch, cpu_acquire))
+		if (sch->ops.cpu_acquire)
 			SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL);
 		rq->scx.cpu_released = false;
 	}
@@ -2959,7 +2959,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
 	 *  next time that balance_one() is invoked.
 	 */
 	if (!rq->scx.cpu_released) {
-		if (SCX_HAS_OP(sch, cpu_release)) {
+		if (sch->ops.cpu_release) {
 			struct scx_cpu_release_args args = {
 				.reason = preempt_reason_from_class(next_class),
 				.task = next,
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 350b84876b2a..1d73fcc19aaf 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -555,28 +555,6 @@ struct sched_ext_ops {
 	 */
 	void (*update_idle)(s32 cpu, bool idle);
 
-	/**
-	 * @cpu_acquire: A CPU is becoming available to the BPF scheduler
-	 * @cpu: The CPU being acquired by the BPF scheduler.
-	 * @args: Acquire arguments, see the struct definition.
-	 *
-	 * A CPU that was previously released from the BPF scheduler is now once
-	 * again under its control.
-	 */
-	void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
-
-	/**
-	 * @cpu_release: A CPU is taken away from the BPF scheduler
-	 * @cpu: The CPU being released by the BPF scheduler.
-	 * @args: Release arguments, see the struct definition.
-	 *
-	 * The specified CPU is no longer under the control of the BPF
-	 * scheduler. This could be because it was preempted by a higher
-	 * priority sched_class, though there may be other reasons as well. The
-	 * caller should consult @args->reason to determine the cause.
-	 */
-	void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
-
 	/**
 	 * @init_task: Initialize a task to run in a BPF scheduler
 	 * @p: task to initialize for BPF scheduling
@@ -867,6 +845,38 @@ struct sched_ext_ops {
 
 	/* internal use only, must be NULL */
 	void __rcu *priv;
+
+	/*
+	 * Deprecated callbacks. Kept at the end of the struct so the cid-form
+	 * struct (sched_ext_ops_cid) can omit them without affecting the
+	 * shared field offsets. Use SCX_ENQ_IMMED instead. Sitting past
+	 * SCX_OPI_END means has_op doesn't cover them, so SCX_HAS_OP() cannot
+	 * be used; callers must test sch->ops.cpu_acquire / cpu_release
+	 * directly.
+	 */
+
+	/**
+	 * @cpu_acquire: A CPU is becoming available to the BPF scheduler
+	 * @cpu: The CPU being acquired by the BPF scheduler.
+	 * @args: Acquire arguments, see the struct definition.
+	 *
+	 * A CPU that was previously released from the BPF scheduler is now once
+	 * again under its control. Deprecated; use SCX_ENQ_IMMED instead.
+	 */
+	void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
+
+	/**
+	 * @cpu_release: A CPU is taken away from the BPF scheduler
+	 * @cpu: The CPU being released by the BPF scheduler.
+	 * @args: Release arguments, see the struct definition.
+	 *
+	 * The specified CPU is no longer under the control of the BPF
+	 * scheduler. This could be because it was preempted by a higher
+	 * priority sched_class, though there may be other reasons as well. The
+	 * caller should consult @args->reason to determine the cause.
+	 * Deprecated; use SCX_ENQ_IMMED instead.
+	 */
+	void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
 };
 
 enum scx_opi {
-- 
2.53.0



* [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (3 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 14:25   ` Cheng-Yang Chou
  2026-04-21  7:19 ` [PATCH 06/16] sched_ext: Add topological CPU IDs (cids) Tejun Heo
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Pass struct scx_enable_cmd to scx_enable() rather than unpacking @ops
at every call site and re-packing into a fresh cmd inside. bpf_scx_reg()
now builds the cmd on its stack and hands it in; scx_enable() just
wires up the kthread work and waits.

Relocate struct scx_enable_cmd above scx_alloc_and_add_sched() so
upcoming patches that also want the cmd can see it.

No behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 74e4271e44e9..62aab432dbf4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6424,6 +6424,19 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
 	return pnode;
 }
 
+/*
+ * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
+ * starvation. During the READY -> ENABLED task switching loop, the calling
+ * thread's sched_class gets switched from fair to ext. As fair has higher
+ * priority than ext, the calling thread can be indefinitely starved under
+ * fair-class saturation, leading to a system hang.
+ */
+struct scx_enable_cmd {
+	struct kthread_work	work;
+	struct sched_ext_ops	*ops;
+	int			ret;
+};
+
 /*
  * Allocate and initialize a new scx_sched. @cgrp's reference is always
  * consumed whether the function succeeds or fails.
@@ -6655,19 +6668,6 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 	return 0;
 }
 
-/*
- * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
- * starvation. During the READY -> ENABLED task switching loop, the calling
- * thread's sched_class gets switched from fair to ext. As fair has higher
- * priority than ext, the calling thread can be indefinitely starved under
- * fair-class saturation, leading to a system hang.
- */
-struct scx_enable_cmd {
-	struct kthread_work	work;
-	struct sched_ext_ops	*ops;
-	int			ret;
-};
-
 static void scx_root_enable_workfn(struct kthread_work *work)
 {
 	struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
@@ -7243,11 +7243,10 @@ static s32 __init scx_cgroup_lifetime_notifier_init(void)
 core_initcall(scx_cgroup_lifetime_notifier_init);
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
-static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
+static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
 {
 	static struct kthread_worker *helper;
 	static DEFINE_MUTEX(helper_mutex);
-	struct scx_enable_cmd cmd;
 
 	if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
 			   cpu_possible_mask)) {
@@ -7271,16 +7270,15 @@ static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
 	}
 
 #ifdef CONFIG_EXT_SUB_SCHED
-	if (ops->sub_cgroup_id > 1)
-		kthread_init_work(&cmd.work, scx_sub_enable_workfn);
+	if (cmd->ops->sub_cgroup_id > 1)
+		kthread_init_work(&cmd->work, scx_sub_enable_workfn);
 	else
 #endif	/* CONFIG_EXT_SUB_SCHED */
-		kthread_init_work(&cmd.work, scx_root_enable_workfn);
-	cmd.ops = ops;
+		kthread_init_work(&cmd->work, scx_root_enable_workfn);
 
-	kthread_queue_work(READ_ONCE(helper), &cmd.work);
-	kthread_flush_work(&cmd.work);
-	return cmd.ret;
+	kthread_queue_work(READ_ONCE(helper), &cmd->work);
+	kthread_flush_work(&cmd->work);
+	return cmd->ret;
 }
 
 
@@ -7452,7 +7450,9 @@ static int bpf_scx_check_member(const struct btf_type *t,
 
 static int bpf_scx_reg(void *kdata, struct bpf_link *link)
 {
-	return scx_enable(kdata, link);
+	struct scx_enable_cmd cmd = { .ops = kdata };
+
+	return scx_enable(&cmd, link);
 }
 
 static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
-- 
2.53.0



* [PATCH 06/16] sched_ext: Add topological CPU IDs (cids)
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (4 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 17:15   ` [PATCH v2 sched_ext/for-7.2] " Tejun Heo
  2026-04-21  7:19 ` [PATCH 07/16] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Raw cpu numbers are clumsy for sharding and cross-sched communication,
especially from BPF. The space is sparse, numerical closeness doesn't
track topological closeness (x86 hyperthreading often scatters SMT
siblings), and a range of cpu ids doesn't describe anything meaningful.
Sub-sched support makes this acute: cpu allocation, revocation, and
state constantly flow across sub-scheds. Passing whole cpumasks scales
poorly (every op scans 4K bits) and cpumasks are awkward in BPF.

cids assign every cpu a dense, topology-ordered id. CPUs sharing a core,
LLC, or NUMA node occupy contiguous cid ranges, so a topology unit
becomes a (start, length) slice. Communication passes slices; BPF can
process a u64 word of cids at a time.

Build the mapping once at root enable by walking online cpus node -> LLC
-> core. Possible-but-not-online cpus tail the space with no-topo cids.
Expose kfuncs to map cpu <-> cid in either direction and to query each
cid's topology metadata.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/build_policy.c              |   1 +
 kernel/sched/ext.c                       |  17 ++
 kernel/sched/ext_cid.c                   | 301 +++++++++++++++++++++++
 kernel/sched/ext_cid.h                   | 147 +++++++++++
 tools/sched_ext/include/scx/common.bpf.h |   3 +
 5 files changed, 469 insertions(+)
 create mode 100644 kernel/sched/ext_cid.c
 create mode 100644 kernel/sched/ext_cid.h

diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index 755883faf751..0386f12683c8 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -60,6 +60,7 @@
 #ifdef CONFIG_SCHED_CLASS_EXT
 # include "ext_internal.h"
 # include "ext.c"
+# include "ext_cid.c"
 # include "ext_idle.c"
 #endif
 
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 62aab432dbf4..ac0fa21cab26 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7,6 +7,7 @@
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
  */
 #include <linux/btf_ids.h>
+#include "ext_cid.h"
 #include "ext_idle.h"
 
 static DEFINE_RAW_SPINLOCK(scx_sched_lock);
@@ -6726,6 +6727,16 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	 */
 	cpus_read_lock();
 
+	/*
+	 * Build the cid mapping before publishing scx_root. The cid kfuncs
+	 * dereference the cid arrays unconditionally once scx_prog_sched()
+	 * returns non-NULL; the rcu_assign_pointer() below pairs with their
+	 * rcu_dereference() to make the populated arrays visible.
+	 */
+	ret = scx_cid_init(sch);
+	if (ret)
+		goto err_disable;
+
 	/*
 	 * Make the scheduler instance visible. Must be inside cpus_read_lock().
 	 * See handle_hotplug().
@@ -9774,6 +9785,12 @@ static int __init scx_init(void)
 		return ret;
 	}
 
+	ret = scx_cid_kfunc_init();
+	if (ret) {
+		pr_err("sched_ext: Failed to register cid kfuncs (%d)\n", ret);
+		return ret;
+	}
+
 	ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
 	if (ret) {
 		pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
new file mode 100644
index 000000000000..55467ca69800
--- /dev/null
+++ b/kernel/sched/ext_cid.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/cacheinfo.h>
+
+#include "ext_cid.h"
+
+s16 *scx_cid_to_cpu_tbl;
+s16 *scx_cpu_to_cid_tbl;
+struct scx_cid_topo *scx_cid_topo;
+
+#define SCX_CID_TOPO_NEG	(struct scx_cid_topo) {				\
+	.core_cid = -1, .core_idx = -1, .llc_cid = -1, .llc_idx = -1,		\
+	.node_cid = -1, .node_idx = -1,						\
+}
+
+/*
+ * Return @cpu's LLC shared_cpu_map. If cacheinfo isn't populated (offline or
+ * !present), record @cpu in @fallbacks and return its node mask instead - the
+ * worst that can happen is that the cpu's LLC becomes coarser than reality.
+ */
+static const struct cpumask *cpu_llc_mask(int cpu, struct cpumask *fallbacks)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+
+	if (!ci || !ci->info_list || !ci->num_leaves) {
+		cpumask_set_cpu(cpu, fallbacks);
+		return cpumask_of_node(cpu_to_node(cpu));
+	}
+	return &ci->info_list[ci->num_leaves - 1].shared_cpu_map;
+}
+
+/*
+ * The cid arrays are sized by num_possible_cpus() / nr_cpu_ids which are fixed
+ * at boot, so allocate once on first enable and never free. Callers can
+ * dereference these unconditionally as long as scx_root is non-NULL
+ * (rcu_assign_pointer publishes scx_root after scx_cid_init() returns - see
+ * scx_root_enable_workfn()).
+ */
+static s32 scx_cid_arrays_alloc(void)
+{
+	u32 npossible = num_possible_cpus();
+	s16 *cid_to_cpu, *cpu_to_cid;
+	struct scx_cid_topo *cid_topo;
+
+	if (scx_cid_to_cpu_tbl)
+		return 0;
+
+	cid_to_cpu = kcalloc(npossible, sizeof(*scx_cid_to_cpu_tbl), GFP_KERNEL);
+	cpu_to_cid = kcalloc(nr_cpu_ids, sizeof(*scx_cpu_to_cid_tbl), GFP_KERNEL);
+	cid_topo = kmalloc_array(npossible, sizeof(*scx_cid_topo), GFP_KERNEL);
+
+	if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+		kfree(cid_to_cpu);
+		kfree(cpu_to_cid);
+		kfree(cid_topo);
+		return -ENOMEM;
+	}
+
+	scx_cid_to_cpu_tbl = cid_to_cpu;
+	scx_cpu_to_cid_tbl = cpu_to_cid;
+	scx_cid_topo = cid_topo;
+	return 0;
+}
+
+/**
+ * scx_cid_init - build the cid mapping
+ * @sch: the scx_sched being initialized; used as the scx_error() target
+ *
+ * See "Topological CPU IDs" in ext_cid.h for the model. Walk online cpus by
+ * intersection at each level (parent_scratch & this_level_mask), which keeps
+ * containment correct by construction and naturally splits a physical LLC
+ * straddling two NUMA nodes into two LLC units. The caller must hold
+ * cpus_read_lock().
+ */
+s32 scx_cid_init(struct scx_sched *sch)
+{
+	cpumask_var_t to_walk __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t node_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t llc_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t core_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t llc_fallback __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t online_no_topo __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	u32 next_cid = 0;
+	s32 next_node_idx = 0, next_llc_idx = 0, next_core_idx = 0;
+	s32 cpu, ret;
+
+	/* s16 keeps the per-cid arrays compact; widen if NR_CPUS ever exceeds S16_MAX */
+	BUILD_BUG_ON(NR_CPUS > S16_MAX);
+
+	lockdep_assert_cpus_held();
+
+	ret = scx_cid_arrays_alloc();
+	if (ret)
+		return ret;
+
+	if (!zalloc_cpumask_var(&to_walk, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&node_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&llc_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&core_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&llc_fallback, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&online_no_topo, GFP_KERNEL))
+		return -ENOMEM;
+
+	/* -1 sentinels for sparse-possible cpu id holes (0 is a valid cid) */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+		scx_cpu_to_cid_tbl[cpu] = -1;
+
+	cpumask_copy(to_walk, cpu_online_mask);
+
+	while (!cpumask_empty(to_walk)) {
+		s32 next_cpu = cpumask_first(to_walk);
+		s32 nid = cpu_to_node(next_cpu);
+		s32 node_cid = next_cid;
+		s32 node_idx;
+
+		/*
+		 * No NUMA info: skip and let the tail loop assign a no-topo
+		 * cid. cpumask_of_node(-1) is undefined.
+		 */
+		if (nid < 0) {
+			cpumask_clear_cpu(next_cpu, to_walk);
+			continue;
+		}
+
+		node_idx = next_node_idx++;
+
+		/* node_scratch = to_walk & this node */
+		cpumask_and(node_scratch, to_walk, cpumask_of_node(nid));
+		if (WARN_ON_ONCE(!cpumask_test_cpu(next_cpu, node_scratch)))
+			return -EINVAL;
+
+		while (!cpumask_empty(node_scratch)) {
+			s32 ncpu = cpumask_first(node_scratch);
+			const struct cpumask *llc_mask = cpu_llc_mask(ncpu, llc_fallback);
+			s32 llc_cid = next_cid;
+			s32 llc_idx = next_llc_idx++;
+
+			/* llc_scratch = node_scratch & this llc */
+			cpumask_and(llc_scratch, node_scratch, llc_mask);
+			if (WARN_ON_ONCE(!cpumask_test_cpu(ncpu, llc_scratch)))
+				return -EINVAL;
+
+			while (!cpumask_empty(llc_scratch)) {
+				s32 lcpu = cpumask_first(llc_scratch);
+				const struct cpumask *sib = topology_sibling_cpumask(lcpu);
+				s32 core_cid = next_cid;
+				s32 core_idx = next_core_idx++;
+				s32 ccpu;
+
+				/* core_scratch = llc_scratch & this core */
+				cpumask_and(core_scratch, llc_scratch, sib);
+				if (WARN_ON_ONCE(!cpumask_test_cpu(lcpu, core_scratch)))
+					return -EINVAL;
+
+				for_each_cpu(ccpu, core_scratch) {
+					s32 cid = next_cid++;
+
+					scx_cid_to_cpu_tbl[cid] = ccpu;
+					scx_cpu_to_cid_tbl[ccpu] = cid;
+					scx_cid_topo[cid] = (struct scx_cid_topo){
+						.core_cid = core_cid,
+						.core_idx = core_idx,
+						.llc_cid = llc_cid,
+						.llc_idx = llc_idx,
+						.node_cid = node_cid,
+						.node_idx = node_idx,
+					};
+
+					cpumask_clear_cpu(ccpu, llc_scratch);
+					cpumask_clear_cpu(ccpu, node_scratch);
+					cpumask_clear_cpu(ccpu, to_walk);
+				}
+			}
+		}
+	}
+
+	/*
+	 * No-topo section: any possible cpu without a cid - normally just the
+	 * not-online ones. Collect any currently-online cpus that land here in
+	 * @online_no_topo so we can warn about them at the end.
+	 */
+	for_each_cpu(cpu, cpu_possible_mask) {
+		s32 cid;
+
+		if (__scx_cpu_to_cid(cpu) != -1)
+			continue;
+		if (cpu_online(cpu))
+			cpumask_set_cpu(cpu, online_no_topo);
+
+		cid = next_cid++;
+		scx_cid_to_cpu_tbl[cid] = cpu;
+		scx_cpu_to_cid_tbl[cpu] = cid;
+		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+	}
+
+	if (!cpumask_empty(llc_fallback))
+		pr_warn("scx_cid: cpus without cacheinfo, using node mask as llc: %*pbl\n",
+			cpumask_pr_args(llc_fallback));
+	if (!cpumask_empty(online_no_topo))
+		pr_warn("scx_cid: online cpus with no usable topology: %*pbl\n",
+			cpumask_pr_args(online_no_topo));
+
+	return 0;
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_cid_to_cpu - Return the raw CPU id for @cid
+ * @cid: cid to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the raw CPU id for @cid. Trigger scx_error() and return -EINVAL if
+ * @cid is invalid. The cid<->cpu mapping is static for the lifetime of the
+ * loaded scheduler, so the BPF side can cache the result to avoid repeated
+ * kfunc invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cid_to_cpu(s32 cid, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return -EINVAL;
+	return scx_cid_to_cpu(sch, cid);
+}
+
+/**
+ * scx_bpf_cpu_to_cid - Return the cid for @cpu
+ * @cpu: cpu to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the cid for @cpu. Trigger scx_error() and return -EINVAL if @cpu is
+ * invalid. The cid<->cpu mapping is static for the lifetime of the loaded
+ * scheduler, so the BPF side can cache the result to avoid repeated kfunc
+ * invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cpu_to_cid(s32 cpu, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return -EINVAL;
+	return scx_cpu_to_cid(sch, cpu);
+}
+
+/**
+ * scx_bpf_cid_topo - Copy out per-cid topology info
+ * @cid: cid to look up
+ * @out__uninit: where to copy the topology info; fully written by this call
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Fill @out__uninit with the topology info for @cid. Trigger scx_error() if
+ * @cid is out of range. If @cid is valid but in the no-topo section, all fields
+ * are set to -1.
+ */
+__bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
+				  const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch) || !cid_valid(sch, cid)) {
+		*out__uninit = SCX_CID_TOPO_NEG;
+		return;
+	}
+
+	*out__uninit = scx_cid_topo[cid];
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_cid)
+BTF_ID_FLAGS(func, scx_bpf_cid_to_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_cid, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cid_topo, KF_IMPLICIT_ARGS)
+BTF_KFUNCS_END(scx_kfunc_ids_cid)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
+	.owner	= THIS_MODULE,
+	.set	= &scx_kfunc_ids_cid,
+};
+
+int scx_cid_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
+		register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
+		register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
+}
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
new file mode 100644
index 000000000000..dded0a540a26
--- /dev/null
+++ b/kernel/sched/ext_cid.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Topological CPU IDs (cids)
+ * --------------------------
+ *
+ * Raw cpu numbers are clumsy for sharding work and communication across
+ * topology units, especially from BPF: the space can be sparse, numerical
+ * closeness doesn't imply topological closeness (x86 hyperthreading often puts
+ * SMT siblings far apart), and a range of cpu ids doesn't mean anything.
+ * Sub-scheds make this acute - cpu allocation, revocation and other state are
+ * constantly communicated across sub-scheds, and passing whole cpumasks scales
+ * poorly with cpu count. cpumasks are also awkward in BPF: a variable-length
+ * kernel type sized for the maximum NR_CPUS (4k), with verbose helper sequences
+ * for every op.
+ *
+ * cids give every cpu a dense, topology-ordered id. CPUs sharing a core, LLC or
+ * NUMA node get contiguous cid ranges, so a topology unit becomes a (start,
+ * length) slice of cid space. Communication can pass a slice instead of a
+ * cpumask, and BPF code can process, for example, a u64 word's worth of cids at
+ * a time.
+ *
+ * The mapping is built once at root scheduler enable time by walking the
+ * topology of online cpus only. Going by online cpus is out of necessity:
+ * depending on the arch, topology info isn't reliably available for offline
+ * cpus. The expected usage model is restarting the scheduler on hotplug events
+ * so the mapping is rebuilt against the new online set. A scheduler that wants
+ * to handle hotplug without a restart can provide its own cid and shard mapping
+ * through the override interface.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_CID_H
+#define _KERNEL_SCHED_EXT_CID_H
+
+struct scx_sched;
+
+/*
+ * Per-cid topology info. For each topology level (core, LLC, node), records the
+ * first cid in the unit and its global index. Global indices are consecutive
+ * integers assigned in cid-walk order, so e.g. core_idx ranges over [0,
+ * nr_cores_at_init) with no gaps. No-topo cids have all fields set to -1.
+ *
+ * @core_cid:	first cid of this cid's core (smt-sibling group)
+ * @core_idx:	global index of that core, in [0, nr_cores_at_init)
+ * @llc_cid:	first cid of this cid's LLC
+ * @llc_idx:	global index of that LLC, in [0, nr_llcs_at_init)
+ * @node_cid:	first cid of this cid's NUMA node
+ * @node_idx:	global index of that node, in [0, nr_nodes_at_init)
+ */
+struct scx_cid_topo {
+	s32 core_cid;
+	s32 core_idx;
+	s32 llc_cid;
+	s32 llc_idx;
+	s32 node_cid;
+	s32 node_idx;
+};
+
+/*
+ * Cid space (total is always num_possible_cpus()) is laid out with
+ * topology-annotated cids first, then no-topo cids at the tail. The
+ * topology-annotated block covers the cpus that were online when scx_cid_init()
+ * ran and remains valid even after those cpus go offline. The tail block covers
+ * possible-but-not-online cpus and carries all-(-1) topo info (see
+ * scx_cid_topo); callers detect it via the -1 sentinels.
+ */
+extern s16 *scx_cid_to_cpu_tbl;
+extern s16 *scx_cpu_to_cid_tbl;
+extern struct scx_cid_topo *scx_cid_topo;
+
+s32 scx_cid_init(struct scx_sched *sch);
+int scx_cid_kfunc_init(void);
+
+/**
+ * cid_valid - Validate a cid value received from a BPF ops callback
+ * @sch: scx_sched to abort on error
+ * @cid: cid to validate
+ *
+ * Return true if @cid is in [0, num_possible_cpus()). On failure, trigger
+ * scx_error() and return false.
+ */
+static inline bool cid_valid(struct scx_sched *sch, s32 cid)
+{
+	if (likely(cid >= 0 && cid < num_possible_cpus()))
+		return true;
+	scx_error(sch, "invalid cid %d", cid);
+	return false;
+}
+
+/**
+ * __scx_cid_to_cpu - Unchecked cid->cpu table lookup
+ * @cid: cid to look up. Must be in [0, num_possible_cpus()).
+ *
+ * Intended for callsites that have already validated @cid (or otherwise know
+ * it's valid).
+ */
+static inline s32 __scx_cid_to_cpu(s32 cid)
+{
+	return scx_cid_to_cpu_tbl[cid];
+}
+
+/**
+ * __scx_cpu_to_cid - Unchecked cpu->cid table lookup
+ * @cpu: cpu to look up. Must be a valid possible cpu id.
+ *
+ * Intended for callsites that have already validated @cpu (or know it must be
+ * valid by construction, e.g. task_cpu() or smp_processor_id()).
+ */
+static inline s32 __scx_cpu_to_cid(s32 cpu)
+{
+	return scx_cpu_to_cid_tbl[cpu];
+}
+
+/**
+ * scx_cid_to_cpu - Translate @cid to its cpu
+ * @sch: scx_sched for error reporting
+ * @cid: cid to look up
+ *
+ * Return the cpu for @cid or a negative errno on failure. Invalid cid triggers
+ * scx_error() on @sch. The cid arrays are allocated on first scheduler enable
+ * and never freed, so the returned cpu is stable for the lifetime of the loaded
+ * scheduler.
+ */
+static inline s32 scx_cid_to_cpu(struct scx_sched *sch, s32 cid)
+{
+	if (!cid_valid(sch, cid))
+		return -EINVAL;
+	return __scx_cid_to_cpu(cid);
+}
+
+/**
+ * scx_cpu_to_cid - Translate @cpu to its cid
+ * @sch: scx_sched for error reporting
+ * @cpu: cpu to look up
+ *
+ * Return the cid for @cpu or a negative errno on failure. Invalid cpu triggers
+ * scx_error() on @sch. Same lifetime guarantee as scx_cid_to_cpu().
+ */
+static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
+{
+	if (!scx_cpu_valid(sch, cpu, NULL))
+		return -EINVAL;
+	return __scx_cpu_to_cid(cpu);
+}
+
+#endif /* _KERNEL_SCHED_EXT_CID_H */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 67b4b179b422..18f823d424cc 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -102,6 +102,9 @@ struct task_struct *scx_bpf_cpu_curr(s32 cpu) __ksym __weak;
 struct task_struct *scx_bpf_tid_to_task(u64 tid) __ksym __weak;
 u64 scx_bpf_now(void) __ksym __weak;
 void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __weak;
+s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
+s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
+void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
 
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 07/16] sched_ext: Add scx_bpf_cid_override() kfunc
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (5 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 06/16] sched_ext: Add topological CPU IDs (cids) Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 08/16] tools/sched_ext: Add struct_size() helpers to common.bpf.h Tejun Heo
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

The auto-probed cid mapping reflects the kernel's view of topology
(node -> LLC -> core), but a BPF scheduler may want a different layout -
to align cid slices with its own partitioning, or to work around how the
kernel reports the topology of a particular machine.

Add scx_bpf_cid_override(), callable from ops.init() of the root
scheduler. It validates the caller-supplied cpu->cid array and replaces
the mapping in place; the per-cid topo info is invalidated. A
compat.bpf.h wrapper silently no-ops on kernels that lack the kfunc.

A new SCX_KF_ALLOW_INIT bit in the kfunc context filter restricts the
kfunc to ops.init() at verifier load time.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c                       | 16 +++--
 kernel/sched/ext_cid.c                   | 75 +++++++++++++++++++++++-
 kernel/sched/ext_cid.h                   |  1 +
 tools/sched_ext/include/scx/compat.bpf.h | 12 ++++
 4 files changed, 97 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index ac0fa21cab26..fedad66d13b6 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9640,10 +9640,11 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = {
  */
 enum scx_kf_allow_flags {
 	SCX_KF_ALLOW_UNLOCKED		= 1 << 0,
-	SCX_KF_ALLOW_CPU_RELEASE	= 1 << 1,
-	SCX_KF_ALLOW_DISPATCH		= 1 << 2,
-	SCX_KF_ALLOW_ENQUEUE		= 1 << 3,
-	SCX_KF_ALLOW_SELECT_CPU		= 1 << 4,
+	SCX_KF_ALLOW_INIT		= 1 << 1,
+	SCX_KF_ALLOW_CPU_RELEASE	= 1 << 2,
+	SCX_KF_ALLOW_DISPATCH		= 1 << 3,
+	SCX_KF_ALLOW_ENQUEUE		= 1 << 4,
+	SCX_KF_ALLOW_SELECT_CPU		= 1 << 5,
 };
 
 /*
@@ -9671,7 +9672,7 @@ static const u32 scx_kf_allow_flags[] = {
 	[SCX_OP_IDX(sub_detach)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(cpu_online)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(cpu_offline)]	= SCX_KF_ALLOW_UNLOCKED,
-	[SCX_OP_IDX(init)]		= SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(init)]		= SCX_KF_ALLOW_UNLOCKED | SCX_KF_ALLOW_INIT,
 	[SCX_OP_IDX(exit)]		= SCX_KF_ALLOW_UNLOCKED,
 };
 
@@ -9686,6 +9687,7 @@ static const u32 scx_kf_allow_flags[] = {
 int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 {
 	bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id);
+	bool in_init = btf_id_set8_contains(&scx_kfunc_ids_init, kfunc_id);
 	bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id);
 	bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id);
 	bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id);
@@ -9695,7 +9697,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	u32 moff, flags;
 
 	/* Not an SCX kfunc - allow. */
-	if (!(in_unlocked || in_select_cpu || in_enqueue || in_dispatch ||
+	if (!(in_unlocked || in_init || in_select_cpu || in_enqueue || in_dispatch ||
 	      in_cpu_release || in_idle || in_any))
 		return 0;
 
@@ -9731,6 +9733,8 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 
 	if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked)
 		return 0;
+	if ((flags & SCX_KF_ALLOW_INIT) && in_init)
+		return 0;
 	if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release)
 		return 0;
 	if ((flags & SCX_KF_ALLOW_DISPATCH) && in_dispatch)
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
index 55467ca69800..4ee727d27c78 100644
--- a/kernel/sched/ext_cid.c
+++ b/kernel/sched/ext_cid.c
@@ -210,6 +210,68 @@ s32 scx_cid_init(struct scx_sched *sch)
 
 __bpf_kfunc_start_defs();
 
+/**
+ * scx_bpf_cid_override - Install an explicit cpu->cid mapping
+ * @cpu_to_cid: array of nr_cpu_ids s32 entries (cid for each cpu)
+ * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * May only be called from ops.init() of the root scheduler. Replace the
+ * topology-probed cid mapping with the caller-provided one. Each possible cpu
+ * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
+ * On invalid input, trigger scx_error() to abort the scheduler.
+ */
+__bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+				      const struct bpf_prog_aux *aux)
+{
+	cpumask_var_t seen __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	struct scx_sched *sch;
+	bool alloced;
+	s32 cpu, cid;
+
+	/* GFP_KERNEL alloc must happen before the rcu read section */
+	alloced = zalloc_cpumask_var(&seen, GFP_KERNEL);
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return;
+
+	if (!alloced) {
+		scx_error(sch, "scx_bpf_cid_override: failed to allocate cpumask");
+		return;
+	}
+
+	if (scx_parent(sch)) {
+		scx_error(sch, "scx_bpf_cid_override() only allowed from root sched");
+		return;
+	}
+
+	if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
+		scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
+			  nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
+		return;
+	}
+
+	for_each_possible_cpu(cpu) {
+		s32 c = cpu_to_cid[cpu];
+
+		if (!cid_valid(sch, c))
+			return;
+		if (cpumask_test_and_set_cpu(c, seen)) {
+			scx_error(sch, "cid %d assigned to multiple cpus", c);
+			return;
+		}
+		scx_cpu_to_cid_tbl[cpu] = c;
+		scx_cid_to_cpu_tbl[c] = cpu;
+	}
+
+	/* Invalidate stale topo info - the override carries no topology. */
+	for (cid = 0; cid < num_possible_cpus(); cid++)
+		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+}
+
 /**
  * scx_bpf_cid_to_cpu - Return the raw CPU id for @cid
  * @cid: cid to look up
@@ -282,6 +344,16 @@ __bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
 
 __bpf_kfunc_end_defs();
 
+BTF_KFUNCS_START(scx_kfunc_ids_init)
+BTF_ID_FLAGS(func, scx_bpf_cid_override, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
+BTF_KFUNCS_END(scx_kfunc_ids_init)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_init = {
+	.owner	= THIS_MODULE,
+	.set	= &scx_kfunc_ids_init,
+	.filter	= scx_kfunc_context_filter,
+};
+
 BTF_KFUNCS_START(scx_kfunc_ids_cid)
 BTF_ID_FLAGS(func, scx_bpf_cid_to_cpu, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_cpu_to_cid, KF_IMPLICIT_ARGS)
@@ -295,7 +367,8 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
 
 int scx_cid_kfunc_init(void)
 {
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init) ?:
+		register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
 		register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
 		register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
 }
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index dded0a540a26..19848fa9e8fc 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -68,6 +68,7 @@ struct scx_cid_topo {
 extern s16 *scx_cid_to_cpu_tbl;
 extern s16 *scx_cpu_to_cid_tbl;
 extern struct scx_cid_topo *scx_cid_topo;
+extern struct btf_id_set8 scx_kfunc_ids_init;
 
 s32 scx_cid_init(struct scx_sched *sch);
 int scx_cid_kfunc_init(void);
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 2808003eef04..6b9d054c3e4f 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -121,6 +121,18 @@ static inline bool scx_bpf_sub_dispatch(u64 cgroup_id)
 	return false;
 }
 
+/*
+ * v7.2: scx_bpf_cid_override() for explicit cpu->cid mapping. Ignore if
+ * missing.
+ */
+void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz) __ksym __weak;
+
+static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
+{
+	if (bpf_ksym_exists(scx_bpf_cid_override___compat))
+		return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz);
+}
+
 /**
  * __COMPAT_is_enq_cpu_selected - Test if SCX_ENQ_CPU_SELECTED is on
  * in a compatible way. We will preserve this __COMPAT helper until v6.16.
-- 
2.53.0



* [PATCH 08/16] tools/sched_ext: Add struct_size() helpers to common.bpf.h
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (6 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 07/16] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Add flex_array_size(), struct_size() and struct_size_t() to
scx/common.bpf.h so BPF schedulers can size flex-array-containing
structs the same way kernel code does. These are abbreviated forms of
the <linux/overflow.h> macros, minus the saturating overflow checks.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/include/scx/common.bpf.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 18f823d424cc..4bf959a8cd08 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -1043,6 +1043,16 @@ static inline u64 scx_clock_irq(u32 cpu)
 	return irqt ? BPF_CORE_READ(irqt, total) : 0;
 }
 
+/* Abbreviated forms of <linux/overflow.h>'s struct_size() family. */
+#define flex_array_size(p, member, count)	\
+	((count) * sizeof(*(p)->member))
+
+#define struct_size(p, member, count)		\
+	(sizeof(*(p)) + flex_array_size(p, member, count))
+
+#define struct_size_t(type, member, count)	\
+	struct_size((type *)NULL, member, count)
+
 #include "compat.bpf.h"
 #include "enums.bpf.h"
 
-- 
2.53.0



* [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (7 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 08/16] tools/sched_ext: Add struct_size() helpers to common.bpf.h Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 17:30   ` Cheng-Yang Chou
  2026-04-21 23:21   ` [PATCH v2] " Tejun Heo
  2026-04-21  7:19 ` [PATCH 10/16] sched_ext: Add cid-form kfunc wrappers alongside cpu-form Tejun Heo
                   ` (7 subsequent siblings)
  16 siblings, 2 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
most of its bits for a small window and is awkward in BPF.

scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
cmasks therefore address bits[] against the same global windows, so
cross-cmask word ops reduce to

	dest->bits[i] OP= operand->bits[i - delta]

with no bit-shifting, at the cost of up to one extra storage word for
head misalignment. This alignment guarantee is the reason binary ops
can stay word-level; every mutating helper preserves it.

Binary ops are op(dest, operand) and only touch the intersection. Single-
bit ops follow kernel bitops convention: bare = atomic, __-prefixed =
non-atomic. Bulk and find ops are non-atomic.

Kernel side in ext_cid.[hc]; BPF side in
tools/sched_ext/include/scx/cid.bpf.h. The BPF side drops the scx_
prefix (redundant in BPF code) and adds the extra helpers that basic
idle-cpu selection needs.

No callers yet.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext_cid.c                   | 139 ++++++
 kernel/sched/ext_cid.h                   | 169 +++++++
 tools/sched_ext/include/scx/cid.bpf.h    | 595 +++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |   1 +
 4 files changed, 904 insertions(+)
 create mode 100644 tools/sched_ext/include/scx/cid.bpf.h

diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
index 4ee727d27c78..c8b7cdaf82d5 100644
--- a/kernel/sched/ext_cid.c
+++ b/kernel/sched/ext_cid.c
@@ -365,6 +365,145 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
 	.set	= &scx_kfunc_ids_cid,
 };
 
+/*
+ * cmask bulk ops. See ext_cid.h for the layout and semantics: binary ops only
+ * touch the intersection of dest and operand ranges; dest bits outside the
+ * intersection, and dest head/tail padding, are left untouched. The 64-cid grid
+ * alignment of bits[] makes the word-to-word correspondence trivial.
+ */
+enum {
+	CMASK_OP_AND,
+	CMASK_OP_OR,
+	CMASK_OP_COPY,
+};
+
+void scx_cmask_zero(struct scx_cmask *m)
+{
+	memset(m->bits, 0, SCX_CMASK_NR_WORDS(m->nr_bits) * sizeof(u64));
+}
+
+/*
+ * Apply @op to one word - dest[@di] = (dest[@di] & ~@mask) | (op(...) & @mask).
+ * Only bits in @mask within the word are touched.
+ */
+static void cmask_op_word(struct scx_cmask *dest, const struct scx_cmask *operand,
+			  u32 di, u32 oi, u64 mask, int op)
+{
+	u64 dv = dest->bits[di];
+	u64 ov = operand->bits[oi];
+	u64 rv;
+
+	switch (op) {
+	case CMASK_OP_AND:
+		rv = dv & ov;
+		break;
+	case CMASK_OP_OR:
+		rv = dv | ov;
+		break;
+	case CMASK_OP_COPY:
+		rv = ov;
+		break;
+	default:
+		BUG();
+	}
+
+	dest->bits[di] = (dv & ~mask) | (rv & mask);
+}
+
+static void cmask_op(struct scx_cmask *dest, const struct scx_cmask *operand, int op)
+{
+	u32 lo = max(dest->base, operand->base);
+	u32 hi = min(dest->base + dest->nr_bits,
+		     operand->base + operand->nr_bits);
+	u32 d_base = dest->base / 64;
+	u32 o_base = operand->base / 64;
+	u32 lo_word, hi_word, w;
+	u64 head_mask, tail_mask;
+
+	if (lo >= hi)
+		return;
+
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+	head_mask = GENMASK_U64(63, lo & 63);
+	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+	/* intersection fits in a single word - apply both head and tail */
+	if (lo_word == hi_word) {
+		cmask_op_word(dest, operand, lo_word - d_base, lo_word - o_base,
+			      head_mask & tail_mask, op);
+		return;
+	}
+
+	/* first word: head mask */
+	cmask_op_word(dest, operand, lo_word - d_base, lo_word - o_base, head_mask, op);
+
+	/* interior words: unmasked */
+	for (w = lo_word + 1; w < hi_word; w++)
+		cmask_op_word(dest, operand, w - d_base, w - o_base,
+			      GENMASK_U64(63, 0), op);
+
+	/* last word: tail mask */
+	cmask_op_word(dest, operand, hi_word - d_base, hi_word - o_base, tail_mask, op);
+}
+
+/*
+ * scx_cmask_and/or/copy only modify @dest bits that lie in the intersection
+ * of [@dest->base, @dest->base + @dest->nr_bits) and [@operand->base,
+ * @operand->base + @operand->nr_bits). Bits in @dest outside that window keep
+ * their prior values - in particular, scx_cmask_copy() does NOT zero @dest
+ * bits that lie outside @operand's range.
+ */
+void scx_cmask_and(struct scx_cmask *dest, const struct scx_cmask *operand)
+{
+	cmask_op(dest, operand, CMASK_OP_AND);
+}
+
+void scx_cmask_or(struct scx_cmask *dest, const struct scx_cmask *operand)
+{
+	cmask_op(dest, operand, CMASK_OP_OR);
+}
+
+void scx_cmask_copy(struct scx_cmask *dest, const struct scx_cmask *operand)
+{
+	cmask_op(dest, operand, CMASK_OP_COPY);
+}
+
+/**
+ * scx_cmask_next_set - find the first set bit at or after @cid
+ * @m: cmask to search
+ * @cid: starting cid (clamped to @m->base if below)
+ *
+ * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
+ * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
+ * termination condition used by scx_cmask_for_each_set()).
+ */
+u32 scx_cmask_next_set(const struct scx_cmask *m, u32 cid)
+{
+	u32 end = m->base + m->nr_bits;
+	u32 base = m->base / 64;
+	u32 last_wi = (end - 1) / 64 - base;
+	u32 wi;
+	u64 word;
+
+	if (cid < m->base)
+		cid = m->base;
+	if (cid >= end)
+		return end;
+
+	wi = cid / 64 - base;
+	word = m->bits[wi] & GENMASK_U64(63, cid & 63);
+
+	while (!word) {
+		if (++wi > last_wi)
+			return end;
+		word = m->bits[wi];
+	}
+
+	cid = (base + wi) * 64 + __ffs64(word);
+	return cid < end ? cid : end;
+}
+
 int scx_cid_kfunc_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init) ?:
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index 19848fa9e8fc..46f03f2150c2 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -145,4 +145,173 @@ static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
 	return __scx_cpu_to_cid(cpu);
 }
 
+/*
+ * cmask: variable-length, base-windowed bitmap over cid space
+ * -----------------------------------------------------------
+ *
+ * A cmask covers the cid range [base, base + nr_bits). bits[] is aligned to the
+ * global 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64), so the
+ * first (base & 63) bits of bits[0] are head padding and any tail past base +
+ * nr_bits is tail padding. Both must stay zero for the lifetime of the mask;
+ * all mutating helpers preserve that invariant.
+ *
+ * Grid alignment means two cmasks always address bits[] against the same global
+ * 64-cid windows, so cross-cmask word ops (AND, OR, ...) reduce to
+ *
+ *	dest->bits[i] OP= operand->bits[i - delta]
+ *
+ * with no bit-shifting, regardless of how the two bases relate mod 64.
+ *
+ * Binary ops take the form op(dest, operand) and only touch the intersection of
+ * the two ranges on dest; dest bits outside the intersection are left
+ * unchanged. Single-bit ops follow kernel bitops conventions: the bare name is
+ * atomic, the __-prefixed variant is non-atomic. Bulk ops are non-atomic.
+ *
+ * Single-bit ops use atomic64_*() rather than set_bit()/clear_bit() so the u64
+ * storage is addressed identically on 64-bit and 32-bit kernels regardless of
+ * endianness (set_bit() addresses the storage as unsigned long[], whose layout
+ * diverges from u64 on 32-bit big-endian). If test_and_set/test_and_clear
+ * codegen on x86 matters - they fall to a LOCK CMPXCHG loop here vs a single
+ * LOCK BTS/BTR with the bitops family - those two can be ifdef'd to the bitops
+ * primitives under BITS_PER_LONG == 64.
+ */
+struct scx_cmask {
+	u32 base;
+	u32 nr_bits;
+	DECLARE_FLEX_ARRAY(u64, bits);
+};
+
+/*
+ * Number of u64 words of bits[] storage that covers @nr_bits regardless of base
+ * alignment. The +1 absorbs up to 63 bits of head padding when base is not
+ * 64-aligned - always allocating one extra word beats branching on base or
+ * splitting the compute.
+ */
+#define SCX_CMASK_NR_WORDS(nr_bits)	(((nr_bits) + 63) / 64 + 1)
+
+/*
+ * Define an on-stack cmask for up to @cap_bits. @name is a struct scx_cmask *
+ * aliasing zero-initialized storage; call scx_cmask_init() to set base/nr_bits.
+ */
+#define SCX_CMASK_DEFINE(name, cap_bits)	\
+	DEFINE_RAW_FLEX(struct scx_cmask, name, bits, SCX_CMASK_NR_WORDS(cap_bits))
+
+static inline bool __scx_cmask_contains(const struct scx_cmask *m, u32 cid)
+{
+	return likely(cid >= m->base && cid < m->base + m->nr_bits);
+}
+
+/* Word in bits[] covering @cid. @cid must satisfy __scx_cmask_contains(). */
+static inline u64 *__scx_cmask_word(const struct scx_cmask *m, u32 cid)
+{
+	return (u64 *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static inline void scx_cmask_init(struct scx_cmask *m, u32 base, u32 nr_bits)
+{
+	m->base = base;
+	m->nr_bits = nr_bits;
+	memset(m->bits, 0, SCX_CMASK_NR_WORDS(nr_bits) * sizeof(u64));
+}
+
+static inline bool scx_cmask_test(const struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return false;
+	return READ_ONCE(*__scx_cmask_word(m, cid)) & BIT_U64(cid & 63);
+}
+
+static inline void scx_cmask_set(struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return;
+	atomic64_or(BIT_U64(cid & 63), (atomic64_t *)__scx_cmask_word(m, cid));
+}
+
+static inline void scx_cmask_clear(struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return;
+	atomic64_and(~BIT_U64(cid & 63), (atomic64_t *)__scx_cmask_word(m, cid));
+}
+
+/*
+ * test_and_set/test_and_clear use atomic64_fetch_or/and which lower to a LOCK
+ * CMPXCHG loop on x86 (vs a single LOCK BTS/BTR with test_and_set_bit). If this
+ * ever matters, these two can be ifdef'd to the bitops primitives under
+ * BITS_PER_LONG == 64.
+ */
+static inline bool scx_cmask_test_and_set(struct scx_cmask *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+
+	if (!__scx_cmask_contains(m, cid))
+		return false;
+	return atomic64_fetch_or(bit, (atomic64_t *)__scx_cmask_word(m, cid)) & bit;
+}
+
+static inline bool scx_cmask_test_and_clear(struct scx_cmask *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+
+	if (!__scx_cmask_contains(m, cid))
+		return false;
+	return atomic64_fetch_and(~bit, (atomic64_t *)__scx_cmask_word(m, cid)) & bit;
+}
+
+static inline void __scx_cmask_set(struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return;
+	*__scx_cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
+static inline void __scx_cmask_clear(struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return;
+	*__scx_cmask_word(m, cid) &= ~BIT_U64(cid & 63);
+}
+
+static inline bool __scx_cmask_test_and_set(struct scx_cmask *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 *w, prev;
+
+	if (!__scx_cmask_contains(m, cid))
+		return false;
+	w = __scx_cmask_word(m, cid);
+	prev = *w & bit;
+	*w |= bit;
+	return prev;
+}
+
+static inline bool __scx_cmask_test_and_clear(struct scx_cmask *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 *w, prev;
+
+	if (!__scx_cmask_contains(m, cid))
+		return false;
+	w = __scx_cmask_word(m, cid);
+	prev = *w & bit;
+	*w &= ~bit;
+	return prev;
+}
+
+void scx_cmask_zero(struct scx_cmask *m);
+void scx_cmask_copy(struct scx_cmask *dest, const struct scx_cmask *operand);
+void scx_cmask_and(struct scx_cmask *dest, const struct scx_cmask *operand);
+void scx_cmask_or(struct scx_cmask *dest, const struct scx_cmask *operand);
+u32  scx_cmask_next_set(const struct scx_cmask *m, u32 cid);
+
+static inline u32 scx_cmask_first_set(const struct scx_cmask *m)
+{
+	return scx_cmask_next_set(m, m->base);
+}
+
+#define scx_cmask_for_each_set(cid, m)						\
+	for ((cid) = scx_cmask_first_set(m);					\
+	     (cid) < (m)->base + (m)->nr_bits;					\
+	     (cid) = scx_cmask_next_set((m), (cid) + 1))
+
 #endif /* _KERNEL_SCHED_EXT_CID_H */
diff --git a/tools/sched_ext/include/scx/cid.bpf.h b/tools/sched_ext/include/scx/cid.bpf.h
new file mode 100644
index 000000000000..a0d7beb62384
--- /dev/null
+++ b/tools/sched_ext/include/scx/cid.bpf.h
@@ -0,0 +1,595 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF-side helpers for cids and cmasks. See kernel/sched/ext_cid.h for the
+ * authoritative layout and semantics. The BPF-side helpers use the cmask_*
+ * naming (no scx_ prefix); cmask is the SCX bitmap type so the prefix is
+ * redundant in BPF code. Atomics use __sync_val_compare_and_swap and every
+ * helper is inline (no .c counterpart).
+ *
+ * Included by scx/common.bpf.h; don't include directly.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef __SCX_CID_BPF_H
+#define __SCX_CID_BPF_H
+
+#include "bpf_arena_common.bpf.h"
+
+#ifndef BIT_U64
+#define BIT_U64(nr)		(1ULL << (nr))
+#endif
+#ifndef GENMASK_U64
+#define GENMASK_U64(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))
+#endif
+
+/*
+ * Storage cap for bounded loops over bits[]. Sized to cover NR_CPUS=8192 with
+ * one extra word for head-misalignment. Increase if deployment targets larger
+ * NR_CPUS.
+ */
+#ifndef CMASK_MAX_WORDS
+#define CMASK_MAX_WORDS 129
+#endif
+
+#define CMASK_NR_WORDS(nr_bits)		(((nr_bits) + 63) / 64 + 1)
+
+static __always_inline bool __cmask_contains(const struct scx_cmask __arena *m, u32 cid)
+{
+	return cid >= m->base && cid < m->base + m->nr_bits;
+}
+
+static __always_inline u64 __arena *__cmask_word(const struct scx_cmask __arena *m, u32 cid)
+{
+	return (u64 __arena *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static __always_inline void cmask_init(struct scx_cmask __arena *m, u32 base, u32 nr_bits)
+{
+	u32 nr_words = CMASK_NR_WORDS(nr_bits), i;
+
+	m->base = base;
+	m->nr_bits = nr_bits;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		m->bits[i] = 0;
+	}
+}
+
+static __always_inline bool cmask_test(const struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return false;
+	return *__cmask_word(m, cid) & BIT_U64(cid & 63);
+}
+
+/*
+ * x86 BPF JIT rejects BPF_OR | BPF_FETCH and BPF_AND | BPF_FETCH on arena
+ * pointers (see bpf_jit_supports_insn() in arch/x86/net/bpf_jit_comp.c). Only
+ * BPF_CMPXCHG / BPF_XCHG / BPF_ADD with FETCH are allowed. Implement
+ * test_and_{set,clear} and the atomic set/clear via a cmpxchg loop.
+ *
+ * CMASK_CAS_TRIES is far above what any non-pathological contention needs.
+ * Exhausting it means the bit update was lost, which corrupts the caller's view
+ * of the bitmap, so raise scx_bpf_error() to abort the scheduler.
+ */
+#define CMASK_CAS_TRIES		1024
+
+static __always_inline void cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (old & bit)
+			return;
+		new = old | bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return;
+	}
+	scx_bpf_error("cmask_set CAS exhausted at cid %u", cid);
+}
+
+static __always_inline void cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (!(old & bit))
+			return;
+		new = old & ~bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return;
+	}
+	scx_bpf_error("cmask_clear CAS exhausted at cid %u", cid);
+}
+
+static __always_inline bool cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (old & bit)
+			return true;
+		new = old | bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return false;
+	}
+	scx_bpf_error("cmask_test_and_set CAS exhausted at cid %u", cid);
+	return false;
+}
+
+static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (!(old & bit))
+			return false;
+		new = old & ~bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return true;
+	}
+	scx_bpf_error("cmask_test_and_clear CAS exhausted at cid %u", cid);
+	return false;
+}
+
+static __always_inline void __cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return;
+	*__cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
+static __always_inline void __cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return;
+	*__cmask_word(m, cid) &= ~BIT_U64(cid & 63);
+}
+
+static __always_inline bool __cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 __arena *w;
+	u64 prev;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	prev = *w & bit;
+	*w |= bit;
+	return prev;
+}
+
+static __always_inline bool __cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 __arena *w;
+	u64 prev;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	prev = *w & bit;
+	*w &= ~bit;
+	return prev;
+}
+
+static __always_inline void cmask_zero(struct scx_cmask __arena *m)
+{
+	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		m->bits[i] = 0;
+	}
+}
+
+/*
+ * BPF_-prefixed to avoid colliding with the kernel's anonymous CMASK_OP_*
+ * enum in ext_cid.c, which is exported via BTF and reachable through
+ * vmlinux.h.
+ */
+enum {
+	BPF_CMASK_OP_AND,
+	BPF_CMASK_OP_OR,
+	BPF_CMASK_OP_COPY,
+};
+
+static __always_inline void cmask_op_word(struct scx_cmask __arena *dest,
+					  const struct scx_cmask __arena *operand,
+					  u32 di, u32 oi, u64 mask, int op)
+{
+	u64 dv = dest->bits[di];
+	u64 ov = operand->bits[oi];
+	u64 rv;
+
+	if (op == BPF_CMASK_OP_AND)
+		rv = dv & ov;
+	else if (op == BPF_CMASK_OP_OR)
+		rv = dv | ov;
+	else
+		rv = ov;
+
+	dest->bits[di] = (dv & ~mask) | (rv & mask);
+}
+
+static __always_inline void cmask_op(struct scx_cmask __arena *dest,
+				     const struct scx_cmask __arena *operand, int op)
+{
+	u32 d_end = dest->base + dest->nr_bits;
+	u32 o_end = operand->base + operand->nr_bits;
+	u32 lo = dest->base > operand->base ? dest->base : operand->base;
+	u32 hi = d_end < o_end ? d_end : o_end;
+	u32 d_base = dest->base / 64;
+	u32 o_base = operand->base / 64;
+	u32 lo_word, hi_word, i;
+	u64 head_mask, tail_mask;
+
+	if (lo >= hi)
+		return;
+
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+	head_mask = GENMASK_U64(63, lo & 63);
+	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 w = lo_word + i;
+		u64 m;
+
+		if (w > hi_word)
+			break;
+
+		m = GENMASK_U64(63, 0);
+		if (w == lo_word)
+			m &= head_mask;
+		if (w == hi_word)
+			m &= tail_mask;
+
+		cmask_op_word(dest, operand, w - d_base, w - o_base, m, op);
+	}
+}
+
+/*
+ * cmask_and/or/copy only modify @dest bits that lie in the intersection of
+ * [@dest->base, @dest->base + @dest->nr_bits) and [@operand->base,
+ * @operand->base + @operand->nr_bits). Bits in @dest outside that window
+ * keep their prior values - in particular, cmask_copy() does NOT zero @dest
+ * bits that lie outside @operand's range.
+ */
+static __always_inline void cmask_and(struct scx_cmask __arena *dest,
+				      const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_AND);
+}
+
+static __always_inline void cmask_or(struct scx_cmask __arena *dest,
+				     const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_OR);
+}
+
+static __always_inline void cmask_copy(struct scx_cmask __arena *dest,
+				       const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_COPY);
+}
+
+/**
+ * cmask_next_set - find the first set bit at or after @cid
+ * @m: cmask to search
+ * @cid: starting cid (clamped to @m->base if below)
+ *
+ * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
+ * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
+ * termination condition used by cmask_for_each()).
+ */
+static __always_inline u32 cmask_next_set(const struct scx_cmask __arena *m, u32 cid)
+{
+	u32 end = m->base + m->nr_bits;
+	u32 base = m->base / 64;
+	u32 last_wi = (end - 1) / 64 - base;
+	u32 start_wi, start_bit, i;
+
+	if (cid < m->base)
+		cid = m->base;
+	if (cid >= end)
+		return end;
+
+	start_wi = cid / 64 - base;
+	start_bit = cid & 63;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 wi = start_wi + i;
+		u64 word;
+		u32 found;
+
+		if (wi > last_wi)
+			break;
+
+		word = m->bits[wi];
+		if (i == 0)
+			word &= GENMASK_U64(63, start_bit);
+		if (!word)
+			continue;
+
+		found = (base + wi) * 64 + __builtin_ctzll(word);
+		if (found >= end)
+			return end;
+		return found;
+	}
+	return end;
+}
+
+static __always_inline u32 cmask_first_set(const struct scx_cmask __arena *m)
+{
+	return cmask_next_set(m, m->base);
+}
+
+#define cmask_for_each(cid, m)							\
+	for ((cid) = cmask_first_set(m);					\
+	     (cid) < (m)->base + (m)->nr_bits;					\
+	     (cid) = cmask_next_set((m), (cid) + 1))
+
+/*
+ * Population count over [base, base + nr_bits). Padding bits in the head/tail
+ * words are guaranteed zero by the mutating helpers, so a flat popcount over
+ * all words is correct.
+ */
+static __always_inline u32 cmask_weight(const struct scx_cmask __arena *m)
+{
+	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+	u32 count = 0;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		count += __builtin_popcountll(m->bits[i]);
+	}
+	return count;
+}
+
+/*
+ * True if @a and @b share any set bit. Walk only the intersection of their
+ * ranges, matching the semantics of cmask_and().
+ */
+static __always_inline bool cmask_intersects(const struct scx_cmask __arena *a,
+					     const struct scx_cmask __arena *b)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 b_end = b->base + b->nr_bits;
+	u32 lo = a->base > b->base ? a->base : b->base;
+	u32 hi = a_end < b_end ? a_end : b_end;
+	u32 a_base = a->base / 64;
+	u32 b_base = b->base / 64;
+	u32 lo_word, hi_word, i;
+	u64 head_mask, tail_mask;
+
+	if (lo >= hi)
+		return false;
+
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+	head_mask = GENMASK_U64(63, lo & 63);
+	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 w = lo_word + i;
+		u64 mask, av, bv;
+
+		if (w > hi_word)
+			break;
+
+		mask = GENMASK_U64(63, 0);
+		if (w == lo_word)
+			mask &= head_mask;
+		if (w == hi_word)
+			mask &= tail_mask;
+
+		av = a->bits[w - a_base] & mask;
+		bv = b->bits[w - b_base] & mask;
+		if (av & bv)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, bounded by the
+ * intersection of the two ranges. Return a->base + a->nr_bits if none found.
+ *
+ * Building block for cmask_next_and_set_wrap(). Callers that want a bounded
+ * scan without wrap call this directly.
+ */
+static __always_inline u32 cmask_next_and_set(const struct scx_cmask __arena *a,
+					      const struct scx_cmask __arena *b,
+					      u32 start)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 b_end = b->base + b->nr_bits;
+	u32 a_wbase = a->base / 64;
+	u32 b_wbase = b->base / 64;
+	u32 lo = a->base > b->base ? a->base : b->base;
+	u32 hi = a_end < b_end ? a_end : b_end;
+	u32 last_wi, start_wi, start_bit, i;
+
+	if (lo >= hi)
+		return a_end;
+	if (start < lo)
+		start = lo;
+	if (start >= hi)
+		return a_end;
+
+	last_wi = (hi - 1) / 64;
+	start_wi = start / 64;
+	start_bit = start & 63;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 abs_wi = start_wi + i;
+		u64 word;
+		u32 found;
+
+		if (abs_wi > last_wi)
+			break;
+
+		word = a->bits[abs_wi - a_wbase] & b->bits[abs_wi - b_wbase];
+		if (i == 0)
+			word &= GENMASK_U64(63, start_bit);
+		if (!word)
+			continue;
+
+		found = abs_wi * 64 + __builtin_ctzll(word);
+		if (found >= hi)
+			return a_end;
+		return found;
+	}
+	return a_end;
+}
+
+/*
+ * Find the next set cid in @m at or after @start, wrapping to @m->base if no
+ * set bit is found in [start, m->base + m->nr_bits). Return m->base +
+ * m->nr_bits if @m is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_set_wrap(const struct scx_cmask __arena *m,
+					       u32 start)
+{
+	u32 end = m->base + m->nr_bits;
+	u32 found;
+
+	found = cmask_next_set(m, start);
+	if (found < end || start <= m->base)
+		return found;
+
+	found = cmask_next_set(m, m->base);
+	return found < start ? found : end;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, wrapping to
+ * @a->base if none found in the forward half. Return a->base + a->nr_bits
+ * if the intersection is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_and_set_wrap(const struct scx_cmask __arena *a,
+						   const struct scx_cmask __arena *b,
+						   u32 start)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 found;
+
+	found = cmask_next_and_set(a, b, start);
+	if (found < a_end || start <= a->base)
+		return found;
+
+	found = cmask_next_and_set(a, b, a->base);
+	return found < start ? found : a_end;
+}
+
+/**
+ * cmask_from_cpumask - translate a kernel cpumask to a cid-space cmask
+ * @m: cmask to fill. Zeroed first; only bits within [@m->base, @m->base +
+ *     @m->nr_bits) are updated - cpus mapping to cids outside that range
+ *     are ignored.
+ * @cpumask: kernel cpumask to translate
+ *
+ * For each cpu in @cpumask, set the cpu's cid in @m. Caller must ensure
+ * @cpumask stays stable across the call (e.g. RCU read lock for
+ * task->cpus_ptr).
+ */
+static __always_inline void cmask_from_cpumask(struct scx_cmask __arena *m,
+					       const struct cpumask *cpumask)
+{
+	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	s32 cpu;
+
+	cmask_zero(m);
+	bpf_for(cpu, 0, nr_cpu_ids) {
+		s32 cid;
+
+		if (!bpf_cpumask_test_cpu(cpu, cpumask))
+			continue;
+		cid = scx_bpf_cpu_to_cid(cpu);
+		if (cid >= 0)
+			__cmask_set(m, cid);
+	}
+}
+
+/**
+ * cmask_copy_from_kernel - copy a kernel-memory scx_cmask into an arena cmask
+ * @dst: arena cmask to fill. Must have base 0 and be sized for at least
+ *       @src's bit count.
+ * @src: kernel-memory cmask (e.g. the @cmask arg delivered to ops.set_cmask()).
+ *       Kernel guarantees @src->base == 0.
+ *
+ * Probe the kernel header for nr_bits, zero @dst, then copy @src->bits[]
+ * word by word via bpf_probe_read_kernel. Call scx_bpf_error() on any probe
+ * failure. Intended for set_cmask callbacks where @src is kernel memory that
+ * BPF cmask helpers (which expect __arena pointers) can't touch directly.
+ */
+static __always_inline void cmask_copy_from_kernel(struct scx_cmask __arena *dst,
+						   const struct scx_cmask *src)
+{
+	u32 nr_bits = 0, nr_words, dst_nr_words, wi;
+
+	if (bpf_probe_read_kernel(&nr_bits, sizeof(nr_bits), &src->nr_bits)) {
+		scx_bpf_error("probe-read cmask->nr_bits failed");
+		return;
+	}
+
+	nr_words = CMASK_NR_WORDS(nr_bits);
+	dst_nr_words = CMASK_NR_WORDS(dst->nr_bits);
+	if (nr_words > dst_nr_words) {
+		scx_bpf_error("src cmask nr_bits=%u exceeds dst capacity",
+			      nr_bits);
+		return;
+	}
+
+	cmask_zero(dst);
+	bpf_for(wi, 0, CMASK_MAX_WORDS) {
+		u64 word = 0;
+		if (wi >= nr_words)
+			break;
+		if (bpf_probe_read_kernel(&word, sizeof(u64), &src->bits[wi])) {
+			scx_bpf_error("probe-read cmask->bits[%u] failed", wi);
+			return;
+		}
+		dst->bits[wi] = word;
+	}
+}
+
+#endif /* __SCX_CID_BPF_H */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 4bf959a8cd08..3e353dfafb46 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -1055,5 +1055,6 @@ static inline u64 scx_clock_irq(u32 cpu)
 
 #include "compat.bpf.h"
 #include "enums.bpf.h"
+#include "cid.bpf.h"
 
 #endif	/* __SCX_COMMON_BPF_H */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 10/16] sched_ext: Add cid-form kfunc wrappers alongside cpu-form
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (8 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 11/16] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type Tejun Heo
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

cpumask is awkward from BPF and unusable from arena; cid/cmask work in
both. Sub-sched enqueue will need cmask. Without full cid coverage a
scheduler has to mix cid and cpu forms, which is a subtle-bug factory.
Close the gap with a cid-native interface.

Pair every cpu-form kfunc that takes a cpu id with a cid-form
equivalent (kick, task placement, cpuperf query/set, per-cpu current
task, nr-cpu-ids). Add two cid-native kfuncs with no cpu-form sibling:
scx_bpf_this_cid() (the cid of the running cpu, the scx equivalent of
bpf_get_smp_processor_id()) and scx_bpf_nr_online_cids().

scx_bpf_cpu_rq() is deprecated and gets no cid-form counterpart. NUMA
node info is reachable via scx_bpf_cid_topo() on the BPF side.

Each cid-form wrapper is a thin cid -> cpu translation that delegates
to the cpu path, registered in the same context sets so usage
constraints match.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c                       | 173 +++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |   9 ++
 2 files changed, 182 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fedad66d13b6..8d52e579b96c 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -8750,6 +8750,28 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
 		scx_kick_cpu(sch, cpu, flags);
 }
 
+/**
+ * scx_bpf_kick_cid - Trigger reschedule on the CPU mapped to @cid
+ * @cid: cid to kick
+ * @flags: %SCX_KICK_* flags
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_kick_cpu().
+ */
+__bpf_kfunc void scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+	s32 cpu;
+
+	guard(rcu)();
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return;
+	cpu = scx_cid_to_cpu(sch, cid);
+	if (cpu >= 0)
+		scx_kick_cpu(sch, cpu, flags);
+}
+
 /**
  * scx_bpf_dsq_nr_queued - Return the number of queued tasks
  * @dsq_id: id of the DSQ
@@ -9172,6 +9194,29 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu, const struct bpf_prog_aux *aux)
 		return SCX_CPUPERF_ONE;
 }
 
+/**
+ * scx_bpf_cidperf_cap - Query the maximum relative capacity of the CPU at @cid
+ * @cid: cid of the CPU to query
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_cap().
+ */
+__bpf_kfunc u32 scx_bpf_cidperf_cap(s32 cid, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+	s32 cpu;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return SCX_CPUPERF_ONE;
+	cpu = scx_cid_to_cpu(sch, cid);
+	if (cpu < 0)
+		return SCX_CPUPERF_ONE;
+	return arch_scale_cpu_capacity(cpu);
+}
+
 /**
  * scx_bpf_cpuperf_cur - Query the current relative performance of a CPU
  * @cpu: CPU of interest
@@ -9200,6 +9245,29 @@ __bpf_kfunc u32 scx_bpf_cpuperf_cur(s32 cpu, const struct bpf_prog_aux *aux)
 		return SCX_CPUPERF_ONE;
 }
 
+/**
+ * scx_bpf_cidperf_cur - Query the current performance of the CPU at @cid
+ * @cid: cid of the CPU to query
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_cur().
+ */
+__bpf_kfunc u32 scx_bpf_cidperf_cur(s32 cid, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+	s32 cpu;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return SCX_CPUPERF_ONE;
+	cpu = scx_cid_to_cpu(sch, cid);
+	if (cpu < 0)
+		return SCX_CPUPERF_ONE;
+	return arch_scale_freq_capacity(cpu);
+}
+
 /**
  * scx_bpf_cpuperf_set - Set the relative performance target of a CPU
  * @cpu: CPU of interest
@@ -9260,6 +9328,31 @@ __bpf_kfunc void scx_bpf_cpuperf_set(s32 cpu, u32 perf, const struct bpf_prog_au
 	}
 }
 
+/**
+ * scx_bpf_cidperf_set - Set the performance target of the CPU at @cid
+ * @cid: cid of the CPU to target
+ * @perf: target performance level [0, %SCX_CPUPERF_ONE]
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpuperf_set().
+ */
+__bpf_kfunc void scx_bpf_cidperf_set(s32 cid, u32 perf,
+				     const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+	s32 cpu;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return;
+	cpu = scx_cid_to_cpu(sch, cid);
+	if (cpu < 0)
+		return;
+	scx_bpf_cpuperf_set(cpu, perf, aux);
+}
+
 /**
  * scx_bpf_nr_node_ids - Return the number of possible node IDs
  *
@@ -9280,6 +9373,41 @@ __bpf_kfunc u32 scx_bpf_nr_cpu_ids(void)
 	return nr_cpu_ids;
 }
 
+/**
+ * scx_bpf_nr_cids - Return the size of the cid space
+ *
+ * Equals num_possible_cpus(). All valid cids are in [0, return value).
+ */
+__bpf_kfunc u32 scx_bpf_nr_cids(void)
+{
+	return num_possible_cpus();
+}
+
+/**
+ * scx_bpf_nr_online_cids - Return current count of online CPUs in cid space
+ *
+ * Return num_online_cpus(). The standard model restarts the scheduler on
+ * hotplug, which lets schedulers treat [0, nr_online_cids) as the online
+ * range. Schedulers that prefer to handle hotplug without a restart should
+ * install a custom mapping via scx_bpf_cid_override() and track onlining
+ * through the ops.cid_online / ops.cid_offline callbacks.
+ */
+__bpf_kfunc u32 scx_bpf_nr_online_cids(void)
+{
+	return num_online_cpus();
+}
+
+/**
+ * scx_bpf_this_cid - Return the cid of the CPU this program is running on
+ *
+ * cid-addressed equivalent of bpf_get_smp_processor_id() for scx programs.
+ * The current cpu is trivially valid, so this is just a table lookup.
+ */
+__bpf_kfunc s32 scx_bpf_this_cid(void)
+{
+	return __scx_cpu_to_cid(raw_smp_processor_id());
+}
+
 /**
  * scx_bpf_get_possible_cpumask - Get a referenced kptr to cpu_possible_mask
  */
@@ -9328,6 +9456,18 @@ __bpf_kfunc s32 scx_bpf_task_cpu(const struct task_struct *p)
 	return task_cpu(p);
 }
 
+/**
+ * scx_bpf_task_cid - cid a task is currently associated with
+ * @p: task of interest
+ *
+ * cid-addressed equivalent of scx_bpf_task_cpu(). task_cpu(p) is always a
+ * valid cpu, so this is just a table lookup.
+ */
+__bpf_kfunc s32 scx_bpf_task_cid(const struct task_struct *p)
+{
+	return __scx_cpu_to_cid(task_cpu(p));
+}
+
 /**
  * scx_bpf_cpu_rq - Fetch the rq of a CPU
  * @cpu: CPU of the rq
@@ -9406,6 +9546,30 @@ __bpf_kfunc struct task_struct *scx_bpf_cpu_curr(s32 cpu, const struct bpf_prog_
 	return rcu_dereference(cpu_rq(cpu)->curr);
 }
 
+/**
+ * scx_bpf_cid_curr - Return the curr task on the CPU at @cid
+ * @cid: cid of interest
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * cid-addressed equivalent of scx_bpf_cpu_curr(). Callers must hold RCU
+ * read lock (KF_RCU).
+ */
+__bpf_kfunc struct task_struct *scx_bpf_cid_curr(s32 cid, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+	s32 cpu;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return NULL;
+	cpu = scx_cid_to_cpu(sch, cid);
+	if (cpu < 0)
+		return NULL;
+	return rcu_dereference(cpu_rq(cpu)->curr);
+}
+
 /**
  * scx_bpf_tid_to_task - Look up a task by its scx tid
  * @tid: task ID previously read from p->scx.tid
@@ -9593,6 +9757,7 @@ BTF_KFUNCS_START(scx_kfunc_ids_any)
 BTF_ID_FLAGS(func, scx_bpf_task_set_slice, KF_IMPLICIT_ARGS | KF_RCU);
 BTF_ID_FLAGS(func, scx_bpf_task_set_dsq_vtime, KF_IMPLICIT_ARGS | KF_RCU);
 BTF_ID_FLAGS(func, scx_bpf_kick_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_kick_cid, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
 BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
 BTF_ID_FLAGS(func, scx_bpf_dsq_peek, KF_IMPLICIT_ARGS | KF_RCU_PROTECTED | KF_RET_NULL)
@@ -9607,16 +9772,24 @@ BTF_ID_FLAGS(func, scx_bpf_dump_bstr, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_cpuperf_set, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_cap, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_cur, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cidperf_set, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_nr_node_ids)
 BTF_ID_FLAGS(func, scx_bpf_nr_cpu_ids)
+BTF_ID_FLAGS(func, scx_bpf_nr_cids)
+BTF_ID_FLAGS(func, scx_bpf_nr_online_cids)
+BTF_ID_FLAGS(func, scx_bpf_this_cid)
 BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask, KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_put_cpumask, KF_RELEASE)
 BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
+BTF_ID_FLAGS(func, scx_bpf_task_cid, KF_RCU)
 BTF_ID_FLAGS(func, scx_bpf_cpu_rq, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_locked_rq, KF_IMPLICIT_ARGS | KF_RET_NULL)
 BTF_ID_FLAGS(func, scx_bpf_cpu_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
+BTF_ID_FLAGS(func, scx_bpf_cid_curr, KF_IMPLICIT_ARGS | KF_RET_NULL | KF_RCU_PROTECTED)
 BTF_ID_FLAGS(func, scx_bpf_tid_to_task, KF_RET_NULL | KF_RCU_PROTECTED)
 BTF_ID_FLAGS(func, scx_bpf_now)
 BTF_ID_FLAGS(func, scx_bpf_events)
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 3e353dfafb46..4acfc9e8a645 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -105,6 +105,15 @@ void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __
 s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
 s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
 void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
+void scx_bpf_kick_cid(s32 cid, u64 flags) __ksym __weak;
+s32 scx_bpf_task_cid(const struct task_struct *p) __ksym __weak;
+s32 scx_bpf_this_cid(void) __ksym __weak;
+struct task_struct *scx_bpf_cid_curr(s32 cid) __ksym __weak;
+u32 scx_bpf_nr_cids(void) __ksym __weak;
+u32 scx_bpf_nr_online_cids(void) __ksym __weak;
+u32 scx_bpf_cidperf_cap(s32 cid) __ksym __weak;
+u32 scx_bpf_cidperf_cur(s32 cid) __ksym __weak;
+void scx_bpf_cidperf_set(s32 cid, u32 perf) __ksym __weak;
 
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 11/16] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (9 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 10/16] sched_ext: Add cid-form kfunc wrappers alongside cpu-form Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 12/16] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers Tejun Heo
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

cpumask is awkward to use from BPF and unusable from arena memory;
cid/cmask work in both. Sub-sched enqueue will need cmask. Without a
full cid interface, schedulers end up mixing forms - a subtle-bug
factory.

Add sched_ext_ops_cid, which mirrors sched_ext_ops with cid/cmask
replacing cpu/cpumask in the topology-carrying callbacks.
cpu_acquire/cpu_release are deprecated and absent; a prior patch
moved them past @priv so the cid-form can omit them without
disturbing shared-field offsets.

The two structs share byte-identical layout up to @priv, so the
existing bpf_scx init/check hooks, has_op bitmap, and
scx_kf_allow_flags[] are offset-indexed and apply to both.
BUILD_BUG_ON in scx_init() pins the shared-field and renamed-callback
offsets so any future drift trips at boot.

The kernel<->BPF boundary translates between cpu and cid:

- A static key, enabled on cid-form sched load, gates the translation
  so cpu-form schedulers pay nothing.
- dispatch, update_idle, cpu_online/offline and dump_cpu translate
  the cpu arg at the callsite.
- select_cpu also translates the returned cid back to a cpu.
- set_cpumask is wrapped to synthesize a cmask in a per-cpu scratch
  before calling the cid-form callback.

All scheds in a hierarchy share one form. The static key drives the
hot-path branch.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c                       | 282 +++++++++++++++++++++--
 kernel/sched/ext_cid.c                   |  43 +++-
 kernel/sched/ext_cid.h                   |  10 +
 kernel/sched/ext_idle.c                  |   2 +-
 kernel/sched/ext_internal.h              | 109 ++++++++-
 tools/sched_ext/include/scx/compat.bpf.h |  12 +
 6 files changed, 436 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 8d52e579b96c..fcb5f98d670d 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -510,6 +510,33 @@ do {										\
 		update_locked_rq(NULL);						\
 } while (0)
 
+/*
+ * Enabled at sched load when sch->is_cid_type is set. Declared in
+ * ext_internal.h so subsystem inlines can read it.
+ */
+DEFINE_STATIC_KEY_FALSE(__scx_is_cid_type);
+
+/*
+ * scx_cpu_arg() wraps a cpu arg being handed to an SCX op. For cid-form
+ * schedulers it resolves to the matching cid; for cpu-form it passes @cpu
+ * through. scx_cpu_ret() is the inverse for a cpu/cid supplied by BPF
+ * (ops.select_cpu returns, SCX_DSQ_LOCAL_ON verdicts); it validates the
+ * BPF-supplied cid and triggers scx_error() on @sch if invalid.
+ */
+static s32 scx_cpu_arg(s32 cpu)
+{
+	if (scx_is_cid_type())
+		return __scx_cpu_to_cid(cpu);
+	return cpu;
+}
+
+static s32 scx_cpu_ret(struct scx_sched *sch, s32 cpu_or_cid)
+{
+	if (cpu_or_cid < 0 || !scx_is_cid_type())
+		return cpu_or_cid;
+	return scx_cid_to_cpu(sch, cpu_or_cid);
+}
+
 #define SCX_CALL_OP_RET(sch, op, rq, args...)					\
 ({										\
 	__typeof__((sch)->ops.op(args)) __ret;					\
@@ -568,6 +595,39 @@ do {										\
 	__ret;									\
 })
 
+/**
+ * scx_call_op_set_cpumask - invoke ops.set_cpumask / ops_cid.set_cmask for @task
+ * @sch: scx_sched being invoked
+ * @rq: rq to update as the currently-locked rq, or NULL
+ * @task: task whose affinity is changing
+ * @cpumask: new cpumask
+ *
+ * For cid-form schedulers, translate @cpumask to a cmask via the per-cpu
+ * scratch in ext_cid.c and dispatch through the ops_cid union view. Caller
+ * must hold @rq's rq lock so this_cpu_ptr is stable across the call.
+ */
+static inline void scx_call_op_set_cpumask(struct scx_sched *sch, struct rq *rq,
+					   struct task_struct *task,
+					   const struct cpumask *cpumask)
+{
+	WARN_ON_ONCE(current->scx.kf_tasks[0]);
+	current->scx.kf_tasks[0] = task;
+	if (rq)
+		update_locked_rq(rq);
+
+	if (scx_is_cid_type()) {
+		const struct scx_cmask *cmask =
+			scx_build_cmask_from_cpumask(cpumask);
+		sch->ops_cid.set_cmask(task, cmask);
+	} else {
+		sch->ops.set_cpumask(task, cpumask);
+	}
+
+	if (rq)
+		update_locked_rq(NULL);
+	current->scx.kf_tasks[0] = NULL;
+}
+
 /* see SCX_CALL_OP_TASK() */
 static __always_inline bool scx_kf_arg_task_ok(struct scx_sched *sch,
 							struct task_struct *p)
@@ -1671,7 +1731,7 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
 		return &rq->scx.local_dsq;
 
 	if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
-		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+		s32 cpu = scx_cpu_ret(sch, dsq_id & SCX_DSQ_LOCAL_CPU_MASK);
 
 		if (!scx_cpu_valid(sch, cpu, "in SCX_DSQ_LOCAL_ON dispatch verdict"))
 			return find_global_dsq(sch, tcpu);
@@ -2752,11 +2812,13 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 		dspc->nr_tasks = 0;
 
 		if (nested) {
-			SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL);
+			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+				    prev_on_sch ? prev : NULL);
 		} else {
 			/* stash @prev so that nested invocations can access it */
 			rq->scx.sub_dispatch_prev = prev;
-			SCX_CALL_OP(sch, dispatch, rq, cpu, prev_on_sch ? prev : NULL);
+			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+				    prev_on_sch ? prev : NULL);
 			rq->scx.sub_dispatch_prev = NULL;
 		}
 
@@ -3251,7 +3313,9 @@ static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flag
 		*ddsp_taskp = p;
 
 		this_rq()->scx.in_select_cpu = true;
-		cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p, prev_cpu, wake_flags);
+		cpu = SCX_CALL_OP_TASK_RET(sch, select_cpu, NULL, p,
+					   scx_cpu_arg(prev_cpu), wake_flags);
+		cpu = scx_cpu_ret(sch, cpu);
 		this_rq()->scx.in_select_cpu = false;
 		p->scx.selected_cpu = cpu;
 		*ddsp_taskp = NULL;
@@ -3301,7 +3365,7 @@ static void set_cpus_allowed_scx(struct task_struct *p,
 	 * designation pointless. Cast it away when calling the operation.
 	 */
 	if (SCX_HAS_OP(sch, set_cpumask))
-		SCX_CALL_OP_TASK(sch, set_cpumask, task_rq(p), p, (struct cpumask *)p->cpus_ptr);
+		scx_call_op_set_cpumask(sch, task_rq(p), p, (struct cpumask *)p->cpus_ptr);
 }
 
 static void handle_hotplug(struct rq *rq, bool online)
@@ -3323,9 +3387,9 @@ static void handle_hotplug(struct rq *rq, bool online)
 		scx_idle_update_selcpu_topology(&sch->ops);
 
 	if (online && SCX_HAS_OP(sch, cpu_online))
-		SCX_CALL_OP(sch, cpu_online, NULL, cpu);
+		SCX_CALL_OP(sch, cpu_online, NULL, scx_cpu_arg(cpu));
 	else if (!online && SCX_HAS_OP(sch, cpu_offline))
-		SCX_CALL_OP(sch, cpu_offline, NULL, cpu);
+		SCX_CALL_OP(sch, cpu_offline, NULL, scx_cpu_arg(cpu));
 	else
 		scx_exit(sch, SCX_EXIT_UNREG_KERN,
 			 SCX_ECODE_ACT_RESTART | SCX_ECODE_RSN_HOTPLUG,
@@ -3893,7 +3957,7 @@ static void switching_to_scx(struct rq *rq, struct task_struct *p)
 	 * different scheduler class. Keep the BPF scheduler up-to-date.
 	 */
 	if (SCX_HAS_OP(sch, set_cpumask))
-		SCX_CALL_OP_TASK(sch, set_cpumask, rq, p, (struct cpumask *)p->cpus_ptr);
+		scx_call_op_set_cpumask(sch, rq, p, (struct cpumask *)p->cpus_ptr);
 }
 
 static void switched_from_scx(struct rq *rq, struct task_struct *p)
@@ -5914,6 +5978,8 @@ static void scx_root_disable(struct scx_sched *sch)
 	mutex_unlock(&scx_enable_mutex);
 
 	WARN_ON_ONCE(scx_set_enable_state(SCX_DISABLED) != SCX_DISABLING);
+
+	static_branch_disable(&__scx_is_cid_type);
 done:
 	scx_bypass(sch, false);
 }
@@ -6277,8 +6343,7 @@ static void scx_dump_state(struct scx_sched *sch, struct scx_exit_info *ei,
 		used = seq_buf_used(&ns);
 		if (SCX_HAS_OP(sch, dump_cpu)) {
 			ops_dump_init(&ns, "  ");
-			SCX_CALL_OP(sch, dump_cpu, NULL,
-				    &dctx, cpu, idle);
+			SCX_CALL_OP(sch, dump_cpu, NULL, &dctx, scx_cpu_arg(cpu), idle);
 			ops_dump_exit();
 		}
 
@@ -6434,7 +6499,11 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
  */
 struct scx_enable_cmd {
 	struct kthread_work	work;
-	struct sched_ext_ops	*ops;
+	union {
+		struct sched_ext_ops		*ops;
+		struct sched_ext_ops_cid	*ops_cid;
+	};
+	bool			is_cid_type;
 	int			ret;
 };
 
@@ -6442,10 +6511,11 @@ struct scx_enable_cmd {
  * Allocate and initialize a new scx_sched. @cgrp's reference is always
  * consumed whether the function succeeds or fails.
  */
-static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
+static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 						 struct cgroup *cgrp,
 						 struct scx_sched *parent)
 {
+	struct sched_ext_ops *ops = cmd->ops;
 	struct scx_sched *sch;
 	s32 level = parent ? parent->level + 1 : 0;
 	s32 node, cpu, ret, bypass_fail_cpu = nr_cpu_ids;
@@ -6528,7 +6598,19 @@ static struct scx_sched *scx_alloc_and_add_sched(struct sched_ext_ops *ops,
 	init_irq_work(&sch->disable_irq_work, scx_disable_irq_workfn);
 	kthread_init_work(&sch->disable_work, scx_disable_workfn);
 	timer_setup(&sch->bypass_lb_timer, scx_bypass_lb_timerfn, 0);
-	sch->ops = *ops;
+
+	/*
+	 * Copy ops through the right union view. For cid-form the source is
+	 * struct sched_ext_ops_cid which lacks the trailing cpu_acquire/
+	 * cpu_release; those stay zero from kzalloc.
+	 */
+	if (cmd->is_cid_type) {
+		sch->ops_cid = *cmd->ops_cid;
+		sch->is_cid_type = true;
+	} else {
+		sch->ops = *cmd->ops;
+	}
+
 	rcu_assign_pointer(ops->priv, sch);
 
 	sch->kobj.kset = scx_kset;
@@ -6663,7 +6745,12 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 		return -EINVAL;
 	}
 
-	if (ops->cpu_acquire || ops->cpu_release)
+	/*
+	 * cid-form's struct is shorter and doesn't include the cpu_acquire /
+	 * cpu_release tail; reading those fields off a cid-form @ops would
+	 * run past the BPF allocation. Skip for cid-form.
+	 */
+	if (!sch->is_cid_type && (ops->cpu_acquire || ops->cpu_release))
 		pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
 
 	return 0;
@@ -6699,12 +6786,15 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 #if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
 	cgroup_get(cgrp);
 #endif
-	sch = scx_alloc_and_add_sched(ops, cgrp, NULL);
+	sch = scx_alloc_and_add_sched(cmd, cgrp, NULL);
 	if (IS_ERR(sch)) {
 		ret = PTR_ERR(sch);
 		goto err_free_tid_hash;
 	}
 
+	if (sch->is_cid_type)
+		static_branch_enable(&__scx_is_cid_type);
+
 	/*
 	 * Transition to ENABLING and clear exit info to arm the disable path.
 	 * Failure triggers full disabling from here on.
@@ -7022,7 +7112,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 	raw_spin_unlock_irq(&scx_sched_lock);
 
 	/* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
-	sch = scx_alloc_and_add_sched(ops, cgrp, parent);
+	sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
 	kobject_put(&parent->kobj);
 	if (IS_ERR(sch)) {
 		ret = PTR_ERR(sch);
@@ -7466,6 +7556,13 @@ static int bpf_scx_reg(void *kdata, struct bpf_link *link)
 	return scx_enable(&cmd, link);
 }
 
+static int bpf_scx_reg_cid(void *kdata, struct bpf_link *link)
+{
+	struct scx_enable_cmd cmd = { .ops_cid = kdata, .is_cid_type = true };
+
+	return scx_enable(&cmd, link);
+}
+
 static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
 {
 	struct sched_ext_ops *ops = kdata;
@@ -7597,6 +7694,73 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
 	.cfi_stubs = &__bpf_ops_sched_ext_ops
 };
 
+/*
+ * cid-form cfi stubs. Stubs whose signatures match the cpu-form (param types
+ * identical, only param names differ across structs) are reused; only
+ * set_cmask needs a fresh stub since the second argument type differs.
+ */
+static void sched_ext_ops_cid__set_cmask(struct task_struct *p,
+					 const struct scx_cmask *cmask) {}
+
+static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
+	.select_cid		= sched_ext_ops__select_cpu,
+	.enqueue		= sched_ext_ops__enqueue,
+	.dequeue		= sched_ext_ops__dequeue,
+	.dispatch		= sched_ext_ops__dispatch,
+	.tick			= sched_ext_ops__tick,
+	.runnable		= sched_ext_ops__runnable,
+	.running		= sched_ext_ops__running,
+	.stopping		= sched_ext_ops__stopping,
+	.quiescent		= sched_ext_ops__quiescent,
+	.yield			= sched_ext_ops__yield,
+	.core_sched_before	= sched_ext_ops__core_sched_before,
+	.set_weight		= sched_ext_ops__set_weight,
+	.set_cmask		= sched_ext_ops_cid__set_cmask,
+	.update_idle		= sched_ext_ops__update_idle,
+	.init_task		= sched_ext_ops__init_task,
+	.exit_task		= sched_ext_ops__exit_task,
+	.enable			= sched_ext_ops__enable,
+	.disable		= sched_ext_ops__disable,
+#ifdef CONFIG_EXT_GROUP_SCHED
+	.cgroup_init		= sched_ext_ops__cgroup_init,
+	.cgroup_exit		= sched_ext_ops__cgroup_exit,
+	.cgroup_prep_move	= sched_ext_ops__cgroup_prep_move,
+	.cgroup_move		= sched_ext_ops__cgroup_move,
+	.cgroup_cancel_move	= sched_ext_ops__cgroup_cancel_move,
+	.cgroup_set_weight	= sched_ext_ops__cgroup_set_weight,
+	.cgroup_set_bandwidth	= sched_ext_ops__cgroup_set_bandwidth,
+	.cgroup_set_idle	= sched_ext_ops__cgroup_set_idle,
+#endif
+	.sub_attach		= sched_ext_ops__sub_attach,
+	.sub_detach		= sched_ext_ops__sub_detach,
+	.cid_online		= sched_ext_ops__cpu_online,
+	.cid_offline		= sched_ext_ops__cpu_offline,
+	.init			= sched_ext_ops__init,
+	.exit			= sched_ext_ops__exit,
+	.dump			= sched_ext_ops__dump,
+	.dump_cid		= sched_ext_ops__dump_cpu,
+	.dump_task		= sched_ext_ops__dump_task,
+};
+
+/*
+ * The cid-form struct_ops shares its bpf_struct_ops hooks with the cpu form
+ * except .reg, which marks the command cid-form. init_member, check_member,
+ * unreg, etc. process kdata as the byte block pinned by scx_init()'s checks.
+ */
+static struct bpf_struct_ops bpf_sched_ext_ops_cid = {
+	.verifier_ops = &bpf_scx_verifier_ops,
+	.reg = bpf_scx_reg_cid,
+	.unreg = bpf_scx_unreg,
+	.check_member = bpf_scx_check_member,
+	.init_member = bpf_scx_init_member,
+	.init = bpf_scx_init,
+	.update = bpf_scx_update,
+	.validate = bpf_scx_validate,
+	.name = "sched_ext_ops_cid",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_ops_sched_ext_ops_cid
+};
+
 
 /********************************************************************************
  * System integration and init.
@@ -8797,7 +8961,7 @@ __bpf_kfunc s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
 		ret = READ_ONCE(this_rq()->scx.local_dsq.nr);
 		goto out;
 	} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
-		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
+		s32 cpu = scx_cpu_ret(sch, dsq_id & SCX_DSQ_LOCAL_CPU_MASK);
 
 		if (scx_cpu_valid(sch, cpu, NULL)) {
 			ret = READ_ONCE(cpu_rq(cpu)->scx.local_dsq.nr);
@@ -9893,8 +10057,15 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 
 	/*
 	 * Non-SCX struct_ops: SCX kfuncs are not permitted.
-	 */
-	if (prog->aux->st_ops != &bpf_sched_ext_ops)
+	 *
+	 * Both bpf_sched_ext_ops (cpu-form) and bpf_sched_ext_ops_cid
+	 * (cid-form) are valid SCX struct_ops. Member offsets match between
+	 * the two (verified by BUILD_BUG_ON in scx_init()), so the shared
+	 * scx_kf_allow_flags[] table indexed by SCX_MOFF_IDX(moff) applies to
+	 * both.
+	 */
+	if (prog->aux->st_ops != &bpf_sched_ext_ops &&
+	    prog->aux->st_ops != &bpf_sched_ext_ops_cid)
 		return -EACCES;
 
 	/* SCX struct_ops: check the per-op allow list. */
@@ -9924,6 +10095,73 @@ static int __init scx_init(void)
 {
 	int ret;
 
+	/*
+	 * sched_ext_ops_cid mirrors sched_ext_ops up to and including @priv.
+	 * Both bpf_scx_init_member() and bpf_scx_check_member() use offsets
+	 * from struct sched_ext_ops; sched_ext_ops_cid relies on those offsets
+	 * matching for the shared fields. Catch any drift at boot.
+	 */
+#define CID_OFFSET_MATCH(cpu_field, cid_field)					\
+	BUILD_BUG_ON(offsetof(struct sched_ext_ops, cpu_field) !=		\
+		     offsetof(struct sched_ext_ops_cid, cid_field))
+	/* data fields used by bpf_scx_init_member() */
+	CID_OFFSET_MATCH(dispatch_max_batch, dispatch_max_batch);
+	CID_OFFSET_MATCH(flags, flags);
+	CID_OFFSET_MATCH(name, name);
+	CID_OFFSET_MATCH(timeout_ms, timeout_ms);
+	CID_OFFSET_MATCH(exit_dump_len, exit_dump_len);
+	CID_OFFSET_MATCH(hotplug_seq, hotplug_seq);
+	CID_OFFSET_MATCH(sub_cgroup_id, sub_cgroup_id);
+	/* shared callbacks: the union view requires byte-for-byte offset match */
+	CID_OFFSET_MATCH(enqueue, enqueue);
+	CID_OFFSET_MATCH(dequeue, dequeue);
+	CID_OFFSET_MATCH(dispatch, dispatch);
+	CID_OFFSET_MATCH(tick, tick);
+	CID_OFFSET_MATCH(runnable, runnable);
+	CID_OFFSET_MATCH(running, running);
+	CID_OFFSET_MATCH(stopping, stopping);
+	CID_OFFSET_MATCH(quiescent, quiescent);
+	CID_OFFSET_MATCH(yield, yield);
+	CID_OFFSET_MATCH(core_sched_before, core_sched_before);
+	CID_OFFSET_MATCH(set_weight, set_weight);
+	CID_OFFSET_MATCH(update_idle, update_idle);
+	CID_OFFSET_MATCH(init_task, init_task);
+	CID_OFFSET_MATCH(exit_task, exit_task);
+	CID_OFFSET_MATCH(enable, enable);
+	CID_OFFSET_MATCH(disable, disable);
+	CID_OFFSET_MATCH(dump, dump);
+	CID_OFFSET_MATCH(dump_task, dump_task);
+	CID_OFFSET_MATCH(sub_attach, sub_attach);
+	CID_OFFSET_MATCH(sub_detach, sub_detach);
+	CID_OFFSET_MATCH(init, init);
+	CID_OFFSET_MATCH(exit, exit);
+#ifdef CONFIG_EXT_GROUP_SCHED
+	CID_OFFSET_MATCH(cgroup_init, cgroup_init);
+	CID_OFFSET_MATCH(cgroup_exit, cgroup_exit);
+	CID_OFFSET_MATCH(cgroup_prep_move, cgroup_prep_move);
+	CID_OFFSET_MATCH(cgroup_move, cgroup_move);
+	CID_OFFSET_MATCH(cgroup_cancel_move, cgroup_cancel_move);
+	CID_OFFSET_MATCH(cgroup_set_weight, cgroup_set_weight);
+	CID_OFFSET_MATCH(cgroup_set_bandwidth, cgroup_set_bandwidth);
+	CID_OFFSET_MATCH(cgroup_set_idle, cgroup_set_idle);
+#endif
+	/* renamed callbacks must occupy the same slot as their cpu-form sibling */
+	CID_OFFSET_MATCH(select_cpu, select_cid);
+	CID_OFFSET_MATCH(set_cpumask, set_cmask);
+	CID_OFFSET_MATCH(cpu_online, cid_online);
+	CID_OFFSET_MATCH(cpu_offline, cid_offline);
+	CID_OFFSET_MATCH(dump_cpu, dump_cid);
+	/* @priv tail must align since both share the same data block */
+	CID_OFFSET_MATCH(priv, priv);
+	/*
+	 * cid-form must end exactly at @priv - validate_ops() skips
+	 * cpu_acquire/cpu_release for cid-form because reading those fields
+	 * past the BPF allocation would be UB.
+	 */
+	BUILD_BUG_ON(sizeof(struct sched_ext_ops_cid) !=
+		     offsetofend(struct sched_ext_ops, priv));
+#undef CID_OFFSET_MATCH
+
 	/*
 	 * kfunc registration can't be done from init_sched_ext_class() as
 	 * register_btf_kfunc_id_set() needs most of the system to be up.
@@ -9974,6 +10212,12 @@ static int __init scx_init(void)
 		return ret;
 	}
 
+	ret = register_bpf_struct_ops(&bpf_sched_ext_ops_cid, sched_ext_ops_cid);
+	if (ret) {
+		pr_err("sched_ext: Failed to register cid struct_ops (%d)\n", ret);
+		return ret;
+	}
+
 	ret = register_pm_notifier(&scx_pm_notifier);
 	if (ret) {
 		pr_err("sched_ext: Failed to register PM notifier (%d)\n", ret);
diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
index c8b7cdaf82d5..20f1344f3a77 100644
--- a/kernel/sched/ext_cid.c
+++ b/kernel/sched/ext_cid.c
@@ -9,6 +9,14 @@
 
 #include "ext_cid.h"
 
+/*
+ * Per-cpu scratch cmask used by scx_call_op_set_cpumask() to synthesize a
+ * cmask from a cpumask. Allocated alongside the cid arrays on first enable
+ * and never freed. Sized to the full cid space. Caller holds rq lock so
+ * this_cpu_ptr is safe.
+ */
+static struct scx_cmask __percpu *scx_set_cmask_scratch;
+
 s16 *scx_cid_to_cpu_tbl;
 s16 *scx_cpu_to_cid_tbl;
 struct scx_cid_topo *scx_cid_topo;
@@ -44,8 +52,11 @@ static const struct cpumask *cpu_llc_mask(int cpu, struct cpumask *fallbacks)
 static s32 scx_cid_arrays_alloc(void)
 {
 	u32 npossible = num_possible_cpus();
+	size_t scratch_total = sizeof(struct scx_cmask) +
+		SCX_CMASK_NR_WORDS(npossible) * sizeof(u64);
 	s16 *cid_to_cpu, *cpu_to_cid;
 	struct scx_cid_topo *cid_topo;
+	struct scx_cmask __percpu *set_cmask_scratch;
 
 	if (scx_cid_to_cpu_tbl)
 		return 0;
@@ -53,17 +64,20 @@ static s32 scx_cid_arrays_alloc(void)
 	cid_to_cpu = kcalloc(npossible, sizeof(*scx_cid_to_cpu_tbl), GFP_KERNEL);
 	cpu_to_cid = kcalloc(nr_cpu_ids, sizeof(*scx_cpu_to_cid_tbl), GFP_KERNEL);
 	cid_topo = kmalloc_array(npossible, sizeof(*scx_cid_topo), GFP_KERNEL);
+	set_cmask_scratch = __alloc_percpu(scratch_total, sizeof(u64));
 
-	if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+	if (!cid_to_cpu || !cpu_to_cid || !cid_topo || !set_cmask_scratch) {
 		kfree(cid_to_cpu);
 		kfree(cpu_to_cid);
 		kfree(cid_topo);
+		free_percpu(set_cmask_scratch);
 		return -ENOMEM;
 	}
 
 	scx_cid_to_cpu_tbl = cid_to_cpu;
 	scx_cpu_to_cid_tbl = cpu_to_cid;
 	scx_cid_topo = cid_topo;
+	scx_set_cmask_scratch = set_cmask_scratch;
 	return 0;
 }
 
@@ -208,6 +222,33 @@ s32 scx_cid_init(struct scx_sched *sch)
 	return 0;
 }
 
+/**
+ * scx_build_cmask_from_cpumask - Build a cmask from a kernel cpumask
+ * @cpumask: source cpumask
+ *
+ * Synthesize a cmask covering the full cid space [0, num_possible_cpus())
+ * with bits set for cids whose cpu is in @cpumask. Return a pointer to the
+ * per-cpu scratch buffer, valid until the next invocation on this cpu.
+ * Caller must hold the rq lock so this_cpu_ptr() is stable.
+ */
+const struct scx_cmask *scx_build_cmask_from_cpumask(const struct cpumask *cpumask)
+{
+	struct scx_cmask *cmask;
+	s32 cpu;
+
+	lockdep_assert_irqs_disabled();
+
+	cmask = this_cpu_ptr(scx_set_cmask_scratch);
+	scx_cmask_init(cmask, 0, num_possible_cpus());
+	for_each_cpu(cpu, cpumask) {
+		s32 cid = __scx_cpu_to_cid(cpu);
+
+		if (cid >= 0)
+			__scx_cmask_set(cmask, cid);
+	}
+	return cmask;
+}
+
 __bpf_kfunc_start_defs();
 
 /**
diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
index 46f03f2150c2..b6837576d4dc 100644
--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -57,6 +57,8 @@ struct scx_cid_topo {
 	s32 node_idx;
 };
 
+const struct scx_cmask *scx_build_cmask_from_cpumask(const struct cpumask *cpumask);
+
 /*
  * Cid space (total is always num_possible_cpus()) is laid out with
  * topology-annotated cids first, then no-topo cids at the tail. The
@@ -145,6 +147,14 @@ static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
 	return __scx_cpu_to_cid(cpu);
 }
 
+/**
+ * scx_is_cid_type - Test whether the active scheduler hierarchy is cid-form
+ */
+static inline bool scx_is_cid_type(void)
+{
+	return static_branch_unlikely(&__scx_is_cid_type);
+}
+
 /*
  * cmask: variable-length, base-windowed bitmap over cid space
  * -----------------------------------------------------------
diff --git a/kernel/sched/ext_idle.c b/kernel/sched/ext_idle.c
index 11d11ea6ca6b..b7b50e4c2190 100644
--- a/kernel/sched/ext_idle.c
+++ b/kernel/sched/ext_idle.c
@@ -789,7 +789,7 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 	 */
 	if (SCX_HAS_OP(sch, update_idle) && do_notify &&
 	    !scx_bypassing(sch, cpu_of(rq)))
-		SCX_CALL_OP(sch, update_idle, rq, cpu_of(rq), idle);
+		SCX_CALL_OP(sch, update_idle, rq, scx_cpu_arg(cpu_of(rq)), idle);
 }
 
 static void reset_idle_masks(struct sched_ext_ops *ops)
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 1d73fcc19aaf..6bfa976e4f52 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -879,6 +879,95 @@ struct sched_ext_ops {
 	void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
 };
 
+struct scx_cmask;
+
+/**
+ * struct sched_ext_ops_cid - cid-form alternative to struct sched_ext_ops
+ *
+ * Mirrors struct sched_ext_ops with cpu/cpumask substituted with cid/cmask
+ * where applicable. Layout up to and including @priv matches sched_ext_ops
+ * byte-for-byte (verified by BUILD_BUG_ON checks at scx_init() time) so
+ * shared field offsets work for both struct types in bpf_scx_init_member()
+ * and bpf_scx_check_member(). The deprecated cpu_acquire/cpu_release
+ * callbacks at the tail of sched_ext_ops are omitted here entirely.
+ *
+ * Differences from sched_ext_ops:
+ *   - select_cpu       -> select_cid (returns cid)
+ *   - dispatch         -> dispatch (cpu arg is now cid)
+ *   - update_idle      -> update_idle (cpu arg is now cid)
+ *   - set_cpumask      -> set_cmask (cmask instead of cpumask)
+ *   - cpu_online       -> cid_online
+ *   - cpu_offline      -> cid_offline
+ *   - dump_cpu         -> dump_cid
+ *   - cpu_acquire/cpu_release  -> not present (deprecated in sched_ext_ops)
+ *
+ * BPF schedulers using this type cannot call cpu-form scx_bpf_* kfuncs;
+ * use the cid-form variants instead. Enforced at BPF verifier time via
+ * scx_kfunc_context_filter() branching on prog->aux->st_ops.
+ *
+ * See sched_ext_ops for callback documentation.
+ */
+struct sched_ext_ops_cid {
+	s32 (*select_cid)(struct task_struct *p, s32 prev_cid, u64 wake_flags);
+	void (*enqueue)(struct task_struct *p, u64 enq_flags);
+	void (*dequeue)(struct task_struct *p, u64 deq_flags);
+	void (*dispatch)(s32 cid, struct task_struct *prev);
+	void (*tick)(struct task_struct *p);
+	void (*runnable)(struct task_struct *p, u64 enq_flags);
+	void (*running)(struct task_struct *p);
+	void (*stopping)(struct task_struct *p, bool runnable);
+	void (*quiescent)(struct task_struct *p, u64 deq_flags);
+	bool (*yield)(struct task_struct *from, struct task_struct *to);
+	bool (*core_sched_before)(struct task_struct *a,
+				   struct task_struct *b);
+	void (*set_weight)(struct task_struct *p, u32 weight);
+	void (*set_cmask)(struct task_struct *p,
+			   const struct scx_cmask *cmask);
+	void (*update_idle)(s32 cid, bool idle);
+	s32 (*init_task)(struct task_struct *p,
+			  struct scx_init_task_args *args);
+	void (*exit_task)(struct task_struct *p,
+			   struct scx_exit_task_args *args);
+	void (*enable)(struct task_struct *p);
+	void (*disable)(struct task_struct *p);
+	void (*dump)(struct scx_dump_ctx *ctx);
+	void (*dump_cid)(struct scx_dump_ctx *ctx, s32 cid, bool idle);
+	void (*dump_task)(struct scx_dump_ctx *ctx, struct task_struct *p);
+#ifdef CONFIG_EXT_GROUP_SCHED
+	s32 (*cgroup_init)(struct cgroup *cgrp,
+			    struct scx_cgroup_init_args *args);
+	void (*cgroup_exit)(struct cgroup *cgrp);
+	s32 (*cgroup_prep_move)(struct task_struct *p,
+				 struct cgroup *from, struct cgroup *to);
+	void (*cgroup_move)(struct task_struct *p,
+			     struct cgroup *from, struct cgroup *to);
+	void (*cgroup_cancel_move)(struct task_struct *p,
+				    struct cgroup *from, struct cgroup *to);
+	void (*cgroup_set_weight)(struct cgroup *cgrp, u32 weight);
+	void (*cgroup_set_bandwidth)(struct cgroup *cgrp,
+				      u64 period_us, u64 quota_us, u64 burst_us);
+	void (*cgroup_set_idle)(struct cgroup *cgrp, bool idle);
+#endif	/* CONFIG_EXT_GROUP_SCHED */
+	s32 (*sub_attach)(struct scx_sub_attach_args *args);
+	void (*sub_detach)(struct scx_sub_detach_args *args);
+	void (*cid_online)(s32 cid);
+	void (*cid_offline)(s32 cid);
+	s32 (*init)(void);
+	void (*exit)(struct scx_exit_info *info);
+
+	/* Data fields - must match sched_ext_ops layout exactly */
+	u32 dispatch_max_batch;
+	u64 flags;
+	u32 timeout_ms;
+	u32 exit_dump_len;
+	u64 hotplug_seq;
+	u64 sub_cgroup_id;
+	char name[SCX_OPS_NAME_LEN];
+
+	/* internal use only, must be NULL */
+	void __rcu *priv;
+};
+
 enum scx_opi {
 	SCX_OPI_BEGIN			= 0,
 	SCX_OPI_NORMAL_BEGIN		= 0,
@@ -1035,7 +1124,18 @@ struct scx_sched_pnode {
 };
 
 struct scx_sched {
-	struct sched_ext_ops	ops;
+	/*
+	 * cpu-form and cid-form ops share field offsets up to .priv (verified
+	 * by BUILD_BUG_ON in scx_init()). The anonymous union lets the kernel
+	 * access either view of the same storage without function-pointer
+	 * casts: use .ops for cpu-form and shared fields, .ops_cid for the
+	 * cid-renamed callbacks (set_cmask, select_cid, cid_online, ...).
+	 */
+	union {
+		struct sched_ext_ops		ops;
+		struct sched_ext_ops_cid	ops_cid;
+	};
+	bool			is_cid_type;	/* true if registered via bpf_sched_ext_ops_cid */
 	DECLARE_BITMAP(has_op, SCX_OPI_END);
 
 	/*
@@ -1390,6 +1490,13 @@ enum scx_ops_state {
 extern struct scx_sched __rcu *scx_root;
 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
 
+/*
+ * True when the currently loaded scheduler hierarchy is cid-form. All scheds
+ * in a hierarchy share one form, so this single key tells callsites which
+ * view to use without per-sch dereferences. Use scx_is_cid_type() to test.
+ */
+DECLARE_STATIC_KEY_FALSE(__scx_is_cid_type);
+
 int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id);
 
 bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where);
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 6b9d054c3e4f..87f15f296234 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -446,4 +446,16 @@ static inline void scx_bpf_dsq_reenq(u64 dsq_id, u64 reenq_flags)
 		__VA_ARGS__,							\
 	};
 
+/*
+ * Define a cid-form sched_ext_ops. Programs targeting this struct_ops type
+ * use cid-form callback signatures (select_cid, set_cmask, cid_online/offline,
+ * dispatch with cid arg, etc.) and may only call the cid-form scx_bpf_*
+ * kfuncs (kick_cid, task_cid, this_cid, ...).
+ */
+#define SCX_OPS_CID_DEFINE(__name, ...)						\
+	SEC(".struct_ops.link")							\
+	struct sched_ext_ops_cid __name = {					\
+		__VA_ARGS__,							\
+	};
+
 #endif	/* __SCX_COMPAT_BPF_H */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 12/16] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (10 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 11/16] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 13/16] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline Tejun Heo
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

cid and cpu are both small s32s and thus trivially confused when a
cid-form scheduler calls a cpu-keyed kfunc. Reject cid-form programs
that reference any kfunc in the new scx_kfunc_ids_cpu_only set at
verifier load time.

The reverse direction is intentionally permissive: cpu-form schedulers
can freely call cid-form kfuncs to ease a gradual cpumask -> cid
migration.

The check sits in scx_kfunc_context_filter() right after the SCX
struct_ops gate and before the any/idle allow check and the per-op
allow-list check, so it catches cpu-only kfuncs regardless of which
set they belong to (any, idle, or select_cpu).
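
Not part of the patch - a minimal userspace sketch of the deny-set shape
the filter relies on. BTF ID sets are sorted arrays probed by binary
search (the spirit of btf_id_set8_contains()), and the filter maps
(cid-form program, cpu-only kfunc) to -EACCES. The function names and
sample IDs here are illustrative, not kernel API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a BTF ID set: sorted kfunc BTF IDs. */
static const uint32_t cpu_only_ids[] = { 101, 205, 733, 1042 };

/* Binary search over the sorted set, like btf_id_set8_contains(). */
static bool id_set_contains(const uint32_t *ids, size_t n, uint32_t id)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ids[mid] == id)
			return true;
		if (ids[mid] < id)
			lo = mid + 1;
		else
			hi = mid;
	}
	return false;
}

/* Filter decision: cid-form program + cpu-only kfunc -> -EACCES. */
static int filter_kfunc(bool prog_is_cid_form, uint32_t kfunc_id)
{
	size_t n = sizeof(cpu_only_ids) / sizeof(cpu_only_ids[0]);

	if (prog_is_cid_form && id_set_contains(cpu_only_ids, n, kfunc_id))
		return -13; /* -EACCES */
	return 0;	    /* cpu-form callers pass through unchecked */
}
```

The asymmetry falls out directly: only the cid-form branch consults the
deny set, so cpu-form programs keep calling cid-form kfuncs freely.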

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index fcb5f98d670d..02bdd393bbe4 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -9968,6 +9968,42 @@ static const struct btf_kfunc_id_set scx_kfunc_set_any = {
 	.filter			= scx_kfunc_context_filter,
 };
 
+/*
+ * cpu-form kfuncs that are forbidden from cid-form schedulers
+ * (bpf_sched_ext_ops_cid). Programs targeting the cid struct_ops type must
+ * use the cid-form alternative (cid/cmask kfuncs).
+ *
+ * Membership overlaps with scx_kfunc_ids_{any,idle,select_cpu}; the filter
+ * tests this set independently and rejects matches before the per-op
+ * allow-list check runs.
+ */
+BTF_KFUNCS_START(scx_kfunc_ids_cpu_only)
+BTF_ID_FLAGS(func, scx_bpf_kick_cpu)
+BTF_ID_FLAGS(func, scx_bpf_task_cpu)
+BTF_ID_FLAGS(func, scx_bpf_cpu_rq)
+BTF_ID_FLAGS(func, scx_bpf_cpu_curr)
+BTF_ID_FLAGS(func, scx_bpf_cpu_node)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cap)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_cur)
+BTF_ID_FLAGS(func, scx_bpf_cpuperf_set)
+BTF_ID_FLAGS(func, scx_bpf_get_possible_cpumask)
+BTF_ID_FLAGS(func, scx_bpf_get_online_cpumask)
+BTF_ID_FLAGS(func, scx_bpf_put_cpumask)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_dfl)
+BTF_ID_FLAGS(func, __scx_bpf_select_cpu_and)
+BTF_ID_FLAGS(func, scx_bpf_select_cpu_and)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask_node)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask)
+BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask_node)
+BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask)
+BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu)
+BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu_node)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu)
+BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu_node)
+BTF_KFUNCS_END(scx_kfunc_ids_cpu_only)
+
 /*
  * Per-op kfunc allow flags. Each bit corresponds to a context-sensitive kfunc
  * group; an op may permit zero or more groups, with the union expressed in
@@ -10031,6 +10067,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	bool in_cpu_release = btf_id_set8_contains(&scx_kfunc_ids_cpu_release, kfunc_id);
 	bool in_idle = btf_id_set8_contains(&scx_kfunc_ids_idle, kfunc_id);
 	bool in_any = btf_id_set8_contains(&scx_kfunc_ids_any, kfunc_id);
+	bool in_cpu_only = btf_id_set8_contains(&scx_kfunc_ids_cpu_only, kfunc_id);
 	u32 moff, flags;
 
 	/* Not an SCX kfunc - allow. */
@@ -10068,6 +10105,15 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	    prog->aux->st_ops != &bpf_sched_ext_ops_cid)
 		return -EACCES;
 
+	/*
+	 * cid-form schedulers must use cid/cmask kfuncs. cid and cpu are both
+	 * small s32s and trivially confused, so cpu-only kfuncs are rejected at
+	 * load time. The reverse (cpu-form calling cid-form kfuncs) is
+	 * intentionally permissive to ease gradual cpumask -> cid migration.
+	 */
+	if (prog->aux->st_ops == &bpf_sched_ext_ops_cid && in_cpu_only)
+		return -EACCES;
+
 	/* SCX struct_ops: check the per-op allow list. */
 	if (in_any || in_idle)
 		return 0;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 13/16] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (11 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 12/16] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 14/16] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick Tejun Heo
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

The cid mapping is built from the online cpu set when the scheduler is
enabled and stays valid only for that set; any subsequent CPU hotplug
invalidates it. The default cid-form behavior is to restart the
scheduler so the mapping gets rebuilt against the new online set, and
that requires not implementing cpu_online / cpu_offline, which would
suppress the kernel's ACT_RESTART exit.

Drop the two ops along with their print_cpus() helper - the printed
online/possible CPU map was only useful as a hotplug demo and is
meaningless over the dense cid space the scheduler is about to move to.
Wire main() to handle the ACT_RESTART exit code by reopening the skel
and reattaching, matching the pattern in scx_simple / scx_central /
scx_flatcg etc. Reset optind so getopt re-parses argv into the fresh
skel rodata on each iteration.
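
Not part of the patch - a small standalone sketch of the optind-reset
idiom the restart loop depends on. getopt() keeps its scan position in
the global optind; resetting it to 1 at the top of each iteration lets
the same argv be re-parsed after a restart. parse_verbose() is a
hypothetical option parser, not qmap code:

```c
#include <getopt.h>
#include <stdbool.h>

/*
 * Parse -v out of argv. Resetting optind to 1 restarts getopt()'s scan
 * from argv[1], so calling this again after a previous full parse (as
 * the restart: loop does) works instead of immediately returning -1.
 */
static bool parse_verbose(int argc, char **argv)
{
	bool verbose = false;
	int opt;

	optind = 1;	/* restart getopt's scan for this iteration */
	while ((opt = getopt(argc, argv, "v")) != -1)
		if (opt == 'v')
			verbose = true;
	return verbose;
}
```

Without the reset, the second parse would start at the stale optind left
behind by the first and silently skip every option.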

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 62 ----------------------------------
 tools/sched_ext/scx_qmap.c     | 13 +++----
 2 files changed, 7 insertions(+), 68 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 39acabef56b7..35a2dc6dd757 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -841,63 +841,6 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
 			   cgrp->kn->id, period_us, quota_us, burst_us);
 }
 
-/*
- * Print out the online and possible CPU map using bpf_printk() as a
- * demonstration of using the cpumask kfuncs and ops.cpu_on/offline().
- */
-static void print_cpus(void)
-{
-	const struct cpumask *possible, *online;
-	s32 cpu;
-	char buf[128] = "", *p;
-	int idx;
-
-	possible = scx_bpf_get_possible_cpumask();
-	online = scx_bpf_get_online_cpumask();
-
-	idx = 0;
-	bpf_for(cpu, 0, scx_bpf_nr_cpu_ids()) {
-		if (!(p = MEMBER_VPTR(buf, [idx++])))
-			break;
-		if (bpf_cpumask_test_cpu(cpu, online))
-			*p++ = 'O';
-		else if (bpf_cpumask_test_cpu(cpu, possible))
-			*p++ = 'X';
-		else
-			*p++ = ' ';
-
-		if ((cpu & 7) == 7) {
-			if (!(p = MEMBER_VPTR(buf, [idx++])))
-				break;
-			*p++ = '|';
-		}
-	}
-	buf[sizeof(buf) - 1] = '\0';
-
-	scx_bpf_put_cpumask(online);
-	scx_bpf_put_cpumask(possible);
-
-	bpf_printk("CPUS: |%s", buf);
-}
-
-void BPF_STRUCT_OPS(qmap_cpu_online, s32 cpu)
-{
-	if (print_msgs) {
-		bpf_printk("CPU %d coming online", cpu);
-		/* @cpu is already online at this point */
-		print_cpus();
-	}
-}
-
-void BPF_STRUCT_OPS(qmap_cpu_offline, s32 cpu)
-{
-	if (print_msgs) {
-		bpf_printk("CPU %d going offline", cpu);
-		/* @cpu is still online at this point */
-		print_cpus();
-	}
-}
-
 struct monitor_timer {
 	struct bpf_timer timer;
 };
@@ -1076,9 +1019,6 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 		slab[i].next_free = (i + 1 < max_tasks) ? &slab[i + 1] : NULL;
 	qa.task_free_head = &slab[0];
 
-	if (print_msgs && !sub_cgroup_id)
-		print_cpus();
-
 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
 	if (ret) {
 		scx_bpf_error("failed to create DSQ %d (%d)", SHARED_DSQ, ret);
@@ -1172,8 +1112,6 @@ SCX_OPS_DEFINE(qmap_ops,
 	       .cgroup_set_bandwidth	= (void *)qmap_cgroup_set_bandwidth,
 	       .sub_attach		= (void *)qmap_sub_attach,
 	       .sub_detach		= (void *)qmap_sub_detach,
-	       .cpu_online		= (void *)qmap_cpu_online,
-	       .cpu_offline		= (void *)qmap_cpu_offline,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
 	       .timeout_ms		= 5000U,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 725c4880058d..99408b1bb1ec 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -67,12 +67,14 @@ int main(int argc, char **argv)
 	struct bpf_link *link;
 	struct qmap_arena *qa;
 	__u32 test_error_cnt = 0;
+	__u64 ecode;
 	int opt;
 
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
-
+restart:
+	optind = 1;
 	skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
 
 	skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
@@ -184,11 +186,10 @@ int main(int argc, char **argv)
 	}
 
 	bpf_link__destroy(link);
-	UEI_REPORT(skel, uei);
+	ecode = UEI_REPORT(skel, uei);
 	scx_qmap__destroy(skel);
-	/*
-	 * scx_qmap implements ops.cpu_on/offline() and doesn't need to restart
-	 * on CPU hotplug events.
-	 */
+
+	if (UEI_ECODE_RESTART(ecode))
+		goto restart;
 	return 0;
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 14/16] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (12 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 13/16] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 15/16] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Switch qmap's idle-cpu picker from scx_bpf_pick_idle_cpu() to a
BPF-side bitmap scan, still under cpu-form struct_ops. qa_idle_cids
tracks idle cids (updated from update_idle) and each task's
taskc->cpus_allowed tracks its allowed cids (built in set_cpumask /
init_task); select_cpu / enqueue scan the intersection for an idle cid.
Callbacks translate cpu <-> cid on entry; the upcoming cid-form port
drops those translations.

The scan is bare-bones - no core preference or other topology-aware
picks like the in-kernel picker has - but qmap is a demo and this is
enough to exercise the plumbing.
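
Not part of the patch - a single-word userspace sketch of the
find-then-claim structure pick_direct_dispatch_cpu() uses: try prev
first, then round-robin find the next cid in (allowed AND idle), claim
it with an atomic test-and-clear, and retry a bounded number of times
if the claim races. Names (pick_idle_cid etc.) and the one-word bitmap
are illustrative stand-ins for the cmask helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CIDS		16
#define PICK_RETRIES	4	/* like IDLE_PICK_RETRIES, bounds the loop */

/* Hypothetical stand-in for the arena-resident qa_idle_cids cmask. */
static _Atomic unsigned int idle_bits;

/* Atomically claim cid; true iff its idle bit was set. */
static bool test_and_clear_idle(int cid)
{
	unsigned int bit = 1u << cid;

	return atomic_fetch_and(&idle_bits, ~bit) & bit;
}

/* Find the next cid in (allowed AND idle), rotating from @from. */
static int next_allowed_idle(unsigned int allowed, int from)
{
	unsigned int cand = allowed & atomic_load(&idle_bits);
	int i;

	for (i = 0; i < NR_CIDS; i++) {
		int cid = (from + i) % NR_CIDS;

		if (cand & (1u << cid))
			return cid;
	}
	return -1;
}

static int pick_idle_cid(unsigned int allowed, int prev)
{
	int r;

	/* prev is tried directly, before the intersection scan */
	if (test_and_clear_idle(prev))
		return prev;

	for (r = 0; r < PICK_RETRIES; r++) {
		int cid = next_allowed_idle(allowed, (prev + 1) % NR_CIDS);

		if (cid < 0)
			return -1;
		if (test_and_clear_idle(cid))	/* may race; retry */
			return cid;
	}
	return -1;
}
```

The find and the claim are separate steps, so a concurrent picker can
clear the bit in between; the bounded retry loop absorbs that without
blowing the verifier's instruction budget in the real BPF version.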

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 131 +++++++++++++++++++++++++++++----
 1 file changed, 115 insertions(+), 16 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 35a2dc6dd757..d30ec914a118 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -72,6 +72,13 @@ struct {
 
 struct qmap_arena __arena qa;
 
+/*
+ * Global idle-cid tracking, maintained via update_idle / cpu_offline and
+ * scanned by the direct-dispatch path. Allocated in qmap_init() from one
+ * arena page, sized to the full cid space.
+ */
+struct scx_cmask __arena *qa_idle_cids;
+
 /* Per-queue locks. Each in its own .data section as bpf_res_spin_lock requires. */
 __hidden struct bpf_res_spin_lock qa_q_lock0 SEC(".data.qa_q_lock0");
 __hidden struct bpf_res_spin_lock qa_q_lock1 SEC(".data.qa_q_lock1");
@@ -132,8 +139,18 @@ struct task_ctx {
 	bool			force_local;	/* Dispatch directly to local_dsq */
 	bool			highpri;
 	u64			core_sched_seq;
+	struct scx_cmask	cpus_allowed;	/* per-task affinity in cid space */
 };
 
+/*
+ * Slab stride for task_ctx. cpus_allowed's flex array bits[] overlaps the
+ * tail bytes appended per entry; struct_size() gives the actual per-entry
+ * footprint.
+ */
+#define TASK_CTX_STRIDE							\
+	struct_size_t(struct task_ctx, cpus_allowed.bits,		\
+		      CMASK_NR_WORDS(SCX_QMAP_MAX_CPUS))
+
 /* All task_ctx pointers are arena pointers. */
 typedef struct task_ctx __arena task_ctx_t;
 
@@ -161,20 +178,37 @@ static int qmap_spin_lock(struct bpf_res_spin_lock *lock)
 	return 0;
 }
 
-static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu)
+/*
+ * Try prev_cpu's cid, then scan taskc->cpus_allowed AND qa_idle_cids
+ * round-robin from prev_cid + 1. Atomic claim retries on race; bounded
+ * by IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
+ */
+#define IDLE_PICK_RETRIES	16
+
+static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
+				    task_ctx_t *taskc)
 {
-	s32 cpu;
+	u32 nr_cids = scx_bpf_nr_cids();
+	s32 prev_cid, cid;
+	u32 i;
 
 	if (!always_enq_immed && p->nr_cpus_allowed == 1)
 		return prev_cpu;
 
-	if (scx_bpf_test_and_clear_cpu_idle(prev_cpu))
+	prev_cid = scx_bpf_cpu_to_cid(prev_cpu);
+	if (cmask_test_and_clear(qa_idle_cids, prev_cid))
 		return prev_cpu;
 
-	cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
-	if (cpu >= 0)
-		return cpu;
-
+	cid = prev_cid;
+	bpf_for(i, 0, IDLE_PICK_RETRIES) {
+		cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
+					      qa_idle_cids, cid + 1);
+		barrier_var(cid);
+		if (cid >= nr_cids)
+			return -1;
+		if (cmask_test_and_clear(qa_idle_cids, cid))
+			return scx_bpf_cid_to_cpu(cid);
+	}
 	return -1;
 }
 
@@ -286,7 +320,7 @@ s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
 	if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
 		return prev_cpu;
 
-	cpu = pick_direct_dispatch_cpu(p, prev_cpu);
+	cpu = pick_direct_dispatch_cpu(p, prev_cpu, taskc);
 
 	if (cpu >= 0) {
 		taskc->force_local = true;
@@ -379,7 +413,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 
 	/* if select_cpu() wasn't called, try direct dispatch */
 	if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
-	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p))) >= 0) {
+	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p), taskc)) >= 0) {
 		__sync_fetch_and_add(&qa.nr_ddsp_from_enq, 1);
 		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
 		return;
@@ -724,6 +758,10 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init_task, struct task_struct *p,
 	taskc->force_local = false;
 	taskc->highpri = false;
 	taskc->core_sched_seq = 0;
+	cmask_init(&taskc->cpus_allowed, 0, scx_bpf_nr_cids());
+	bpf_rcu_read_lock();
+	cmask_from_cpumask(&taskc->cpus_allowed, p->cpus_ptr);
+	bpf_rcu_read_unlock();
 
 	v = bpf_task_storage_get(&task_ctx_stor, p, NULL,
 				 BPF_LOCAL_STORAGE_GET_F_CREATE);
@@ -841,6 +879,48 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
 			   cgrp->kn->id, period_us, quota_us, burst_us);
 }
 
+void BPF_STRUCT_OPS(qmap_update_idle, s32 cpu, bool idle)
+{
+	s32 cid = scx_bpf_cpu_to_cid(cpu);
+
+	QMAP_TOUCH_ARENA();
+	if (cid < 0)
+		return;
+	if (idle)
+		cmask_set(qa_idle_cids, cid);
+	else
+		cmask_clear(qa_idle_cids, cid);
+}
+
+/*
+ * The cpumask received here is kernel-address memory; walk it bit by bit
+ * (bpf_cpumask_test_cpu handles the access), convert each set cpu to its
+ * cid, and populate the arena-resident taskc cmask.
+ */
+void BPF_STRUCT_OPS(qmap_set_cpumask, struct task_struct *p,
+		    const struct cpumask *cpumask)
+{
+	task_ctx_t *taskc;
+	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	s32 cpu;
+
+	taskc = lookup_task_ctx(p);
+	if (!taskc)
+		return;
+
+	cmask_zero(&taskc->cpus_allowed);
+
+	bpf_for(cpu, 0, nr_cpu_ids) {
+		s32 cid;
+
+		if (!bpf_cpumask_test_cpu(cpu, cpumask))
+			continue;
+		cid = scx_bpf_cpu_to_cid(cpu);
+		if (cid >= 0)
+			__cmask_set(&taskc->cpus_allowed, cid);
+	}
+}
+
 struct monitor_timer {
 	struct bpf_timer timer;
 };
@@ -990,34 +1070,51 @@ static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer)
 
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
-	task_ctx_t *slab;
+	u8 __arena *slab;
 	u32 nr_pages, key = 0, i;
 	struct bpf_timer *timer;
 	s32 ret;
 
 	/*
 	 * Allocate the task_ctx slab in arena and thread the entire slab onto
-	 * the free list. max_tasks is set by userspace before load.
+	 * the free list. max_tasks is set by userspace before load. Each entry
+	 * is TASK_CTX_STRIDE bytes - task_ctx's trailing cpus_allowed flex
+	 * array extends into the stride tail.
 	 */
 	if (!max_tasks) {
 		scx_bpf_error("max_tasks must be > 0");
 		return -EINVAL;
 	}
 
-	nr_pages = (max_tasks * sizeof(struct task_ctx) + PAGE_SIZE - 1) / PAGE_SIZE;
+	nr_pages = (max_tasks * TASK_CTX_STRIDE + PAGE_SIZE - 1) / PAGE_SIZE;
 	slab = bpf_arena_alloc_pages(&arena, NULL, nr_pages, NUMA_NO_NODE, 0);
 	if (!slab) {
 		scx_bpf_error("failed to allocate task_ctx slab");
 		return -ENOMEM;
 	}
-	qa.task_ctxs = slab;
+	qa.task_ctxs = (task_ctx_t *)slab;
 
 	bpf_for(i, 0, 5)
 		qa.fifos[i].idx = i;
 
-	bpf_for(i, 0, max_tasks)
-		slab[i].next_free = (i + 1 < max_tasks) ? &slab[i + 1] : NULL;
-	qa.task_free_head = &slab[0];
+	bpf_for(i, 0, max_tasks) {
+		task_ctx_t *cur = (task_ctx_t *)(slab + i * TASK_CTX_STRIDE);
+		task_ctx_t *next = (i + 1 < max_tasks) ?
+			(task_ctx_t *)(slab + (i + 1) * TASK_CTX_STRIDE) : NULL;
+		cur->next_free = next;
+	}
+	qa.task_free_head = (task_ctx_t *)slab;
+
+	/*
+	 * Allocate and initialize the idle cmask. Starts empty - update_idle
+	 * fills it as cpus enter idle.
+	 */
+	qa_idle_cids = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!qa_idle_cids) {
+		scx_bpf_error("failed to allocate idle cmask");
+		return -ENOMEM;
+	}
+	cmask_init(qa_idle_cids, 0, scx_bpf_nr_cids());
 
 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
 	if (ret) {
@@ -1102,6 +1199,8 @@ SCX_OPS_DEFINE(qmap_ops,
 	       .dispatch		= (void *)qmap_dispatch,
 	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
+	       .set_cpumask		= (void *)qmap_set_cpumask,
+	       .update_idle		= (void *)qmap_update_idle,
 	       .init_task		= (void *)qmap_init_task,
 	       .exit_task		= (void *)qmap_exit_task,
 	       .dump			= (void *)qmap_dump,
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 15/16] tools/sched_ext: scx_qmap: Port to cid-form struct_ops
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (13 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 14/16] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21  7:19 ` [PATCH 16/16] sched_ext: Require cid-form struct_ops for sub-sched support Tejun Heo
  2026-04-21 18:18 ` [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Cheng-Yang Chou
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Flip qmap's struct_ops to bpf_sched_ext_ops_cid. The kernel now passes
cids and cmasks to callbacks directly, so the per-callback cpu <-> cid
translations added by the previous patch drop out and cpu_ctxs[] is
reindexed by cid. cpu-form kfunc calls are switched to their cid-form
counterparts.

The cpu-only kfuncs (idle/any pick, cpumask iteration) have no cid
substitute. Their callers already moved to cmask scans against
qa_idle_cids and taskc->cpus_allowed in the prior patch, so the kfunc
calls drop here without behavior changes.

set_cmask is wired up via cmask_copy_from_kernel() to copy the
kernel-supplied cmask into the arena-resident taskc cmask, and the
cpuperf monitor iterates cids using the cid-form perf kfuncs.
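
Not part of the patch - a standalone sketch of the scaling the monitor
does. Each cur value is relative to its own cap, so it is scaled by
cap / SCX_CPUPERF_ONE before summing; cur_sum / cap_sum then yields a
capacity-weighted average back in SCX_CPUPERF_ONE units. cpuperf_avg()
and its array inputs are illustrative, not kernel or qmap API:

```c
#include <stdint.h>

#define SCX_CPUPERF_ONE	1024	/* same scale the kernel interface uses */

/*
 * Capacity-weighted average of per-cid performance levels. cur[i] is
 * relative to cap[i], so scale it into absolute terms before summing;
 * dividing by the capacity sum normalizes back to SCX_CPUPERF_ONE.
 */
static uint64_t cpuperf_avg(const uint32_t *cap, const uint32_t *cur, int n)
{
	uint64_t cap_sum = 0, cur_sum = 0;
	int i;

	for (i = 0; i < n; i++) {
		cur_sum += (uint64_t)cur[i] * cap[i] / SCX_CPUPERF_ONE;
		cap_sum += cap[i];
	}
	/* guard the division, mirroring the !cap_sum bail in the patch */
	return cap_sum ? cur_sum * SCX_CPUPERF_ONE / cap_sum : 0;
}
```

On asymmetric (big.LITTLE-style) systems this weighting keeps a small
core running flat out from counting the same as a big one.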

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 195 +++++++++++++++------------------
 tools/sched_ext/scx_qmap.c     |  14 ++-
 tools/sched_ext/scx_qmap.h     |   2 +-
 3 files changed, 98 insertions(+), 113 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index d30ec914a118..ceb136935ffa 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -179,25 +179,24 @@ static int qmap_spin_lock(struct bpf_res_spin_lock *lock)
 }
 
 /*
- * Try prev_cpu's cid, then scan taskc->cpus_allowed AND qa_idle_cids
- * round-robin from prev_cid + 1. Atomic claim retries on race; bounded
- * by IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
+ * Try prev_cid, then scan taskc->cpus_allowed AND qa_idle_cids round-robin
+ * from prev_cid + 1. Atomic claim retries on race; bounded by
+ * IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
  */
 #define IDLE_PICK_RETRIES	16
 
-static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
+static s32 pick_direct_dispatch_cid(struct task_struct *p, s32 prev_cid,
 				    task_ctx_t *taskc)
 {
 	u32 nr_cids = scx_bpf_nr_cids();
-	s32 prev_cid, cid;
+	s32 cid;
 	u32 i;
 
 	if (!always_enq_immed && p->nr_cpus_allowed == 1)
-		return prev_cpu;
+		return prev_cid;
 
-	prev_cid = scx_bpf_cpu_to_cid(prev_cpu);
 	if (cmask_test_and_clear(qa_idle_cids, prev_cid))
-		return prev_cpu;
+		return prev_cid;
 
 	cid = prev_cid;
 	bpf_for(i, 0, IDLE_PICK_RETRIES) {
@@ -207,7 +206,7 @@ static s32 pick_direct_dispatch_cpu(struct task_struct *p, s32 prev_cpu,
 		if (cid >= nr_cids)
 			return -1;
 		if (cmask_test_and_clear(qa_idle_cids, cid))
-			return scx_bpf_cid_to_cpu(cid);
+			return cid;
 	}
 	return -1;
 }
@@ -308,25 +307,25 @@ static void qmap_fifo_remove(task_ctx_t *taskc)
 	bpf_res_spin_unlock(lock);
 }
 
-s32 BPF_STRUCT_OPS(qmap_select_cpu, struct task_struct *p,
-		   s32 prev_cpu, u64 wake_flags)
+s32 BPF_STRUCT_OPS(qmap_select_cid, struct task_struct *p,
+		   s32 prev_cid, u64 wake_flags)
 {
 	task_ctx_t *taskc;
-	s32 cpu;
+	s32 cid;
 
 	if (!(taskc = lookup_task_ctx(p)))
 		return -ESRCH;
 
 	if (p->scx.weight < 2 && !(p->flags & PF_KTHREAD))
-		return prev_cpu;
+		return prev_cid;
 
-	cpu = pick_direct_dispatch_cpu(p, prev_cpu, taskc);
+	cid = pick_direct_dispatch_cid(p, prev_cid, taskc);
 
-	if (cpu >= 0) {
+	if (cid >= 0) {
 		taskc->force_local = true;
-		return cpu;
+		return cid;
 	} else {
-		return prev_cpu;
+		return prev_cid;
 	}
 }
 
@@ -350,12 +349,12 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	static u32 user_cnt, kernel_cnt;
 	task_ctx_t *taskc;
 	int idx = weight_to_idx(p->scx.weight);
-	s32 cpu;
+	s32 cid;
 
 	if (enq_flags & SCX_ENQ_REENQ) {
 		__sync_fetch_and_add(&qa.nr_reenqueued, 1);
-		if (scx_bpf_task_cpu(p) == 0)
-			__sync_fetch_and_add(&qa.nr_reenqueued_cpu0, 1);
+		if (scx_bpf_task_cid(p) == 0)
+			__sync_fetch_and_add(&qa.nr_reenqueued_cid0, 1);
 	}
 
 	if (p->flags & PF_KTHREAD) {
@@ -388,14 +387,14 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 
 		if (!(++immed_stress_cnt % immed_stress_nth)) {
 			taskc->force_local = false;
-			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cpu(p),
+			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | scx_bpf_task_cid(p),
 					   slice_ns, enq_flags);
 			return;
 		}
 	}
 
 	/*
-	 * If qmap_select_cpu() is telling us to or this is the last runnable
+	 * If qmap_select_cid() is telling us to or this is the last runnable
 	 * task on the CPU, enqueue locally.
 	 */
 	if (taskc->force_local) {
@@ -411,11 +410,11 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 		return;
 	}
 
-	/* if select_cpu() wasn't called, try direct dispatch */
+	/* if select_cid() wasn't called, try direct dispatch */
 	if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
-	    (cpu = pick_direct_dispatch_cpu(p, scx_bpf_task_cpu(p), taskc)) >= 0) {
+	    (cid = pick_direct_dispatch_cid(p, scx_bpf_task_cid(p), taskc)) >= 0) {
 		__sync_fetch_and_add(&qa.nr_ddsp_from_enq, 1);
-		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, slice_ns, enq_flags);
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cid, slice_ns, enq_flags);
 		return;
 	}
 
@@ -423,15 +422,16 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	 * If the task was re-enqueued due to the CPU being preempted by a
 	 * higher priority scheduling class, just re-enqueue the task directly
 	 * on the global DSQ. As we want another CPU to pick it up, find and
-	 * kick an idle CPU.
+	 * kick an idle cid.
 	 */
 	if (enq_flags & SCX_ENQ_REENQ) {
-		s32 cpu;
+		s32 cid;
 
 		scx_bpf_dsq_insert(p, SHARED_DSQ, 0, enq_flags);
-		cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);
-		if (cpu >= 0)
-			scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
+		cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
+					      qa_idle_cids, 0);
+		if (cid < scx_bpf_nr_cids())
+			scx_bpf_kick_cid(cid, SCX_KICK_IDLE);
 		return;
 	}
 
@@ -483,7 +483,8 @@ static void update_core_sched_head_seq(struct task_struct *p)
 static bool dispatch_highpri(bool from_timer)
 {
 	struct task_struct *p;
-	s32 this_cpu = bpf_get_smp_processor_id();
+	s32 this_cid = scx_bpf_this_cid();
+	u32 nr_cids = scx_bpf_nr_cids();
 
 	/* scan SHARED_DSQ and move highpri tasks to HIGHPRI_DSQ */
 	bpf_for_each(scx_dsq, p, SHARED_DSQ, 0) {
@@ -502,21 +503,29 @@ static bool dispatch_highpri(bool from_timer)
 	}
 
 	/*
-	 * Scan HIGHPRI_DSQ and dispatch until a task that can run on this CPU
-	 * is found.
+	 * Scan HIGHPRI_DSQ and dispatch until a task that can run here is
+	 * found. Prefer this_cid if the task allows it; otherwise RR-scan the
+	 * task's cpus_allowed starting after this_cid.
 	 */
 	bpf_for_each(scx_dsq, p, HIGHPRI_DSQ, 0) {
+		task_ctx_t *taskc;
 		bool dispatched = false;
-		s32 cpu;
+		s32 cid;
+
+		if (!(taskc = lookup_task_ctx(p)))
+			return false;
 
-		if (bpf_cpumask_test_cpu(this_cpu, p->cpus_ptr))
-			cpu = this_cpu;
+		if (cmask_test(&taskc->cpus_allowed, this_cid))
+			cid = this_cid;
 		else
-			cpu = scx_bpf_pick_any_cpu(p->cpus_ptr, 0);
+			cid = cmask_next_set_wrap(&taskc->cpus_allowed,
+						  this_cid + 1);
+		if (cid >= nr_cids)
+			continue;
 
-		if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p, SCX_DSQ_LOCAL_ON | cpu,
+		if (scx_bpf_dsq_move(BPF_FOR_EACH_ITER, p, SCX_DSQ_LOCAL_ON | cid,
 				     SCX_ENQ_PREEMPT)) {
-			if (cpu == this_cpu) {
+			if (cid == this_cid) {
 				dispatched = true;
 				__sync_fetch_and_add(&qa.nr_expedited_local, 1);
 			} else {
@@ -535,7 +544,7 @@ static bool dispatch_highpri(bool from_timer)
 	return false;
 }
 
-void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
+void BPF_STRUCT_OPS(qmap_dispatch, s32 cid, struct task_struct *prev)
 {
 	struct task_struct *p;
 	struct cpu_ctx __arena *cpuc;
@@ -563,7 +572,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 		}
 	}
 
-	cpuc = &qa.cpu_ctxs[bpf_get_smp_processor_id()];
+	cpuc = &qa.cpu_ctxs[scx_bpf_this_cid()];
 
 	for (i = 0; i < 5; i++) {
 		/* Advance the dispatch cursor and pick the fifo. */
@@ -628,8 +637,8 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 			 * document this class of issue -- other schedulers
 			 * seeing similar warnings can use this as a reference.
 			 */
-			if (!bpf_cpumask_test_cpu(cpu, p->cpus_ptr))
-				scx_bpf_kick_cpu(scx_bpf_task_cpu(p), 0);
+			if (!cmask_test(&taskc->cpus_allowed, cid))
+				scx_bpf_kick_cid(scx_bpf_task_cid(p), 0);
 
 			batch--;
 			cpuc->dsp_cnt--;
@@ -666,7 +675,7 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cpu, struct task_struct *prev)
 
 void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
 {
-	struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[bpf_get_smp_processor_id()];
+	struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[scx_bpf_this_cid()];
 	int idx;
 
 	/*
@@ -678,7 +687,7 @@ void BPF_STRUCT_OPS(qmap_tick, struct task_struct *p)
 	idx = weight_to_idx(cpuc->avg_weight);
 	cpuc->cpuperf_target = qidx_to_cpuperf_target[idx];
 
-	scx_bpf_cpuperf_set(scx_bpf_task_cpu(p), cpuc->cpuperf_target);
+	scx_bpf_cidperf_set(scx_bpf_task_cid(p), cpuc->cpuperf_target);
 }
 
 /*
@@ -826,9 +835,9 @@ void BPF_STRUCT_OPS(qmap_dump, struct scx_dump_ctx *dctx)
 	}
 }
 
-void BPF_STRUCT_OPS(qmap_dump_cpu, struct scx_dump_ctx *dctx, s32 cpu, bool idle)
+void BPF_STRUCT_OPS(qmap_dump_cid, struct scx_dump_ctx *dctx, s32 cid, bool idle)
 {
-	struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cpu];
+	struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cid];
 
 	if (suppress_dump || idle)
 		return;
@@ -879,46 +888,24 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
 			   cgrp->kn->id, period_us, quota_us, burst_us);
 }
 
-void BPF_STRUCT_OPS(qmap_update_idle, s32 cpu, bool idle)
+void BPF_STRUCT_OPS(qmap_update_idle, s32 cid, bool idle)
 {
-	s32 cid = scx_bpf_cpu_to_cid(cpu);
-
 	QMAP_TOUCH_ARENA();
-	if (cid < 0)
-		return;
 	if (idle)
 		cmask_set(qa_idle_cids, cid);
 	else
 		cmask_clear(qa_idle_cids, cid);
 }
 
-/*
- * The cpumask received here is kernel-address memory; walk it bit by bit
- * (bpf_cpumask_test_cpu handles the access), convert each set cpu to its
- * cid, and populate the arena-resident taskc cmask.
- */
-void BPF_STRUCT_OPS(qmap_set_cpumask, struct task_struct *p,
-		    const struct cpumask *cpumask)
+void BPF_STRUCT_OPS(qmap_set_cmask, struct task_struct *p,
+		    const struct scx_cmask *cmask)
 {
 	task_ctx_t *taskc;
-	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
-	s32 cpu;
 
 	taskc = lookup_task_ctx(p);
 	if (!taskc)
 		return;
-
-	cmask_zero(&taskc->cpus_allowed);
-
-	bpf_for(cpu, 0, nr_cpu_ids) {
-		s32 cid;
-
-		if (!bpf_cpumask_test_cpu(cpu, cpumask))
-			continue;
-		cid = scx_bpf_cpu_to_cid(cpu);
-		if (cid >= 0)
-			__cmask_set(&taskc->cpus_allowed, cid);
-	}
+	cmask_copy_from_kernel(&taskc->cpus_allowed, cmask);
 }
 
 struct monitor_timer {
@@ -933,59 +920,49 @@ struct {
 } monitor_timer SEC(".maps");
 
 /*
- * Print out the min, avg and max performance levels of CPUs every second to
- * demonstrate the cpuperf interface.
+ * Aggregate cidperf across the first nr_online_cids cids. Post-hotplug
+ * the first-N-are-online invariant drifts, so some cap/cur values may
+ * be stale. For this demo monitor that's fine; the scheduler exits on
+ * the enable-time hotplug_seq mismatch and userspace restarts, which
+ * rebuilds the layout.
  */
 static void monitor_cpuperf(void)
 {
-	u32 nr_cpu_ids;
+	u32 nr_online = scx_bpf_nr_online_cids();
 	u64 cap_sum = 0, cur_sum = 0, cur_min = SCX_CPUPERF_ONE, cur_max = 0;
 	u64 target_sum = 0, target_min = SCX_CPUPERF_ONE, target_max = 0;
-	const struct cpumask *online;
-	int i, nr_online_cpus = 0;
-
-	nr_cpu_ids = scx_bpf_nr_cpu_ids();
-	online = scx_bpf_get_online_cpumask();
-
-	bpf_for(i, 0, nr_cpu_ids) {
-		struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[i];
-		u32 cap, cur;
+	s32 cid;
 
-		if (!bpf_cpumask_test_cpu(i, online))
-			continue;
-		nr_online_cpus++;
+	QMAP_TOUCH_ARENA();
 
-		/* collect the capacity and current cpuperf */
-		cap = scx_bpf_cpuperf_cap(i);
-		cur = scx_bpf_cpuperf_cur(i);
+	bpf_for(cid, 0, nr_online) {
+		struct cpu_ctx __arena *cpuc = &qa.cpu_ctxs[cid];
+		u32 cap = scx_bpf_cidperf_cap(cid);
+		u32 cur = scx_bpf_cidperf_cur(cid);
+		u32 target;
 
 		cur_min = cur < cur_min ? cur : cur_min;
 		cur_max = cur > cur_max ? cur : cur_max;
 
-		/*
-		 * $cur is relative to $cap. Scale it down accordingly so that
-		 * it's in the same scale as other CPUs and $cur_sum/$cap_sum
-		 * makes sense.
-		 */
-		cur_sum += cur * cap / SCX_CPUPERF_ONE;
+		cur_sum += (u64)cur * cap / SCX_CPUPERF_ONE;
 		cap_sum += cap;
 
-		/* collect target */
-		cur = cpuc->cpuperf_target;
-		target_sum += cur;
-		target_min = cur < target_min ? cur : target_min;
-		target_max = cur > target_max ? cur : target_max;
+		target = cpuc->cpuperf_target;
+		target_sum += target;
+		target_min = target < target_min ? target : target_min;
+		target_max = target > target_max ? target : target_max;
 	}
 
+	if (!nr_online || !cap_sum)
+		return;
+
 	qa.cpuperf_min = cur_min;
 	qa.cpuperf_avg = cur_sum * SCX_CPUPERF_ONE / cap_sum;
 	qa.cpuperf_max = cur_max;
 
 	qa.cpuperf_target_min = target_min;
-	qa.cpuperf_target_avg = target_sum / nr_online_cpus;
+	qa.cpuperf_target_avg = target_sum / nr_online;
 	qa.cpuperf_target_max = target_max;
-
-	scx_bpf_put_cpumask(online);
 }
 
 /*
@@ -1191,20 +1168,20 @@ void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args)
 	}
 }
 
-SCX_OPS_DEFINE(qmap_ops,
+SCX_OPS_CID_DEFINE(qmap_ops,
 	       .flags			= SCX_OPS_ENQ_EXITING | SCX_OPS_TID_TO_TASK,
-	       .select_cpu		= (void *)qmap_select_cpu,
+	       .select_cid		= (void *)qmap_select_cid,
 	       .enqueue			= (void *)qmap_enqueue,
 	       .dequeue			= (void *)qmap_dequeue,
 	       .dispatch		= (void *)qmap_dispatch,
 	       .tick			= (void *)qmap_tick,
 	       .core_sched_before	= (void *)qmap_core_sched_before,
-	       .set_cpumask		= (void *)qmap_set_cpumask,
+	       .set_cmask		= (void *)qmap_set_cmask,
 	       .update_idle		= (void *)qmap_update_idle,
 	       .init_task		= (void *)qmap_init_task,
 	       .exit_task		= (void *)qmap_exit_task,
 	       .dump			= (void *)qmap_dump,
-	       .dump_cpu		= (void *)qmap_dump_cpu,
+	       .dump_cid		= (void *)qmap_dump_cid,
 	       .dump_task		= (void *)qmap_dump_task,
 	       .cgroup_init		= (void *)qmap_cgroup_init,
 	       .cgroup_set_weight	= (void *)qmap_cgroup_set_weight,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 99408b1bb1ec..2cc10fd36bec 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -73,6 +73,14 @@ int main(int argc, char **argv)
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
 	signal(SIGTERM, sigint_handler);
+
+	if (libbpf_num_possible_cpus() > SCX_QMAP_MAX_CPUS) {
+		fprintf(stderr,
+			"scx_qmap: %d possible CPUs exceeds compile-time cap %d; "
+			"rebuild with larger SCX_QMAP_MAX_CPUS\n",
+			libbpf_num_possible_cpus(), SCX_QMAP_MAX_CPUS);
+		return 1;
+	}
 restart:
 	optind = 1;
 	skel = SCX_OPS_OPEN(qmap_ops, scx_qmap);
@@ -162,9 +170,9 @@ int main(int argc, char **argv)
 		long nr_enqueued = qa->nr_enqueued;
 		long nr_dispatched = qa->nr_dispatched;
 
-		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq/cpu0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
+		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq/cid0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
-		       qa->nr_reenqueued, qa->nr_reenqueued_cpu0,
+		       qa->nr_reenqueued, qa->nr_reenqueued_cid0,
 		       qa->nr_dequeued,
 		       qa->nr_core_sched_execed,
 		       qa->nr_ddsp_from_enq);
@@ -173,7 +181,7 @@ int main(int argc, char **argv)
 		       qa->nr_expedited_remote,
 		       qa->nr_expedited_from_timer,
 		       qa->nr_expedited_lost);
-		if (__COMPAT_has_ksym("scx_bpf_cpuperf_cur"))
+		if (__COMPAT_has_ksym("scx_bpf_cidperf_cur"))
 			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
 			       qa->cpuperf_min,
 			       qa->cpuperf_avg,
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 9d9af2ad90c6..d15a705d5ac5 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -45,7 +45,7 @@ struct qmap_fifo {
 
 struct qmap_arena {
 	/* userspace-visible stats */
-	__u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cpu0;
+	__u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cid0;
 	__u64 nr_dequeued, nr_ddsp_from_enq;
 	__u64 nr_core_sched_execed;
 	__u64 nr_expedited_local, nr_expedited_remote;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread
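A note on the averaging math in the qmap timer hunk above: each cid's cur value is relative to its own cap, so the loop scales every term down before summing and scales the quotient back up for the average. A standalone C sketch of that arithmetic (toy code, not the kernel source; CPUPERF_ONE here is an assumed stand-in for SCX_CPUPERF_ONE's 0..1024 scale):

```c
#include <assert.h>
#include <stdint.h>

/* assumed toy scale; the kernel's SCX_CPUPERF_ONE plays the same role */
#define CPUPERF_ONE 1024u

/*
 * Capacity-weighted cpuperf average as accumulated in the timer hunk
 * above: each cur[i] is relative to its own cap[i], so scale each term
 * down by CPUPERF_ONE before summing, then scale the quotient back up.
 */
uint32_t cpuperf_weighted_avg(const uint32_t *cur, const uint32_t *cap, int n)
{
	uint64_t cur_sum = 0, cap_sum = 0;

	for (int i = 0; i < n; i++) {
		cur_sum += (uint64_t)cur[i] * cap[i] / CPUPERF_ONE;
		cap_sum += cap[i];
	}
	return cap_sum ? (uint32_t)(cur_sum * CPUPERF_ONE / cap_sum) : 0;
}
```

The widening to u64 before the multiply mirrors the (u64) cast added in the hunk: two 10-bit-scale u32 values can't overflow 64 bits, but cur * cap can exceed 32 bits on big little-endian caps. The !cap_sum guard corresponds to the early return added before the averages.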

* [PATCH 16/16] sched_ext: Require cid-form struct_ops for sub-sched support
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (14 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 15/16] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
@ 2026-04-21  7:19 ` Tejun Heo
  2026-04-21 18:18 ` [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Cheng-Yang Chou
  16 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21  7:19 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel, Tejun Heo

Sub-scheduler support is tied to the cid-form struct_ops: sub_attach /
sub_detach will communicate allocation via cmask, and the hierarchy assumes
all participants share a single topological cid space. A cpu-form root that
accepts sub-scheds would need cpu <-> cid translation on every cross-sched
interaction, defeating the purpose.

Enforce this at validate_ops():
- A sub-scheduler (scx_parent(sch) non-NULL) must be cid-form.
- A root that exposes sub_attach / sub_detach must be cid-form.

scx_qmap, which is currently the only scheduler demoing sub-sched support,
was converted to cid-form in the preceding patch, so this doesn't cause
breakage.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 02bdd393bbe4..71d6a2b39e64 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -6753,6 +6753,23 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 	if (!sch->is_cid_type && (ops->cpu_acquire || ops->cpu_release))
 		pr_warn("ops->cpu_acquire/release() are deprecated, use sched_switch TP instead\n");
 
+	/*
+	 * Sub-scheduler support is tied to the cid-form struct_ops. A sub-sched
+	 * attaches through a cid-form-only interface (sub_attach/sub_detach),
+	 * and a root that accepts sub-scheds must expose cid-form state to
+	 * them. Reject cpu-form schedulers on either side.
+	 */
+	if (!sch->is_cid_type) {
+		if (scx_parent(sch)) {
+			scx_error(sch, "sub-sched requires cid-form struct_ops");
+			return -EINVAL;
+		}
+		if (ops->sub_attach || ops->sub_detach) {
+			scx_error(sch, "sub_attach/sub_detach requires cid-form struct_ops");
+			return -EINVAL;
+		}
+	}
+
 	return 0;
 }
 
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 28+ messages in thread
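For readers tracking why this patch insists on one form: the "single topological cid space" the hierarchy assumes can be modeled with a small standalone toy (illustrative only; the real walk lives in scx_cid_init() and intersects cpumasks level by level rather than comparing tuples):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* per-cpu topology tuple; SMT siblings share a core id */
struct toy_topo { int node, llc, core; };

/*
 * Pack dense, topology-ordered cids by walking node -> LLC -> core, so
 * cpus sharing any level end up in one contiguous cid range and a
 * topology unit becomes a (start, length) slice of cid space.
 */
void toy_cid_build(const struct toy_topo *topo, int nr_cpus,
		   int16_t *cid_to_cpu, int16_t *cpu_to_cid)
{
	int next_cid = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++)
		cpu_to_cid[cpu] = -1;

	/* brute-force ordered scan; fine for a toy with tiny id spaces */
	for (int node = 0; node < nr_cpus; node++)
		for (int llc = 0; llc < nr_cpus; llc++)
			for (int core = 0; core < nr_cpus; core++)
				for (int cpu = 0; cpu < nr_cpus; cpu++) {
					if (cpu_to_cid[cpu] >= 0 ||
					    topo[cpu].node != node ||
					    topo[cpu].llc != llc ||
					    topo[cpu].core != core)
						continue;
					cpu_to_cid[cpu] = next_cid;
					cid_to_cpu[next_cid++] = cpu;
				}
}
```

With an x86-style enumeration where the SMT siblings of cpus 0-3 show up as cpus 4-7, siblings land on adjacent cids even though their cpu ids sit far apart, which is exactly what makes (start, length) slices meaningful across sub-scheds.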

* Re: [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it
  2026-04-21  7:19 ` [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
@ 2026-04-21 13:31   ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 13:31 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:30PM -1000, Tejun Heo wrote:
> Rename the static ext.c helper and declare it in ext_internal.h so
> ext_idle.c and the upcoming cid code can call it directly instead of
> relying on build_policy.c textual inclusion.
> 
> Pure rename and visibility change.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h
  2026-04-21  7:19 ` [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
@ 2026-04-21 13:36   ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 13:36 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:31PM -1000, Tejun Heo wrote:
> Things shared across multiple .c files belong in a header. scx_exit() /
> scx_error() (and their scx_vexit() / scx_verror() siblings) are already
> called from ext_idle.c and the upcoming ext_cid.c, and it was only
> build_policy.c's textual inclusion of ext.c that made the references
> resolve. Move the whole family to ext_internal.h.
> 
> Pure visibility change.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu()
  2026-04-21  7:19 ` [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
@ 2026-04-21 13:49   ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 13:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:32PM -1000, Tejun Heo wrote:
> Callers that already know the cpu is valid shouldn't have to pay for a
> redundant check. scx_kick_cpu() is called from the in-kernel balance loop
> break-out path with the current cpu (trivially valid) and from
> scx_bpf_kick_cpu() with a BPF-supplied cpu that does need validation. Move
> the check out of scx_kick_cpu() into scx_bpf_kick_cpu() so the backend is
> reusable by callers that have already validated.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 5571f5995dd8..9e802d73f205 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -8673,9 +8673,6 @@ static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
>  	struct rq *this_rq;
>  	unsigned long irq_flags;
>  
> -	if (!scx_cpu_valid(sch, cpu, NULL))
> -		return;
> -
>  	local_irq_save(irq_flags);

I initially thought that removing the guard here would leave a gap, but
patch 10's scx_bpf_kick_cid covers it, so


Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

>  
>  	this_rq = this_rq();
> @@ -8738,7 +8735,7 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
>  
>  	guard(rcu)();
>  	sch = scx_prog_sched(aux);
> -	if (likely(sch))
> +	if (likely(sch) && scx_cpu_valid(sch, cpu, NULL))
>  		scx_kick_cpu(sch, cpu, flags);
>  }
>  
> -- 
> 2.53.0
> 
> 

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops
  2026-04-21  7:19 ` [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
@ 2026-04-21 13:58   ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 13:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:33PM -1000, Tejun Heo wrote:
> cpu_acquire and cpu_release are deprecated and slated for removal. Move
> their declarations to the end of struct sched_ext_ops so an upcoming
> cid-form struct (sched_ext_ops_cid) can omit them entirely without
> disturbing the offsets of the shared fields.
> 
> Switch the two SCX_HAS_OP() callers for these ops to direct field checks
> since the relocated ops sit outside the SCX_OPI_END range covered by the
> has_op bitmap.
> 
> scx_kf_allow_flags[] auto-sizes to the highest used SCX_OP_IDX, so
> SCX_OP_IDX(cpu_release) moving to a higher index just enlarges the
> sparse array; the lookup logic is unchanged.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>

Looks good.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

> ---
>  kernel/sched/ext.c          |  4 +--
>  kernel/sched/ext_internal.h | 54 ++++++++++++++++++++++---------------
>  2 files changed, 34 insertions(+), 24 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 9e802d73f205..74e4271e44e9 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2813,7 +2813,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
>  		 * core. This callback complements ->cpu_release(), which is
>  		 * emitted in switch_class().
>  		 */
> -		if (SCX_HAS_OP(sch, cpu_acquire))
> +		if (sch->ops.cpu_acquire)
>  			SCX_CALL_OP(sch, cpu_acquire, rq, cpu, NULL);
>  		rq->scx.cpu_released = false;
>  	}
> @@ -2959,7 +2959,7 @@ static void switch_class(struct rq *rq, struct task_struct *next)
>  	 *  next time that balance_one() is invoked.
>  	 */
>  	if (!rq->scx.cpu_released) {
> -		if (SCX_HAS_OP(sch, cpu_release)) {
> +		if (sch->ops.cpu_release) {
>  			struct scx_cpu_release_args args = {
>  				.reason = preempt_reason_from_class(next_class),
>  				.task = next,
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index 350b84876b2a..1d73fcc19aaf 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -555,28 +555,6 @@ struct sched_ext_ops {
>  	 */
>  	void (*update_idle)(s32 cpu, bool idle);
>  
> -	/**
> -	 * @cpu_acquire: A CPU is becoming available to the BPF scheduler
> -	 * @cpu: The CPU being acquired by the BPF scheduler.
> -	 * @args: Acquire arguments, see the struct definition.
> -	 *
> -	 * A CPU that was previously released from the BPF scheduler is now once
> -	 * again under its control.
> -	 */
> -	void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
> -
> -	/**
> -	 * @cpu_release: A CPU is taken away from the BPF scheduler
> -	 * @cpu: The CPU being released by the BPF scheduler.
> -	 * @args: Release arguments, see the struct definition.
> -	 *
> -	 * The specified CPU is no longer under the control of the BPF
> -	 * scheduler. This could be because it was preempted by a higher
> -	 * priority sched_class, though there may be other reasons as well. The
> -	 * caller should consult @args->reason to determine the cause.
> -	 */
> -	void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
> -
>  	/**
>  	 * @init_task: Initialize a task to run in a BPF scheduler
>  	 * @p: task to initialize for BPF scheduling
> @@ -867,6 +845,38 @@ struct sched_ext_ops {
>  
>  	/* internal use only, must be NULL */
>  	void __rcu *priv;
> +
> +	/*
> +	 * Deprecated callbacks. Kept at the end of the struct so the cid-form
> +	 * struct (sched_ext_ops_cid) can omit them without affecting the
> +	 * shared field offsets. Use SCX_ENQ_IMMED instead. Sitting past
> +	 * SCX_OPI_END means has_op doesn't cover them, so SCX_HAS_OP() cannot
> +	 * be used; callers must test sch->ops.cpu_acquire / cpu_release
> +	 * directly.
> +	 */
> +
> +	/**
> +	 * @cpu_acquire: A CPU is becoming available to the BPF scheduler
> +	 * @cpu: The CPU being acquired by the BPF scheduler.
> +	 * @args: Acquire arguments, see the struct definition.
> +	 *
> +	 * A CPU that was previously released from the BPF scheduler is now once
> +	 * again under its control. Deprecated; use SCX_ENQ_IMMED instead.
> +	 */
> +	void (*cpu_acquire)(s32 cpu, struct scx_cpu_acquire_args *args);
> +
> +	/**
> +	 * @cpu_release: A CPU is taken away from the BPF scheduler
> +	 * @cpu: The CPU being released by the BPF scheduler.
> +	 * @args: Release arguments, see the struct definition.
> +	 *
> +	 * The specified CPU is no longer under the control of the BPF
> +	 * scheduler. This could be because it was preempted by a higher
> +	 * priority sched_class, though there may be other reasons as well. The
> +	 * caller should consult @args->reason to determine the cause.
> +	 * Deprecated; use SCX_ENQ_IMMED instead.
> +	 */
> +	void (*cpu_release)(s32 cpu, struct scx_cpu_release_args *args);
>  };
>  
>  enum scx_opi {
> -- 
> 2.53.0
> 
> 

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread
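The offset contract this patch sets up (shared fields identical up through the last common member, divergent members only at the tail) can be illustrated with a toy pair of structs. The names below are made up, and the kernel enforces this with BUILD_BUG_ON() at runtime-free compile time rather than _Static_assert:

```c
#include <assert.h>
#include <stddef.h>

/* toy stand-ins, not the kernel's sched_ext_ops / sched_ext_ops_cid */
struct toy_ops_cpu {
	void (*enqueue)(void *task);
	void *priv;			/* last shared field */
	/* deprecated members live past the shared prefix */
	void (*cpu_release)(int cpu);
};

struct toy_ops_cid {
	void (*enqueue)(void *task);
	void *priv;			/* offsets match up through here */
};

/*
 * Because the variants diverge only after @priv, every shared member has
 * the same offset in both, so a union view can access either through one
 * layout without function-pointer casts.
 */
_Static_assert(offsetof(struct toy_ops_cpu, enqueue) ==
	       offsetof(struct toy_ops_cid, enqueue),
	       "enqueue must line up");
_Static_assert(offsetof(struct toy_ops_cpu, priv) ==
	       offsetof(struct toy_ops_cid, priv),
	       "priv must line up");
```

Moving cpu_acquire/cpu_release below the shared prefix is what lets the upcoming cid-form struct omit them while keeping these offset equalities intact.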

* Re: [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd
  2026-04-21  7:19 ` [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
@ 2026-04-21 14:25   ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:34PM -1000, Tejun Heo wrote:
> Pass struct scx_enable_cmd to scx_enable() rather than unpacking @ops
> at every call site and re-packing into a fresh cmd inside. bpf_scx_reg()
> now builds the cmd on its stack and hands it in; scx_enable() just
> wires up the kthread work and waits.

Verified stack lifetime.

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

> 
> Relocate struct scx_enable_cmd above scx_alloc_and_add_sched() so
> upcoming patches that also want the cmd can see it.
> 
> No behavior change.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 74e4271e44e9..62aab432dbf4 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -6424,6 +6424,19 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
>  	return pnode;
>  }
>  
> +/*
> + * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
> + * starvation. During the READY -> ENABLED task switching loop, the calling
> + * thread's sched_class gets switched from fair to ext. As fair has higher
> + * priority than ext, the calling thread can be indefinitely starved under
> + * fair-class saturation, leading to a system hang.
> + */
> +struct scx_enable_cmd {
> +	struct kthread_work	work;
> +	struct sched_ext_ops	*ops;
> +	int			ret;
> +};
> +
>  /*
>   * Allocate and initialize a new scx_sched. @cgrp's reference is always
>   * consumed whether the function succeeds or fails.
> @@ -6655,19 +6668,6 @@ static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
>  	return 0;
>  }
>  
> -/*
> - * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
> - * starvation. During the READY -> ENABLED task switching loop, the calling
> - * thread's sched_class gets switched from fair to ext. As fair has higher
> - * priority than ext, the calling thread can be indefinitely starved under
> - * fair-class saturation, leading to a system hang.
> - */
> -struct scx_enable_cmd {
> -	struct kthread_work	work;
> -	struct sched_ext_ops	*ops;
> -	int			ret;
> -};
> -
>  static void scx_root_enable_workfn(struct kthread_work *work)
>  {
>  	struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
> @@ -7243,11 +7243,10 @@ static s32 __init scx_cgroup_lifetime_notifier_init(void)
>  core_initcall(scx_cgroup_lifetime_notifier_init);
>  #endif	/* CONFIG_EXT_SUB_SCHED */
>  
> -static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
> +static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
>  {
>  	static struct kthread_worker *helper;
>  	static DEFINE_MUTEX(helper_mutex);
> -	struct scx_enable_cmd cmd;
>  
>  	if (!cpumask_equal(housekeeping_cpumask(HK_TYPE_DOMAIN),
>  			   cpu_possible_mask)) {
> @@ -7271,16 +7270,15 @@ static s32 scx_enable(struct sched_ext_ops *ops, struct bpf_link *link)
>  	}
>  
>  #ifdef CONFIG_EXT_SUB_SCHED
> -	if (ops->sub_cgroup_id > 1)
> -		kthread_init_work(&cmd.work, scx_sub_enable_workfn);
> +	if (cmd->ops->sub_cgroup_id > 1)
> +		kthread_init_work(&cmd->work, scx_sub_enable_workfn);
>  	else
>  #endif	/* CONFIG_EXT_SUB_SCHED */
> -		kthread_init_work(&cmd.work, scx_root_enable_workfn);
> -	cmd.ops = ops;
> +		kthread_init_work(&cmd->work, scx_root_enable_workfn);
>  
> -	kthread_queue_work(READ_ONCE(helper), &cmd.work);
> -	kthread_flush_work(&cmd.work);
> -	return cmd.ret;
> +	kthread_queue_work(READ_ONCE(helper), &cmd->work);
> +	kthread_flush_work(&cmd->work);
> +	return cmd->ret;
>  }
>  
>  
> @@ -7452,7 +7450,9 @@ static int bpf_scx_check_member(const struct btf_type *t,
>  
>  static int bpf_scx_reg(void *kdata, struct bpf_link *link)
>  {
> -	return scx_enable(kdata, link);
> +	struct scx_enable_cmd cmd = { .ops = kdata };
> +
> +	return scx_enable(&cmd, link);
>  }
>  
>  static void bpf_scx_unreg(void *kdata, struct bpf_link *link)
> -- 
> 2.53.0
> 
> 

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 sched_ext/for-7.2] sched_ext: Add topological CPU IDs (cids)
  2026-04-21  7:19 ` [PATCH 06/16] sched_ext: Add topological CPU IDs (cids) Tejun Heo
@ 2026-04-21 17:15   ` Tejun Heo
  0 siblings, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21 17:15 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Cheng-Yang Chou


Raw cpu numbers are clumsy for sharding and cross-sched communication,
especially from BPF. The space is sparse, numerical closeness doesn't
track topological closeness (x86 hyperthreading often scatters SMT
siblings), and a range of cpu ids doesn't describe anything meaningful.
Sub-sched support makes this acute: cpu allocation, revocation, and
state constantly flow across sub-scheds. Passing whole cpumasks scales
poorly (every op scans 4K bits) and cpumasks are awkward in BPF.

cids assign every cpu a dense, topology-ordered id. CPUs sharing a core,
LLC, or NUMA node occupy contiguous cid ranges, so a topology unit
becomes a (start, length) slice. Communication passes slices; BPF can
process a u64 word of cids at a time.

Build the mapping once at root enable by walking online cpus node -> LLC
-> core. Possible-but-not-online cpus tail the space with no-topo cids.
Expose kfuncs to map cpu <-> cid in either direction and to query each
cid's topology metadata.

v2: Use kzalloc_objs()/kmalloc_objs() for the three allocs in
    scx_cid_arrays_alloc() (Cheng-Yang Chou).

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/build_policy.c              |    1 
 kernel/sched/ext.c                       |   17 +
 kernel/sched/ext_cid.c                   |  301 +++++++++++++++++++++++++++++++
 kernel/sched/ext_cid.h                   |  147 +++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |    3 
 5 files changed, 469 insertions(+)

--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -60,6 +60,7 @@
 #ifdef CONFIG_SCHED_CLASS_EXT
 # include "ext_internal.h"
 # include "ext.c"
+# include "ext_cid.c"
 # include "ext_idle.c"
 #endif
 
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -7,6 +7,7 @@
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
  */
 #include <linux/btf_ids.h>
+#include "ext_cid.h"
 #include "ext_idle.h"
 
 static DEFINE_RAW_SPINLOCK(scx_sched_lock);
@@ -6727,6 +6728,16 @@ static void scx_root_enable_workfn(struc
 	cpus_read_lock();
 
 	/*
+	 * Build the cid mapping before publishing scx_root. The cid kfuncs
+	 * dereference the cid arrays unconditionally once scx_prog_sched()
+	 * returns non-NULL; the rcu_assign_pointer() below pairs with their
+	 * rcu_dereference() to make the populated arrays visible.
+	 */
+	ret = scx_cid_init(sch);
+	if (ret)
+		goto err_disable;
+
+	/*
 	 * Make the scheduler instance visible. Must be inside cpus_read_lock().
 	 * See handle_hotplug().
 	 */
@@ -9774,6 +9785,12 @@ static int __init scx_init(void)
 		return ret;
 	}
 
+	ret = scx_cid_kfunc_init();
+	if (ret) {
+		pr_err("sched_ext: Failed to register cid kfuncs (%d)\n", ret);
+		return ret;
+	}
+
 	ret = register_bpf_struct_ops(&bpf_sched_ext_ops, sched_ext_ops);
 	if (ret) {
 		pr_err("sched_ext: Failed to register struct_ops (%d)\n", ret);
--- /dev/null
+++ b/kernel/sched/ext_cid.c
@@ -0,0 +1,301 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/cacheinfo.h>
+
+#include "ext_cid.h"
+
+s16 *scx_cid_to_cpu_tbl;
+s16 *scx_cpu_to_cid_tbl;
+struct scx_cid_topo *scx_cid_topo;
+
+#define SCX_CID_TOPO_NEG	(struct scx_cid_topo) {				\
+	.core_cid = -1, .core_idx = -1, .llc_cid = -1, .llc_idx = -1,		\
+	.node_cid = -1, .node_idx = -1,						\
+}
+
+/*
+ * Return @cpu's LLC shared_cpu_map. If cacheinfo isn't populated (offline or
+ * !present), record @cpu in @fallbacks and return its node mask instead - the
+ * worst that can happen is that the cpu's LLC becomes coarser than reality.
+ */
+static const struct cpumask *cpu_llc_mask(int cpu, struct cpumask *fallbacks)
+{
+	struct cpu_cacheinfo *ci = get_cpu_cacheinfo(cpu);
+
+	if (!ci || !ci->info_list || !ci->num_leaves) {
+		cpumask_set_cpu(cpu, fallbacks);
+		return cpumask_of_node(cpu_to_node(cpu));
+	}
+	return &ci->info_list[ci->num_leaves - 1].shared_cpu_map;
+}
+
+/*
+ * The cid arrays are sized by num_possible_cpus() / nr_cpu_ids which are fixed
+ * at boot, so allocate once on first enable and never free. Callers can
+ * dereference these unconditionally as long as scx_root is non-NULL
+ * (rcu_assign_pointer publishes scx_root after scx_cid_init() returns - see
+ * scx_root_enable()).
+ */
+static s32 scx_cid_arrays_alloc(void)
+{
+	u32 npossible = num_possible_cpus();
+	s16 *cid_to_cpu, *cpu_to_cid;
+	struct scx_cid_topo *cid_topo;
+
+	if (scx_cid_to_cpu_tbl)
+		return 0;
+
+	cid_to_cpu = kzalloc_objs(*scx_cid_to_cpu_tbl, npossible, GFP_KERNEL);
+	cpu_to_cid = kzalloc_objs(*scx_cpu_to_cid_tbl, nr_cpu_ids, GFP_KERNEL);
+	cid_topo = kmalloc_objs(*scx_cid_topo, npossible, GFP_KERNEL);
+
+	if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+		kfree(cid_to_cpu);
+		kfree(cpu_to_cid);
+		kfree(cid_topo);
+		return -ENOMEM;
+	}
+
+	scx_cid_to_cpu_tbl = cid_to_cpu;
+	scx_cpu_to_cid_tbl = cpu_to_cid;
+	scx_cid_topo = cid_topo;
+	return 0;
+}
+
+/**
+ * scx_cid_init - build the cid mapping
+ * @sch: the scx_sched being initialized; used as the scx_error() target
+ *
+ * See "Topological CPU IDs" in ext_cid.h for the model. Walk online cpus by
+ * intersection at each level (parent_scratch & this_level_mask), which keeps
+ * containment correct by construction and naturally splits a physical LLC
+ * straddling two NUMA nodes into two LLC units. The caller must hold
+ * cpus_read_lock.
+ */
+s32 scx_cid_init(struct scx_sched *sch)
+{
+	cpumask_var_t to_walk __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t node_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t llc_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t core_scratch __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t llc_fallback __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	cpumask_var_t online_no_topo __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	u32 next_cid = 0;
+	s32 next_node_idx = 0, next_llc_idx = 0, next_core_idx = 0;
+	s32 cpu, ret;
+
+	/* s16 keeps the per-cid arrays compact; widen if NR_CPUS ever grows */
+	BUILD_BUG_ON(NR_CPUS > S16_MAX);
+
+	lockdep_assert_cpus_held();
+
+	ret = scx_cid_arrays_alloc();
+	if (ret)
+		return ret;
+
+	if (!zalloc_cpumask_var(&to_walk, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&node_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&llc_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&core_scratch, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&llc_fallback, GFP_KERNEL) ||
+	    !zalloc_cpumask_var(&online_no_topo, GFP_KERNEL))
+		return -ENOMEM;
+
+	/* -1 sentinels for sparse-possible cpu id holes (0 is a valid cid) */
+	for (cpu = 0; cpu < nr_cpu_ids; cpu++)
+		scx_cpu_to_cid_tbl[cpu] = -1;
+
+	cpumask_copy(to_walk, cpu_online_mask);
+
+	while (!cpumask_empty(to_walk)) {
+		s32 next_cpu = cpumask_first(to_walk);
+		s32 nid = cpu_to_node(next_cpu);
+		s32 node_cid = next_cid;
+		s32 node_idx;
+
+		/*
+		 * No NUMA info: skip and let the tail loop assign a no-topo
+		 * cid. cpumask_of_node(-1) is undefined.
+		 */
+		if (nid < 0) {
+			cpumask_clear_cpu(next_cpu, to_walk);
+			continue;
+		}
+
+		node_idx = next_node_idx++;
+
+		/* node_scratch = to_walk & this node */
+		cpumask_and(node_scratch, to_walk, cpumask_of_node(nid));
+		if (WARN_ON_ONCE(!cpumask_test_cpu(next_cpu, node_scratch)))
+			return -EINVAL;
+
+		while (!cpumask_empty(node_scratch)) {
+			s32 ncpu = cpumask_first(node_scratch);
+			const struct cpumask *llc_mask = cpu_llc_mask(ncpu, llc_fallback);
+			s32 llc_cid = next_cid;
+			s32 llc_idx = next_llc_idx++;
+
+			/* llc_scratch = node_scratch & this llc */
+			cpumask_and(llc_scratch, node_scratch, llc_mask);
+			if (WARN_ON_ONCE(!cpumask_test_cpu(ncpu, llc_scratch)))
+				return -EINVAL;
+
+			while (!cpumask_empty(llc_scratch)) {
+				s32 lcpu = cpumask_first(llc_scratch);
+				const struct cpumask *sib = topology_sibling_cpumask(lcpu);
+				s32 core_cid = next_cid;
+				s32 core_idx = next_core_idx++;
+				s32 ccpu;
+
+				/* core_scratch = llc_scratch & this core */
+				cpumask_and(core_scratch, llc_scratch, sib);
+				if (WARN_ON_ONCE(!cpumask_test_cpu(lcpu, core_scratch)))
+					return -EINVAL;
+
+				for_each_cpu(ccpu, core_scratch) {
+					s32 cid = next_cid++;
+
+					scx_cid_to_cpu_tbl[cid] = ccpu;
+					scx_cpu_to_cid_tbl[ccpu] = cid;
+					scx_cid_topo[cid] = (struct scx_cid_topo){
+						.core_cid = core_cid,
+						.core_idx = core_idx,
+						.llc_cid = llc_cid,
+						.llc_idx = llc_idx,
+						.node_cid = node_cid,
+						.node_idx = node_idx,
+					};
+
+					cpumask_clear_cpu(ccpu, llc_scratch);
+					cpumask_clear_cpu(ccpu, node_scratch);
+					cpumask_clear_cpu(ccpu, to_walk);
+				}
+			}
+		}
+	}
+
+	/*
+	 * No-topo section: any possible cpu without a cid - normally just the
+	 * not-online ones. Collect any currently-online cpus that land here in
+	 * @online_no_topo so we can warn about them at the end.
+	 */
+	for_each_cpu(cpu, cpu_possible_mask) {
+		s32 cid;
+
+		if (__scx_cpu_to_cid(cpu) != -1)
+			continue;
+		if (cpu_online(cpu))
+			cpumask_set_cpu(cpu, online_no_topo);
+
+		cid = next_cid++;
+		scx_cid_to_cpu_tbl[cid] = cpu;
+		scx_cpu_to_cid_tbl[cpu] = cid;
+		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+	}
+
+	if (!cpumask_empty(llc_fallback))
+		pr_warn("scx_cid: cpus without cacheinfo, using node mask as llc: %*pbl\n",
+			cpumask_pr_args(llc_fallback));
+	if (!cpumask_empty(online_no_topo))
+		pr_warn("scx_cid: online cpus with no usable topology: %*pbl\n",
+			cpumask_pr_args(online_no_topo));
+
+	return 0;
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_cid_to_cpu - Return the raw CPU id for @cid
+ * @cid: cid to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the raw CPU id for @cid. Trigger scx_error() and return -EINVAL if
+ * @cid is invalid. The cid<->cpu mapping is static for the lifetime of the
+ * loaded scheduler, so the BPF side can cache the result to avoid repeated
+ * kfunc invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cid_to_cpu(s32 cid, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return -EINVAL;
+	return scx_cid_to_cpu(sch, cid);
+}
+
+/**
+ * scx_bpf_cpu_to_cid - Return the cid for @cpu
+ * @cpu: cpu to look up
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Return the cid for @cpu. Trigger scx_error() and return -EINVAL if @cpu is
+ * invalid. The cid<->cpu mapping is static for the lifetime of the loaded
+ * scheduler, so the BPF side can cache the result to avoid repeated kfunc
+ * invocations.
+ */
+__bpf_kfunc s32 scx_bpf_cpu_to_cid(s32 cpu, const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return -EINVAL;
+	return scx_cpu_to_cid(sch, cpu);
+}
+
+/**
+ * scx_bpf_cid_topo - Copy out per-cid topology info
+ * @cid: cid to look up
+ * @out__uninit: where to copy the topology info; fully written by this call
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Fill @out__uninit with the topology info for @cid. Trigger scx_error() if
+ * @cid is out of range. If @cid is valid but in the no-topo section, all fields
+ * are set to -1.
+ */
+__bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
+				  const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch) || !cid_valid(sch, cid)) {
+		*out__uninit = SCX_CID_TOPO_NEG;
+		return;
+	}
+
+	*out__uninit = scx_cid_topo[cid];
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(scx_kfunc_ids_cid)
+BTF_ID_FLAGS(func, scx_bpf_cid_to_cpu, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cpu_to_cid, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_cid_topo, KF_IMPLICIT_ARGS)
+BTF_KFUNCS_END(scx_kfunc_ids_cid)
+
+static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
+	.owner	= THIS_MODULE,
+	.set	= &scx_kfunc_ids_cid,
+};
+
+int scx_cid_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
+		register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
+		register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
+}
--- /dev/null
+++ b/kernel/sched/ext_cid.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Topological CPU IDs (cids)
+ * --------------------------
+ *
+ * Raw cpu numbers are clumsy for sharding work and communication across
+ * topology units, especially from BPF: the space can be sparse, numerical
+ * closeness doesn't imply topological closeness (x86 hyperthreading often puts
+ * SMT siblings far apart), and a cpu id range carries no topological meaning.
+ * Sub-scheds make this acute - cpu allocation, revocation and other state are
+ * constantly communicated across sub-scheds, and passing whole cpumasks scales
+ * poorly with cpu count. cpumasks are also awkward in BPF: a variable-length
+ * kernel type sized for the maximum NR_CPUS (4k), with verbose helper sequences
+ * for every op.
+ *
+ * cids give every cpu a dense, topology-ordered id. CPUs sharing a core, LLC or
+ * NUMA node get contiguous cid ranges, so a topology unit becomes a (start,
+ * length) slice of cid space. Communication can pass a slice instead of a
+ * cpumask, and BPF code can process, for example, a u64 word's worth of cids at
+ * a time.
+ *
+ * The mapping is built once at root scheduler enable time by walking the
+ * topology of online cpus only. Going by online cpus is out of necessity:
+ * depending on the arch, topology info isn't reliably available for offline
+ * cpus. The expected usage model is restarting the scheduler on hotplug events
+ * so the mapping is rebuilt against the new online set. A scheduler that wants
+ * to handle hotplug without a restart can provide its own cid and shard mapping
+ * through the override interface.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_CID_H
+#define _KERNEL_SCHED_EXT_CID_H
+
+struct scx_sched;
+
+/*
+ * Per-cid topology info. For each topology level (core, LLC, node), records the
+ * first cid in the unit and its global index. Global indices are consecutive
+ * integers assigned in cid-walk order, so e.g. core_idx ranges over [0,
+ * nr_cores_at_init) with no gaps. No-topo cids have all fields set to -1.
+ *
+ * @core_cid:	first cid of this cid's core (smt-sibling group)
+ * @core_idx:	global index of that core, in [0, nr_cores_at_init)
+ * @llc_cid:	first cid of this cid's LLC
+ * @llc_idx:	global index of that LLC, in [0, nr_llcs_at_init)
+ * @node_cid:	first cid of this cid's NUMA node
+ * @node_idx:	global index of that node, in [0, nr_nodes_at_init)
+ */
+struct scx_cid_topo {
+	s32 core_cid;
+	s32 core_idx;
+	s32 llc_cid;
+	s32 llc_idx;
+	s32 node_cid;
+	s32 node_idx;
+};
+
+/*
+ * Cid space (total is always num_possible_cpus()) is laid out with
+ * topology-annotated cids first, then no-topo cids at the tail. The
+ * topology-annotated block covers the cpus that were online when scx_cid_init()
+ * ran and remains valid even after those cpus go offline. The tail block covers
+ * possible-but-not-online cpus and carries all-(-1) topo info (see
+ * scx_cid_topo); callers detect it via the -1 sentinels.
+ */
+extern s16 *scx_cid_to_cpu_tbl;
+extern s16 *scx_cpu_to_cid_tbl;
+extern struct scx_cid_topo *scx_cid_topo;
+
+s32 scx_cid_init(struct scx_sched *sch);
+int scx_cid_kfunc_init(void);
+
+/**
+ * cid_valid - Verify a cid value, to be used on ops input args
+ * @sch: scx_sched to abort on error
+ * @cid: cid which came from a BPF ops
+ *
+ * Return true if @cid is in [0, num_possible_cpus()). On failure, trigger
+ * scx_error() and return false.
+ */
+static inline bool cid_valid(struct scx_sched *sch, s32 cid)
+{
+	if (likely(cid >= 0 && cid < num_possible_cpus()))
+		return true;
+	scx_error(sch, "invalid cid %d", cid);
+	return false;
+}
+
+/**
+ * __scx_cid_to_cpu - Unchecked cid->cpu table lookup
+ * @cid: cid to look up. Must be in [0, num_possible_cpus()).
+ *
+ * Intended for callsites that have already validated @cid (or otherwise know
+ * it's valid).
+ */
+static inline s32 __scx_cid_to_cpu(s32 cid)
+{
+	return scx_cid_to_cpu_tbl[cid];
+}
+
+/**
+ * __scx_cpu_to_cid - Unchecked cpu->cid table lookup
+ * @cpu: cpu to look up. Must be a valid possible cpu id.
+ *
+ * Intended for callsites that have already validated @cpu (or know it must be
+ * valid by construction, e.g. task_cpu() or smp_processor_id()).
+ */
+static inline s32 __scx_cpu_to_cid(s32 cpu)
+{
+	return scx_cpu_to_cid_tbl[cpu];
+}
+
+/**
+ * scx_cid_to_cpu - Translate @cid to its cpu
+ * @sch: scx_sched for error reporting
+ * @cid: cid to look up
+ *
+ * Return the cpu for @cid or a negative errno on failure. Invalid cid triggers
+ * scx_error() on @sch. The cid arrays are allocated on first scheduler enable
+ * and never freed, so the returned cpu is stable for the lifetime of the loaded
+ * scheduler.
+ */
+static inline s32 scx_cid_to_cpu(struct scx_sched *sch, s32 cid)
+{
+	if (!cid_valid(sch, cid))
+		return -EINVAL;
+	return __scx_cid_to_cpu(cid);
+}
+
+/**
+ * scx_cpu_to_cid - Translate @cpu to its cid
+ * @sch: scx_sched for error reporting
+ * @cpu: cpu to look up
+ *
+ * Return the cid for @cpu or a negative errno on failure. Invalid cpu triggers
+ * scx_error() on @sch. Same lifetime guarantee as scx_cid_to_cpu().
+ */
+static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
+{
+	if (!scx_cpu_valid(sch, cpu, NULL))
+		return -EINVAL;
+	return __scx_cpu_to_cid(cpu);
+}
+
+#endif /* _KERNEL_SCHED_EXT_CID_H */
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -102,6 +102,9 @@ struct task_struct *scx_bpf_cpu_curr(s32
 struct task_struct *scx_bpf_tid_to_task(u64 tid) __ksym __weak;
 u64 scx_bpf_now(void) __ksym __weak;
 void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __weak;
+s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
+s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
+void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
 
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from


* Re: [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space
  2026-04-21  7:19 ` [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
@ 2026-04-21 17:30   ` Cheng-Yang Chou
  2026-04-21 23:21   ` [PATCH v2] " Tejun Heo
  1 sibling, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 17:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:38PM -1000, Tejun Heo wrote:
> Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
> space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
> most of its bits for a small window and is awkward in BPF.
> 
> scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
> 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
> cmasks therefore address bits[] against the same global windows, so
> cross-cmask word ops reduce to
> 
> 	dest->bits[i] OP= operand->bits[i - delta]
> 
> with no bit-shifting, at the cost of up to one extra storage word for
> head misalignment. This alignment guarantee is the reason binary ops
> can stay word-level; every mutating helper preserves it.
> 
> Binary ops are op(dest, operand) and only touch the intersection. Single-
> bit ops follow kernel bitops convention: bare = atomic, __-prefixed =
> non-atomic. Bulk and find ops are non-atomic.
> 
> Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/
> cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and
> adds the extra helpers that basic idle-cpu selection needs.
> 
> No callers yet.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
>  kernel/sched/ext_cid.c                   | 139 ++++++
>  kernel/sched/ext_cid.h                   | 169 +++++++
>  tools/sched_ext/include/scx/cid.bpf.h    | 595 +++++++++++++++++++++++
>  tools/sched_ext/include/scx/common.bpf.h |   1 +
>  4 files changed, 904 insertions(+)
>  create mode 100644 tools/sched_ext/include/scx/cid.bpf.h
> 
> diff --git a/kernel/sched/ext_cid.c b/kernel/sched/ext_cid.c
> index 4ee727d27c78..c8b7cdaf82d5 100644
> --- a/kernel/sched/ext_cid.c
> +++ b/kernel/sched/ext_cid.c
> @@ -365,6 +365,145 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
>  	.set	= &scx_kfunc_ids_cid,
>  };
>  
> +/*
> + * cmask bulk ops. See ext_cid.h for the layout and semantics: binary ops only
> + * touch the intersection of dest and operand ranges; dest bits outside the
> + * intersection, and dest head/tail padding, are left untouched. The 64-cid grid
> + * alignment of bits[] makes the word-to-word correspondence trivial.
> + */
> +enum {
> +	CMASK_OP_AND,
> +	CMASK_OP_OR,
> +	CMASK_OP_COPY,
> +};
> +
> +void scx_cmask_zero(struct scx_cmask *m)
> +{
> +	memset(m->bits, 0, SCX_CMASK_NR_WORDS(m->nr_bits) * sizeof(u64));
> +}
> +
> +/*
> + * Apply @op to one word - dest[@di] = (dest[@di] & ~@mask) | (op(...) & @mask).
> + * Only bits in @mask within the word are touched.
> + */
> +static void cmask_op_word(struct scx_cmask *dest, const struct scx_cmask *operand,
> +			  u32 di, u32 oi, u64 mask, int op)
> +{
> +	u64 dv = dest->bits[di];
> +	u64 ov = operand->bits[oi];
> +	u64 rv;
> +
> +	switch (op) {
> +	case CMASK_OP_AND:
> +		rv = dv & ov;
> +		break;
> +	case CMASK_OP_OR:
> +		rv = dv | ov;
> +		break;
> +	case CMASK_OP_COPY:
> +		rv = ov;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	dest->bits[di] = (dv & ~mask) | (rv & mask);
> +}
> +
> +static void cmask_op(struct scx_cmask *dest, const struct scx_cmask *operand, int op)
> +{
> +	u32 lo = max(dest->base, operand->base);
> +	u32 hi = min(dest->base + dest->nr_bits,
> +		     operand->base + operand->nr_bits);
> +	u32 d_base = dest->base / 64;
> +	u32 o_base = operand->base / 64;
> +	u32 lo_word, hi_word, w;
> +	u64 head_mask, tail_mask;
> +
> +	if (lo >= hi)
> +		return;
> +
> +	lo_word = lo / 64;
> +	hi_word = (hi - 1) / 64;
> +	head_mask = GENMASK_U64(63, lo & 63);
> +	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
> +
> +	/* intersection fits in a single word - apply both head and tail */
> +	if (lo_word == hi_word) {
> +		cmask_op_word(dest, operand, lo_word - d_base, lo_word - o_base,
> +			      head_mask & tail_mask, op);
> +		return;
> +	}
> +
> +	/* first word: head mask */
> +	cmask_op_word(dest, operand, lo_word - d_base, lo_word - o_base, head_mask, op);
> +
> +	/* interior words: unmasked */
> +	for (w = lo_word + 1; w < hi_word; w++)
> +		cmask_op_word(dest, operand, w - d_base, w - o_base,
> +			      GENMASK_U64(63, 0), op);
> +
> +	/* last word: tail mask */
> +	cmask_op_word(dest, operand, hi_word - d_base, hi_word - o_base, tail_mask, op);
> +}
> +
> +/*
> + * scx_cmask_and/or/copy only modify @dest bits that lie in the intersection
> + * of [@dest->base, @dest->base + @dest->nr_bits) and [@operand->base,
> + * @operand->base + @operand->nr_bits). Bits in @dest outside that window keep
> + * their prior values - in particular, scx_cmask_copy() does NOT zero @dest
> + * bits that lie outside @operand's range.
> + */
> +void scx_cmask_and(struct scx_cmask *dest, const struct scx_cmask *operand)
> +{
> +	cmask_op(dest, operand, CMASK_OP_AND);
> +}
> +
> +void scx_cmask_or(struct scx_cmask *dest, const struct scx_cmask *operand)
> +{
> +	cmask_op(dest, operand, CMASK_OP_OR);
> +}
> +
> +void scx_cmask_copy(struct scx_cmask *dest, const struct scx_cmask *operand)
> +{
> +	cmask_op(dest, operand, CMASK_OP_COPY);
> +}
> +
> +/**
> + * scx_cmask_next_set - find the first set bit at or after @cid
> + * @m: cmask to search
> + * @cid: starting cid (clamped to @m->base if below)
> + *
> + * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
> + * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
> + * termination condition used by scx_cmask_for_each_set()).
> + */
> +u32 scx_cmask_next_set(const struct scx_cmask *m, u32 cid)
> +{
> +	u32 end = m->base + m->nr_bits;
> +	u32 base = m->base / 64;
> +	u32 last_wi = (end - 1) / 64 - base;

Nit: scx_cmask_next_set() relies on the "cid >= end" early return to keep
the last_wi computation from underflowing (e.g. for an empty, 64-aligned
mask) - maybe worth a comment?

> +	u32 wi;
> +	u64 word;
> +
> +	if (cid < m->base)
> +		cid = m->base;
> +	if (cid >= end)
> +		return end;
> +
> +	wi = cid / 64 - base;
> +	word = m->bits[wi] & GENMASK_U64(63, cid & 63);
> +
> +	while (!word) {
> +		if (++wi > last_wi)
> +			return end;
> +		word = m->bits[wi];
> +	}
> +
> +	cid = (base + wi) * 64 + __ffs64(word);
> +	return cid < end ? cid : end;
> +}
> +
>  int scx_cid_kfunc_init(void)
>  {
>  	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init) ?:
> diff --git a/kernel/sched/ext_cid.h b/kernel/sched/ext_cid.h
> index 19848fa9e8fc..46f03f2150c2 100644
> --- a/kernel/sched/ext_cid.h
> +++ b/kernel/sched/ext_cid.h
> @@ -145,4 +145,173 @@ static inline s32 scx_cpu_to_cid(struct scx_sched *sch, s32 cpu)
>  	return __scx_cpu_to_cid(cpu);
>  }
>  
> +/*
> + * cmask: variable-length, base-windowed bitmap over cid space
> + * -----------------------------------------------------------
> + *
> + * A cmask covers the cid range [base, base + nr_bits). bits[] is aligned to the
> + * global 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64), so the
> + * first (base & 63) bits of bits[0] are head padding and any tail past base +
> + * nr_bits is tail padding. Both must stay zero for the lifetime of the mask;
> + * all mutating helpers preserve that invariant.
> + *
> + * Grid alignment means two cmasks always address bits[] against the same global
> + * 64-cid windows, so cross-cmask word ops (AND, OR, ...) reduce to
> + *
> + *	dest->bits[i] OP= operand->bits[i - delta]
> + *
> + * with no bit-shifting, regardless of how the two bases relate mod 64.
> + *
> + * Binary ops take the form op(dest, operand) and only touch the intersection of
> + * the two ranges on dest; dest bits outside the intersection are left
> + * unchanged. Single-bit ops follow kernel bitops conventions: the bare name is
> + * atomic, the __-prefixed variant is non-atomic. Bulk ops are non-atomic.
> + *
> + * Single-bit ops use atomic64_*() rather than set_bit()/clear_bit() so the u64
> + * storage is addressed consistently across 64-bit and 32-bit-LE kernels
> + * (set_bit() addresses as unsigned long[], which diverges from u64 on
> + * 32-bit-BE). If test_and_set/test_and_clear codegen on x86 matters - they fall
> + * to a LOCK CMPXCHG loop here vs a single LOCK BTS/BTR with the bitops family -
> + * those two can be ifdef'd to the bitops primitives under BITS_PER_LONG == 64.
> + */
> +struct scx_cmask {
> +	u32 base;
> +	u32 nr_bits;
> +	DECLARE_FLEX_ARRAY(u64, bits);
> +};
> +
> +/*
> + * Number of u64 words of bits[] storage that covers @nr_bits regardless of base
> + * alignment. The +1 absorbs up to 63 bits of head padding when base is not
> + * 64-aligned - always allocating one extra word beats branching on base or
> + * splitting the compute.
> + */
> +#define SCX_CMASK_NR_WORDS(nr_bits)	(((nr_bits) + 63) / 64 + 1)
> +
> +/*
> + * Define an on-stack cmask for up to @cap_bits. @name is a struct scx_cmask *
> + * aliasing zero-initialized storage; call scx_cmask_init() to set base/nr_bits.
> + */
> +#define SCX_CMASK_DEFINE(name, cap_bits)	\
> +	DEFINE_RAW_FLEX(struct scx_cmask, name, bits, SCX_CMASK_NR_WORDS(cap_bits))
> +
> +static inline bool __scx_cmask_contains(const struct scx_cmask *m, u32 cid)
> +{
> +	return likely(cid >= m->base && cid < m->base + m->nr_bits);
> +}
> +
> +/* Word in bits[] covering @cid. @cid must satisfy __scx_cmask_contains(). */
> +static inline u64 *__scx_cmask_word(const struct scx_cmask *m, u32 cid)
> +{
> +	return (u64 *)&m->bits[cid / 64 - m->base / 64];
> +}
> +
> +static inline void scx_cmask_init(struct scx_cmask *m, u32 base, u32 nr_bits)
> +{
> +	m->base = base;
> +	m->nr_bits = nr_bits;
> +	memset(m->bits, 0, SCX_CMASK_NR_WORDS(nr_bits) * sizeof(u64));
> +}
> +
> +static inline bool scx_cmask_test(const struct scx_cmask *m, u32 cid)
> +{
> +	if (!__scx_cmask_contains(m, cid))
> +		return false;
> +	return READ_ONCE(*__scx_cmask_word(m, cid)) & BIT_U64(cid & 63);
> +}
> +
> +static inline void scx_cmask_set(struct scx_cmask *m, u32 cid)
> +{
> +	if (!__scx_cmask_contains(m, cid))
> +		return;
> +	atomic64_or(BIT_U64(cid & 63), (atomic64_t *)__scx_cmask_word(m, cid));
> +}
> +
> +static inline void scx_cmask_clear(struct scx_cmask *m, u32 cid)
> +{
> +	if (!__scx_cmask_contains(m, cid))
> +		return;
> +	atomic64_and(~BIT_U64(cid & 63), (atomic64_t *)__scx_cmask_word(m, cid));
> +}
> +
> +/*
> + * test_and_set/test_and_clear use atomic64_fetch_or/and which lower to a LOCK
> + * CMPXCHG loop on x86 (vs a single LOCK BTS/BTR with test_and_set_bit). If this
> + * ever matters, these two can be ifdef'd to the bitops primitives under
> + * BITS_PER_LONG == 64.
> + */
> +static inline bool scx_cmask_test_and_set(struct scx_cmask *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +
> +	if (!__scx_cmask_contains(m, cid))
> +		return false;
> +	return atomic64_fetch_or(bit, (atomic64_t *)__scx_cmask_word(m, cid)) & bit;
> +}
> +
> +static inline bool scx_cmask_test_and_clear(struct scx_cmask *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +
> +	if (!__scx_cmask_contains(m, cid))
> +		return false;
> +	return atomic64_fetch_and(~bit, (atomic64_t *)__scx_cmask_word(m, cid)) & bit;
> +}
> +
> +static inline void __scx_cmask_set(struct scx_cmask *m, u32 cid)
> +{
> +	if (!__scx_cmask_contains(m, cid))
> +		return;
> +	*__scx_cmask_word(m, cid) |= BIT_U64(cid & 63);
> +}
> +
> +static inline void __scx_cmask_clear(struct scx_cmask *m, u32 cid)
> +{
> +	if (!__scx_cmask_contains(m, cid))
> +		return;
> +	*__scx_cmask_word(m, cid) &= ~BIT_U64(cid & 63);
> +}
> +
> +static inline bool __scx_cmask_test_and_set(struct scx_cmask *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +	u64 *w, prev;
> +
> +	if (!__scx_cmask_contains(m, cid))
> +		return false;
> +	w = __scx_cmask_word(m, cid);
> +	prev = *w & bit;
> +	*w |= bit;
> +	return prev;
> +}
> +
> +static inline bool __scx_cmask_test_and_clear(struct scx_cmask *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +	u64 *w, prev;
> +
> +	if (!__scx_cmask_contains(m, cid))
> +		return false;
> +	w = __scx_cmask_word(m, cid);
> +	prev = *w & bit;
> +	*w &= ~bit;
> +	return prev;
> +}
> +
> +void scx_cmask_zero(struct scx_cmask *m);
> +void scx_cmask_copy(struct scx_cmask *dest, const struct scx_cmask *operand);
> +void scx_cmask_and(struct scx_cmask *dest, const struct scx_cmask *operand);
> +void scx_cmask_or(struct scx_cmask *dest, const struct scx_cmask *operand);
> +u32  scx_cmask_next_set(const struct scx_cmask *m, u32 cid);
> +
> +static inline u32 scx_cmask_first_set(const struct scx_cmask *m)
> +{
> +	return scx_cmask_next_set(m, m->base);
> +}
> +
> +#define scx_cmask_for_each_set(cid, m)						\
> +	for ((cid) = scx_cmask_first_set(m);					\
> +	     (cid) < (m)->base + (m)->nr_bits;					\
> +	     (cid) = scx_cmask_next_set((m), (cid) + 1))
> +
>  #endif /* _KERNEL_SCHED_EXT_CID_H */
> diff --git a/tools/sched_ext/include/scx/cid.bpf.h b/tools/sched_ext/include/scx/cid.bpf.h
> new file mode 100644
> index 000000000000..a0d7beb62384
> --- /dev/null
> +++ b/tools/sched_ext/include/scx/cid.bpf.h
> @@ -0,0 +1,595 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * BPF-side helpers for cids and cmasks. See kernel/sched/ext_cid.h for the
> + * authoritative layout and semantics. The BPF-side helpers use the cmask_*
> + * naming (no scx_ prefix); cmask is the SCX bitmap type so the prefix is
> + * redundant in BPF code. Atomics use __sync_val_compare_and_swap and every
> + * helper is inline (no .c counterpart).
> + *
> + * Included by scx/common.bpf.h; don't include directly.
> + *
> + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
> + */
> +#ifndef __SCX_CID_BPF_H
> +#define __SCX_CID_BPF_H
> +
> +#include "bpf_arena_common.bpf.h"
> +
> +#ifndef BIT_U64
> +#define BIT_U64(nr)		(1ULL << (nr))
> +#endif
> +#ifndef GENMASK_U64
> +#define GENMASK_U64(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))
> +#endif
> +
> +/*
> + * Storage cap for bounded loops over bits[]. Sized to cover NR_CPUS=8192 with
> + * one extra word for head-misalignment. Increase if deployment targets larger
> + * NR_CPUS.
> + */
> +#ifndef CMASK_MAX_WORDS
> +#define CMASK_MAX_WORDS 129
> +#endif
> +
> +#define CMASK_NR_WORDS(nr_bits)		(((nr_bits) + 63) / 64 + 1)
> +
> +static __always_inline bool __cmask_contains(const struct scx_cmask __arena *m, u32 cid)
> +{
> +	return cid >= m->base && cid < m->base + m->nr_bits;
> +}
> +
> +static __always_inline u64 __arena *__cmask_word(const struct scx_cmask __arena *m, u32 cid)
> +{
> +	return (u64 __arena *)&m->bits[cid / 64 - m->base / 64];
> +}
> +
> +static __always_inline void cmask_init(struct scx_cmask __arena *m, u32 base, u32 nr_bits)
> +{
> +	u32 nr_words = CMASK_NR_WORDS(nr_bits), i;
> +
> +	m->base = base;
> +	m->nr_bits = nr_bits;
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		if (i >= nr_words)
> +			break;
> +		m->bits[i] = 0;
> +	}
> +}
> +
> +static __always_inline bool cmask_test(const struct scx_cmask __arena *m, u32 cid)
> +{
> +	if (!__cmask_contains(m, cid))
> +		return false;
> +	return *__cmask_word(m, cid) & BIT_U64(cid & 63);
> +}
> +
> +/*
> + * x86 BPF JIT rejects BPF_OR | BPF_FETCH and BPF_AND | BPF_FETCH on arena
> + * pointers (see bpf_jit_supports_insn() in arch/x86/net/bpf_jit_comp.c). Only
> + * BPF_CMPXCHG / BPF_XCHG / BPF_ADD with FETCH are allowed. Implement
> + * test_and_{set,clear} and the atomic set/clear via a cmpxchg loop.
> + *
> + * CMASK_CAS_TRIES is far above what any non-pathological contention needs.
> + * Exhausting it means the bit update was lost, which corrupts the caller's view
> + * of the bitmap, so raise scx_bpf_error() to abort the scheduler.
> + */
> +#define CMASK_CAS_TRIES		1024
> +
> +static __always_inline void cmask_set(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 __arena *w;
> +	u64 bit, old, new;
> +	u32 i;
> +
> +	if (!__cmask_contains(m, cid))
> +		return;
> +	w = __cmask_word(m, cid);
> +	bit = BIT_U64(cid & 63);
> +	bpf_for(i, 0, CMASK_CAS_TRIES) {
> +		old = *w;
> +		if (old & bit)
> +			return;
> +		new = old | bit;
> +		if (__sync_val_compare_and_swap(w, old, new) == old)
> +			return;
> +	}
> +	scx_bpf_error("cmask_set CAS exhausted at cid %u", cid);
> +}
> +
> +static __always_inline void cmask_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 __arena *w;
> +	u64 bit, old, new;
> +	u32 i;
> +
> +	if (!__cmask_contains(m, cid))
> +		return;
> +	w = __cmask_word(m, cid);
> +	bit = BIT_U64(cid & 63);
> +	bpf_for(i, 0, CMASK_CAS_TRIES) {
> +		old = *w;
> +		if (!(old & bit))
> +			return;
> +		new = old & ~bit;
> +		if (__sync_val_compare_and_swap(w, old, new) == old)
> +			return;
> +	}
> +	scx_bpf_error("cmask_clear CAS exhausted at cid %u", cid);
> +}
> +
> +static __always_inline bool cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 __arena *w;
> +	u64 bit, old, new;
> +	u32 i;
> +
> +	if (!__cmask_contains(m, cid))
> +		return false;
> +	w = __cmask_word(m, cid);
> +	bit = BIT_U64(cid & 63);
> +	bpf_for(i, 0, CMASK_CAS_TRIES) {
> +		old = *w;
> +		if (old & bit)
> +			return true;
> +		new = old | bit;
> +		if (__sync_val_compare_and_swap(w, old, new) == old)
> +			return false;
> +	}
> +	scx_bpf_error("cmask_test_and_set CAS exhausted at cid %u", cid);
> +	return false;
> +}
> +
> +static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 __arena *w;
> +	u64 bit, old, new;
> +	u32 i;
> +
> +	if (!__cmask_contains(m, cid))
> +		return false;
> +	w = __cmask_word(m, cid);
> +	bit = BIT_U64(cid & 63);
> +	bpf_for(i, 0, CMASK_CAS_TRIES) {
> +		old = *w;
> +		if (!(old & bit))
> +			return false;
> +		new = old & ~bit;
> +		if (__sync_val_compare_and_swap(w, old, new) == old)
> +			return true;
> +	}
> +	scx_bpf_error("cmask_test_and_clear CAS exhausted at cid %u", cid);
> +	return false;
> +}
> +
> +static __always_inline void __cmask_set(struct scx_cmask __arena *m, u32 cid)
> +{
> +	if (!__cmask_contains(m, cid))
> +		return;
> +	*__cmask_word(m, cid) |= BIT_U64(cid & 63);
> +}
> +
> +static __always_inline void __cmask_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> +	if (!__cmask_contains(m, cid))
> +		return;
> +	*__cmask_word(m, cid) &= ~BIT_U64(cid & 63);
> +}
> +
> +static __always_inline bool __cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +	u64 __arena *w;
> +	u64 prev;
> +
> +	if (!__cmask_contains(m, cid))
> +		return false;
> +	w = __cmask_word(m, cid);
> +	prev = *w & bit;
> +	*w |= bit;
> +	return prev;
> +}
> +
> +static __always_inline bool __cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
> +{
> +	u64 bit = BIT_U64(cid & 63);
> +	u64 __arena *w;
> +	u64 prev;
> +
> +	if (!__cmask_contains(m, cid))
> +		return false;
> +	w = __cmask_word(m, cid);
> +	prev = *w & bit;
> +	*w &= ~bit;
> +	return prev;
> +}
> +
> +static __always_inline void cmask_zero(struct scx_cmask __arena *m)
> +{
> +	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		if (i >= nr_words)
> +			break;
> +		m->bits[i] = 0;
> +	}
> +}
> +
> +/*
> + * BPF_-prefixed to avoid colliding with the kernel's anonymous CMASK_OP_*
> + * enum in ext_cid.c, which is exported via BTF and reachable through
> + * vmlinux.h.
> + */
> +enum {
> +	BPF_CMASK_OP_AND,
> +	BPF_CMASK_OP_OR,
> +	BPF_CMASK_OP_COPY,
> +};
> +
> +static __always_inline void cmask_op_word(struct scx_cmask __arena *dest,
> +					  const struct scx_cmask __arena *operand,
> +					  u32 di, u32 oi, u64 mask, int op)
> +{
> +	u64 dv = dest->bits[di];
> +	u64 ov = operand->bits[oi];
> +	u64 rv;
> +
> +	if (op == BPF_CMASK_OP_AND)
> +		rv = dv & ov;
> +	else if (op == BPF_CMASK_OP_OR)
> +		rv = dv | ov;
> +	else
> +		rv = ov;
> +
> +	dest->bits[di] = (dv & ~mask) | (rv & mask);
> +}
> +
> +static __always_inline void cmask_op(struct scx_cmask __arena *dest,
> +				     const struct scx_cmask __arena *operand, int op)
> +{
> +	u32 d_end = dest->base + dest->nr_bits;
> +	u32 o_end = operand->base + operand->nr_bits;
> +	u32 lo = dest->base > operand->base ? dest->base : operand->base;
> +	u32 hi = d_end < o_end ? d_end : o_end;
> +	u32 d_base = dest->base / 64;
> +	u32 o_base = operand->base / 64;
> +	u32 lo_word, hi_word, i;
> +	u64 head_mask, tail_mask;
> +
> +	if (lo >= hi)
> +		return;
> +
> +	lo_word = lo / 64;
> +	hi_word = (hi - 1) / 64;
> +	head_mask = GENMASK_U64(63, lo & 63);
> +	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		u32 w = lo_word + i;
> +		u64 m;
> +
> +		if (w > hi_word)
> +			break;
> +
> +		m = GENMASK_U64(63, 0);
> +		if (w == lo_word)
> +			m &= head_mask;
> +		if (w == hi_word)
> +			m &= tail_mask;
> +
> +		cmask_op_word(dest, operand, w - d_base, w - o_base, m, op);
> +	}
> +}
> +
> +/*
> + * cmask_and/or/copy only modify @dest bits that lie in the intersection of
> + * [@dest->base, @dest->base + @dest->nr_bits) and [@operand->base,
> + * @operand->base + @operand->nr_bits). Bits in @dest outside that window
> + * keep their prior values - in particular, cmask_copy() does NOT zero @dest
> + * bits that lie outside @operand's range.
> + */
> +static __always_inline void cmask_and(struct scx_cmask __arena *dest,
> +				      const struct scx_cmask __arena *operand)
> +{
> +	cmask_op(dest, operand, BPF_CMASK_OP_AND);
> +}
> +
> +static __always_inline void cmask_or(struct scx_cmask __arena *dest,
> +				     const struct scx_cmask __arena *operand)
> +{
> +	cmask_op(dest, operand, BPF_CMASK_OP_OR);
> +}
> +
> +static __always_inline void cmask_copy(struct scx_cmask __arena *dest,
> +				       const struct scx_cmask __arena *operand)
> +{
> +	cmask_op(dest, operand, BPF_CMASK_OP_COPY);
> +}
> +
> +/**
> + * cmask_next_set - find the first set bit at or after @cid
> + * @m: cmask to search
> + * @cid: starting cid (clamped to @m->base if below)
> + *
> + * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
> + * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
> + * termination condition used by cmask_for_each()).
> + */
> +static __always_inline u32 cmask_next_set(const struct scx_cmask __arena *m, u32 cid)
> +{
> +	u32 end = m->base + m->nr_bits;
> +	u32 base = m->base / 64;
> +	u32 last_wi = (end - 1) / 64 - base;
> +	u32 start_wi, start_bit, i;
> +
> +	if (cid < m->base)
> +		cid = m->base;
> +	if (cid >= end)
> +		return end;
> +
> +	start_wi = cid / 64 - base;
> +	start_bit = cid & 63;
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		u32 wi = start_wi + i;
> +		u64 word;
> +		u32 found;
> +
> +		if (wi > last_wi)
> +			break;
> +
> +		word = m->bits[wi];
> +		if (i == 0)
> +			word &= GENMASK_U64(63, start_bit);
> +		if (!word)
> +			continue;
> +
> +		found = (base + wi) * 64 + __builtin_ctzll(word);
> +		if (found >= end)
> +			return end;
> +		return found;
> +	}
> +	return end;
> +}
> +
> +static __always_inline u32 cmask_first_set(const struct scx_cmask __arena *m)
> +{
> +	return cmask_next_set(m, m->base);
> +}
> +
> +#define cmask_for_each(cid, m)							\
> +	for ((cid) = cmask_first_set(m);					\
> +	     (cid) < (m)->base + (m)->nr_bits;					\
> +	     (cid) = cmask_next_set((m), (cid) + 1))
> +
> +/*
> + * Population count over [base, base + nr_bits). Padding bits in the head/tail
> + * words are guaranteed zero by the mutating helpers, so a flat popcount over
> + * all words is correct.
> + */
> +static __always_inline u32 cmask_weight(const struct scx_cmask __arena *m)
> +{
> +	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
> +	u32 count = 0;
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		if (i >= nr_words)
> +			break;
> +		count += __builtin_popcountll(m->bits[i]);
> +	}
> +	return count;
> +}
> +
> +/*
> + * True if @a and @b share any set bit. Walk only the intersection of their
> + * ranges, matching the semantics of cmask_and().
> + */
> +static __always_inline bool cmask_intersects(const struct scx_cmask __arena *a,
> +					     const struct scx_cmask __arena *b)
> +{
> +	u32 a_end = a->base + a->nr_bits;
> +	u32 b_end = b->base + b->nr_bits;
> +	u32 lo = a->base > b->base ? a->base : b->base;
> +	u32 hi = a_end < b_end ? a_end : b_end;
> +	u32 a_base = a->base / 64;
> +	u32 b_base = b->base / 64;
> +	u32 lo_word, hi_word, i;
> +	u64 head_mask, tail_mask;
> +
> +	if (lo >= hi)
> +		return false;
> +
> +	lo_word = lo / 64;
> +	hi_word = (hi - 1) / 64;
> +	head_mask = GENMASK_U64(63, lo & 63);
> +	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		u32 w = lo_word + i;
> +		u64 mask, av, bv;
> +
> +		if (w > hi_word)
> +			break;
> +
> +		mask = GENMASK_U64(63, 0);
> +		if (w == lo_word)
> +			mask &= head_mask;
> +		if (w == hi_word)
> +			mask &= tail_mask;
> +
> +		av = a->bits[w - a_base] & mask;
> +		bv = b->bits[w - b_base] & mask;
> +		if (av & bv)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/*
> + * Find the next cid set in both @a and @b at or after @start, bounded by the
> + * intersection of the two ranges. Return a->base + a->nr_bits if none found.
> + *
> + * Building block for cmask_next_and_set_wrap(). Callers that want a bounded
> + * scan without wrap call this directly.
> + */
> +static __always_inline u32 cmask_next_and_set(const struct scx_cmask __arena *a,
> +					      const struct scx_cmask __arena *b,
> +					      u32 start)
> +{
> +	u32 a_end = a->base + a->nr_bits;
> +	u32 b_end = b->base + b->nr_bits;
> +	u32 a_wbase = a->base / 64;
> +	u32 b_wbase = b->base / 64;
> +	u32 lo = a->base > b->base ? a->base : b->base;
> +	u32 hi = a_end < b_end ? a_end : b_end;
> +	u32 last_wi, start_wi, start_bit, i;
> +
> +	if (lo >= hi)
> +		return a_end;
> +	if (start < lo)
> +		start = lo;
> +	if (start >= hi)
> +		return a_end;
> +
> +	last_wi = (hi - 1) / 64;
> +	start_wi = start / 64;
> +	start_bit = start & 63;
> +
> +	bpf_for(i, 0, CMASK_MAX_WORDS) {
> +		u32 abs_wi = start_wi + i;
> +		u64 word;
> +		u32 found;
> +
> +		if (abs_wi > last_wi)
> +			break;
> +
> +		word = a->bits[abs_wi - a_wbase] & b->bits[abs_wi - b_wbase];
> +		if (i == 0)
> +			word &= GENMASK_U64(63, start_bit);
> +		if (!word)
> +			continue;
> +
> +		found = abs_wi * 64 + __builtin_ctzll(word);
> +		if (found >= hi)
> +			return a_end;
> +		return found;
> +	}
> +	return a_end;
> +}
> +
> +/*
> + * Find the next set cid in @m at or after @start, wrapping to @m->base if no
> + * set bit is found in [start, m->base + m->nr_bits). Return m->base +
> + * m->nr_bits if @m is empty.
> + *
> + * Callers do round-robin distribution by passing (last_cid + 1) as @start.
> + */
> +static __always_inline u32 cmask_next_set_wrap(const struct scx_cmask __arena *m,
> +					       u32 start)
> +{
> +	u32 end = m->base + m->nr_bits;
> +	u32 found;
> +
> +	found = cmask_next_set(m, start);
> +	if (found < end || start <= m->base)
> +		return found;
> +
> +	found = cmask_next_set(m, m->base);
> +	return found < start ? found : end;
> +}
> +
> +/*
> + * Find the next cid set in both @a and @b at or after @start, wrapping to
> + * @a->base if none found in the forward half. Return a->base + a->nr_bits
> + * if the intersection is empty.
> + *
> + * Callers do round-robin distribution by passing (last_cid + 1) as @start.
> + */
> +static __always_inline u32 cmask_next_and_set_wrap(const struct scx_cmask __arena *a,
> +						   const struct scx_cmask __arena *b,
> +						   u32 start)
> +{
> +	u32 a_end = a->base + a->nr_bits;
> +	u32 found;
> +
> +	found = cmask_next_and_set(a, b, start);
> +	if (found < a_end || start <= a->base)
> +		return found;
> +
> +	found = cmask_next_and_set(a, b, a->base);
> +	return found < start ? found : a_end;
> +}
> +
> +/**
> + * cmask_from_cpumask - translate a kernel cpumask to a cid-space cmask
> + * @m: cmask to fill. Zeroed first; only bits within [@m->base, @m->base +
> + *     @m->nr_bits) are updated - cpus mapping to cids outside that range
> + *     are ignored.
> + * @cpumask: kernel cpumask to translate
> + *
> + * For each cpu in @cpumask, set the cpu's cid in @m. Caller must ensure
> + * @cpumask stays stable across the call (e.g. RCU read lock for
> + * task->cpus_ptr).
> + */
> +static __always_inline void cmask_from_cpumask(struct scx_cmask __arena *m,
> +					       const struct cpumask *cpumask)
> +{
> +	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
> +	s32 cpu;
> +
> +	cmask_zero(m);
> +	bpf_for(cpu, 0, nr_cpu_ids) {
> +		s32 cid;
> +
> +		if (!bpf_cpumask_test_cpu(cpu, cpumask))
> +			continue;
> +		cid = scx_bpf_cpu_to_cid(cpu);
> +		if (cid >= 0)
> +			__cmask_set(m, cid);
> +	}
> +}
> +
> +/**
> + * cmask_copy_from_kernel - copy a kernel-memory scx_cmask into an arena cmask
> + * @dst: arena cmask to fill. Must be sized for at least @src's bit count.
> + * @src: kernel-memory cmask (e.g. the @cmask arg delivered to ops.set_cmask()).
> + *       Kernel guarantees @src->base == 0.
> + *
> + * Probe the kernel header for nr_bits, zero @dst, then copy @src->bits[]
> + * word by word via bpf_probe_read_kernel. Call scx_bpf_error() on any probe
> + * failure. Intended for set_cmask callbacks where @src is kernel memory that
> + * BPF cmask helpers (which expect __arena pointers) can't touch directly.
> + */
> +static __always_inline void cmask_copy_from_kernel(struct scx_cmask __arena *dst,
> +						   const struct scx_cmask *src)
> +{
> +	u32 nr_bits = 0, nr_words, dst_nr_words, wi;
> +
> +	if (bpf_probe_read_kernel(&nr_bits, sizeof(nr_bits), &src->nr_bits)) {
> +		scx_bpf_error("probe-read cmask->nr_bits failed");
> +		return;
> +	}
> +
> +	nr_words = CMASK_NR_WORDS(nr_bits);
> +	dst_nr_words = CMASK_NR_WORDS(dst->nr_bits);
> +	if (nr_words > dst_nr_words) {
> +		scx_bpf_error("src cmask nr_bits=%u exceeds dst capacity",
> +			      nr_bits);
> +		return;
> +	}
> +
> +	cmask_zero(dst);
> +	bpf_for(wi, 0, CMASK_MAX_WORDS) {
> +		u64 word = 0;
> +		if (wi >= nr_words)
> +			break;
> +		if (bpf_probe_read_kernel(&word, sizeof(u64), &src->bits[wi])) {
> +			scx_bpf_error("probe-read cmask->bits[%u] failed", wi);
> +			return;
> +		}
> +		dst->bits[wi] = word;
> +	}
> +}
> +
> +#endif /* __SCX_CID_BPF_H */
> diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
> index 4bf959a8cd08..3e353dfafb46 100644
> --- a/tools/sched_ext/include/scx/common.bpf.h
> +++ b/tools/sched_ext/include/scx/common.bpf.h
> @@ -1055,5 +1055,6 @@ static inline u64 scx_clock_irq(u32 cpu)
>  
>  #include "compat.bpf.h"
>  #include "enums.bpf.h"
> +#include "cid.bpf.h"
>  
>  #endif	/* __SCX_COMMON_BPF_H */
> -- 
> 2.53.0
> 
> 

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
  2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
                   ` (15 preceding siblings ...)
  2026-04-21  7:19 ` [PATCH 16/16] sched_ext: Require cid-form struct_ops for sub-sched support Tejun Heo
@ 2026-04-21 18:18 ` Cheng-Yang Chou
  2026-04-21 18:33   ` Tejun Heo
  16 siblings, 1 reply; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-21 18:18 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Mon, Apr 20, 2026 at 09:19:29PM -1000, Tejun Heo wrote:
> Hello,
> 
> This patchset introduces topological CPU IDs (cids) - dense,
> topology-ordered cpu identifiers - and an alternative cid-form struct_ops
> type that lets BPF schedulers operate in cid space directly.
> 
> Key pieces:
> 
> - cid space: scx_cid_init() walks nodes * LLCs * cores * threads and packs
>   a dense cid mapping. The mapping can be overridden via
>   scx_bpf_cid_override(). See "Topological CPU IDs" in ext_cid.h for the
>   model.
> 
> - cmask: a base-windowed bitmap over cid space. Kernel and BPF helpers with
>   identical semantics. Used by scx_qmap for per-task affinity and idle-cid
>   tracking; meant to be the substrate for sub-sched cid allocation.
> 
> - bpf_sched_ext_ops_cid: a parallel struct_ops type whose callbacks take
>   cids/cmasks instead of cpus/cpumasks. Kernel translates at the boundary
>   via scx_cpu_arg() / scx_cpu_ret(); the two struct types share offsets up
>   through @priv (verified by BUILD_BUG_ON) so the union view in scx_sched
>   works without function-pointer casts. Sub-sched support is tied to
>   cid-form: validate_ops() rejects cpu-form sub-scheds and cpu-form roots
>   that expose sub_attach / sub_detach.
> 
> - cid-form kfuncs: scx_bpf_kick_cid, scx_bpf_cidperf_{cap,cur,set},
>   scx_bpf_cid_curr, scx_bpf_task_cid, scx_bpf_this_cid,
>   scx_bpf_nr_{cids,online_cids}, scx_bpf_cid_to_cpu, scx_bpf_cpu_to_cid.
>   A cid-form program may not call cpu-only kfuncs (enforced at verifier
>   load via scx_kfunc_context_filter); the reverse is intentionally
>   permissive to ease migration.
> 
> - scx_qmap port: scx_qmap is converted to cid-form. It uses the cmask-based
>   idle picker, per-task cid-space cpus_allowed, and cid-form kfuncs
>   throughout. Sub-sched dispatching via scx_bpf_sub_dispatch() continues to
>   work.

I have gone through the entire patchset, and it lgtm.
For the whole series:

Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

I have two questions regarding the current implementation:

1. Regarding the ext_cid feature (as with ext_idle), is it feasible to
   implement this within the BPF arena instead of the current approach?

2. I noticed rust/kernel/cpumask.rs is already in tree. While I understand
   that arch support for Rust is currently limited, would this be a good time
   to start adding Rust abstractions to maintain parity or reduce code
   duplication? (I discussed re-implementing ext_idle in Rust offline w/
   Andrea, and the same question now applies to ext_cid.)

-- 
Cheers,
Cheng-Yang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
  2026-04-21 18:18 ` [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Cheng-Yang Chou
@ 2026-04-21 18:33   ` Tejun Heo
  2026-04-22  1:23     ` Cheng-Yang Chou
  0 siblings, 1 reply; 28+ messages in thread
From: Tejun Heo @ 2026-04-21 18:33 UTC (permalink / raw)
  To: Cheng-Yang Chou
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hello, Cheng-Yang.

On Wed, Apr 22, 2026 at 02:18:56AM +0800, Cheng-Yang Chou wrote:
> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com>

Thanks, applied.

> 1. Regarding the ext_cid feature (same as ext_idle), is it feasible to
>    implement this within the BPF arena instead of the current approach?

No - cid-form struct_ops translates cids <-> cpus at the ops boundary,
so the kernel needs direct access to the mapping. Building it also
pulls from cpu_to_node() / cacheinfo / sibling masks which aren't
reachable from BPF.

> 2. I noticed rust/kernel/cpumask.rs is already in tree. ... would it be
>    a good time to start adding Rust abstractions ...

I'd wait for a concrete in-tree consumer before growing Rust
wrappers speculatively.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2] sched_ext: Add cmask, a base-windowed bitmap over cid space
  2026-04-21  7:19 ` [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
  2026-04-21 17:30   ` Cheng-Yang Chou
@ 2026-04-21 23:21   ` Tejun Heo
  1 sibling, 0 replies; 28+ messages in thread
From: Tejun Heo @ 2026-04-21 23:21 UTC (permalink / raw)
  To: void, arighi, changwoo; +Cc: sched-ext, emil, linux-kernel

Sub-scheduler code built on cids needs bitmaps scoped to a slice of cid
space (e.g. the idle cids of a shard). A cpumask sized for NR_CPUS wastes
most of its bits for a small window and is awkward in BPF.

scx_cmask covers [base, base + nr_bits). bits[] is aligned to the global
64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64). Any two
cmasks therefore address bits[] against the same global windows, so
cross-cmask word ops reduce to

	dest->bits[i] OP= operand->bits[i - delta]

with no bit-shifting, at the cost of up to one extra storage word for
head misalignment. This alignment guarantee is the reason binary ops
can stay word-level; every mutating helper preserves it.

Kernel side in ext_cid.[hc]; BPF side in tools/sched_ext/include/scx/
cid.bpf.h. BPF side drops the scx_ prefix (redundant in BPF code) and
adds the extra helpers that basic idle-cpu selection needs.

No callers yet.

v2: Narrow to helpers that will be used in the planned changes;
    set/bit/find/zero ops will be added as usage develops.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext_cid.h                   |   63 +++
 tools/sched_ext/include/scx/cid.bpf.h    |  595 +++++++++++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h |    1 
 3 files changed, 659 insertions(+)

--- a/kernel/sched/ext_cid.h
+++ b/kernel/sched/ext_cid.h
@@ -145,4 +145,67 @@ static inline s32 scx_cpu_to_cid(struct
 	return __scx_cpu_to_cid(cpu);
 }
 
+/*
+ * cmask: variable-length, base-windowed bitmap over cid space
+ * -----------------------------------------------------------
+ *
+ * A cmask covers the cid range [base, base + nr_bits). bits[] is aligned to the
+ * global 64-cid grid: bits[0] spans [base & ~63, (base & ~63) + 64), so the
+ * first (base & 63) bits of bits[0] are head padding and any tail past base +
+ * nr_bits is tail padding. Both must stay zero for the lifetime of the mask;
+ * all mutating helpers preserve that invariant.
+ *
+ * Grid alignment means two cmasks always address bits[] against the same global
+ * 64-cid windows, so cross-cmask word ops (AND, OR, ...) reduce to
+ *
+ *	dest->bits[i] OP= operand->bits[i - delta]
+ *
+ * with no bit-shifting, regardless of how the two bases relate mod 64.
+ */
+struct scx_cmask {
+	u32 base;
+	u32 nr_bits;
+	DECLARE_FLEX_ARRAY(u64, bits);
+};
+
+/*
+ * Number of u64 words of bits[] storage that covers @nr_bits regardless of base
+ * alignment. The +1 absorbs up to 63 bits of head padding when base is not
+ * 64-aligned - always allocating one extra word beats branching on base or
+ * splitting the compute.
+ */
+#define SCX_CMASK_NR_WORDS(nr_bits)	(((nr_bits) + 63) / 64 + 1)
+
+/*
+ * Define an on-stack cmask for up to @cap_bits. @name is a struct scx_cmask *
+ * aliasing zero-initialized storage; call scx_cmask_init() to set base/nr_bits.
+ */
+#define SCX_CMASK_DEFINE(name, cap_bits)	\
+	DEFINE_RAW_FLEX(struct scx_cmask, name, bits, SCX_CMASK_NR_WORDS(cap_bits))
+
+static inline bool __scx_cmask_contains(const struct scx_cmask *m, u32 cid)
+{
+	return likely(cid >= m->base && cid < m->base + m->nr_bits);
+}
+
+/* Word in bits[] covering @cid. @cid must satisfy __scx_cmask_contains(). */
+static inline u64 *__scx_cmask_word(const struct scx_cmask *m, u32 cid)
+{
+	return (u64 *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static inline void scx_cmask_init(struct scx_cmask *m, u32 base, u32 nr_bits)
+{
+	m->base = base;
+	m->nr_bits = nr_bits;
+	memset(m->bits, 0, SCX_CMASK_NR_WORDS(nr_bits) * sizeof(u64));
+}
+
+static inline void __scx_cmask_set(struct scx_cmask *m, u32 cid)
+{
+	if (!__scx_cmask_contains(m, cid))
+		return;
+	*__scx_cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
 #endif /* _KERNEL_SCHED_EXT_CID_H */
--- /dev/null
+++ b/tools/sched_ext/include/scx/cid.bpf.h
@@ -0,0 +1,595 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF-side helpers for cids and cmasks. See kernel/sched/ext_cid.h for the
+ * authoritative layout and semantics. The BPF-side helpers use the cmask_*
+ * naming (no scx_ prefix); cmask is the SCX bitmap type so the prefix is
+ * redundant in BPF code. Atomics use __sync_val_compare_and_swap and every
+ * helper is inline (no .c counterpart).
+ *
+ * Included by scx/common.bpf.h; don't include directly.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef __SCX_CID_BPF_H
+#define __SCX_CID_BPF_H
+
+#include "bpf_arena_common.bpf.h"
+
+#ifndef BIT_U64
+#define BIT_U64(nr)		(1ULL << (nr))
+#endif
+#ifndef GENMASK_U64
+#define GENMASK_U64(h, l)	((~0ULL << (l)) & (~0ULL >> (63 - (h))))
+#endif
+
+/*
+ * Storage cap for bounded loops over bits[]. Sized to cover NR_CPUS=8192 with
+ * one extra word for head-misalignment. Increase if deployment targets larger
+ * NR_CPUS.
+ */
+#ifndef CMASK_MAX_WORDS
+#define CMASK_MAX_WORDS 129
+#endif
+
+#define CMASK_NR_WORDS(nr_bits)		(((nr_bits) + 63) / 64 + 1)
+
+static __always_inline bool __cmask_contains(const struct scx_cmask __arena *m, u32 cid)
+{
+	return cid >= m->base && cid < m->base + m->nr_bits;
+}
+
+static __always_inline u64 __arena *__cmask_word(const struct scx_cmask __arena *m, u32 cid)
+{
+	return (u64 __arena *)&m->bits[cid / 64 - m->base / 64];
+}
+
+static __always_inline void cmask_init(struct scx_cmask __arena *m, u32 base, u32 nr_bits)
+{
+	u32 nr_words = CMASK_NR_WORDS(nr_bits), i;
+
+	m->base = base;
+	m->nr_bits = nr_bits;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		m->bits[i] = 0;
+	}
+}
+
+static __always_inline bool cmask_test(const struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return false;
+	return *__cmask_word(m, cid) & BIT_U64(cid & 63);
+}
+
+/*
+ * x86 BPF JIT rejects BPF_OR | BPF_FETCH and BPF_AND | BPF_FETCH on arena
+ * pointers (see bpf_jit_supports_insn() in arch/x86/net/bpf_jit_comp.c). Only
+ * BPF_CMPXCHG / BPF_XCHG / BPF_ADD with FETCH are allowed. Implement
+ * test_and_{set,clear} and the atomic set/clear via a cmpxchg loop.
+ *
+ * CMASK_CAS_TRIES is far above what any non-pathological contention needs.
+ * Exhausting it means the bit update was lost, which corrupts the caller's view
+ * of the bitmap, so raise scx_bpf_error() to abort the scheduler.
+ */
+#define CMASK_CAS_TRIES		1024
+
+static __always_inline void cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (old & bit)
+			return;
+		new = old | bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return;
+	}
+	scx_bpf_error("cmask_set CAS exhausted at cid %u", cid);
+}
+
+static __always_inline void cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (!(old & bit))
+			return;
+		new = old & ~bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return;
+	}
+	scx_bpf_error("cmask_clear CAS exhausted at cid %u", cid);
+}
+
+static __always_inline bool cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (old & bit)
+			return true;
+		new = old | bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return false;
+	}
+	scx_bpf_error("cmask_test_and_set CAS exhausted at cid %u", cid);
+	return false;
+}
+
+static __always_inline bool cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 __arena *w;
+	u64 bit, old, new;
+	u32 i;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	bit = BIT_U64(cid & 63);
+	bpf_for(i, 0, CMASK_CAS_TRIES) {
+		old = *w;
+		if (!(old & bit))
+			return false;
+		new = old & ~bit;
+		if (__sync_val_compare_and_swap(w, old, new) == old)
+			return true;
+	}
+	scx_bpf_error("cmask_test_and_clear CAS exhausted at cid %u", cid);
+	return false;
+}
+
+static __always_inline void __cmask_set(struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return;
+	*__cmask_word(m, cid) |= BIT_U64(cid & 63);
+}
+
+static __always_inline void __cmask_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	if (!__cmask_contains(m, cid))
+		return;
+	*__cmask_word(m, cid) &= ~BIT_U64(cid & 63);
+}
+
+static __always_inline bool __cmask_test_and_set(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 __arena *w;
+	u64 prev;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	prev = *w & bit;
+	*w |= bit;
+	return prev;
+}
+
+static __always_inline bool __cmask_test_and_clear(struct scx_cmask __arena *m, u32 cid)
+{
+	u64 bit = BIT_U64(cid & 63);
+	u64 __arena *w;
+	u64 prev;
+
+	if (!__cmask_contains(m, cid))
+		return false;
+	w = __cmask_word(m, cid);
+	prev = *w & bit;
+	*w &= ~bit;
+	return prev;
+}
+
+static __always_inline void cmask_zero(struct scx_cmask __arena *m)
+{
+	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		m->bits[i] = 0;
+	}
+}
+
+/*
+ * BPF_-prefixed to avoid colliding with the kernel's anonymous CMASK_OP_*
+ * enum in ext_cid.c, which is exported via BTF and reachable through
+ * vmlinux.h.
+ */
+enum {
+	BPF_CMASK_OP_AND,
+	BPF_CMASK_OP_OR,
+	BPF_CMASK_OP_COPY,
+};
+
+static __always_inline void cmask_op_word(struct scx_cmask __arena *dest,
+					  const struct scx_cmask __arena *operand,
+					  u32 di, u32 oi, u64 mask, int op)
+{
+	u64 dv = dest->bits[di];
+	u64 ov = operand->bits[oi];
+	u64 rv;
+
+	if (op == BPF_CMASK_OP_AND)
+		rv = dv & ov;
+	else if (op == BPF_CMASK_OP_OR)
+		rv = dv | ov;
+	else
+		rv = ov;
+
+	dest->bits[di] = (dv & ~mask) | (rv & mask);
+}
+
+static __always_inline void cmask_op(struct scx_cmask __arena *dest,
+				     const struct scx_cmask __arena *operand, int op)
+{
+	u32 d_end = dest->base + dest->nr_bits;
+	u32 o_end = operand->base + operand->nr_bits;
+	u32 lo = dest->base > operand->base ? dest->base : operand->base;
+	u32 hi = d_end < o_end ? d_end : o_end;
+	u32 d_base = dest->base / 64;
+	u32 o_base = operand->base / 64;
+	u32 lo_word, hi_word, i;
+	u64 head_mask, tail_mask;
+
+	if (lo >= hi)
+		return;
+
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+	head_mask = GENMASK_U64(63, lo & 63);
+	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 w = lo_word + i;
+		u64 m;
+
+		if (w > hi_word)
+			break;
+
+		m = GENMASK_U64(63, 0);
+		if (w == lo_word)
+			m &= head_mask;
+		if (w == hi_word)
+			m &= tail_mask;
+
+		cmask_op_word(dest, operand, w - d_base, w - o_base, m, op);
+	}
+}
+
+/*
+ * cmask_and/or/copy only modify @dest bits that lie in the intersection of
+ * [@dest->base, @dest->base + @dest->nr_bits) and [@operand->base,
+ * @operand->base + @operand->nr_bits). Bits in @dest outside that window
+ * keep their prior values - in particular, cmask_copy() does NOT zero @dest
+ * bits that lie outside @operand's range.
+ */
+static __always_inline void cmask_and(struct scx_cmask __arena *dest,
+				      const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_AND);
+}
+
+static __always_inline void cmask_or(struct scx_cmask __arena *dest,
+				     const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_OR);
+}
+
+static __always_inline void cmask_copy(struct scx_cmask __arena *dest,
+				       const struct scx_cmask __arena *operand)
+{
+	cmask_op(dest, operand, BPF_CMASK_OP_COPY);
+}
+
+/**
+ * cmask_next_set - find the first set bit at or after @cid
+ * @m: cmask to search
+ * @cid: starting cid (clamped to @m->base if below)
+ *
+ * Returns the smallest set cid in [@cid, @m->base + @m->nr_bits), or
+ * @m->base + @m->nr_bits if none (the out-of-range sentinel matches the
+ * termination condition used by cmask_for_each()).
+ */
+static __always_inline u32 cmask_next_set(const struct scx_cmask __arena *m, u32 cid)
+{
+	u32 end = m->base + m->nr_bits;
+	u32 base = m->base / 64;
+	u32 last_wi = (end - 1) / 64 - base;
+	u32 start_wi, start_bit, i;
+
+	if (cid < m->base)
+		cid = m->base;
+	if (cid >= end)
+		return end;
+
+	start_wi = cid / 64 - base;
+	start_bit = cid & 63;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 wi = start_wi + i;
+		u64 word;
+		u32 found;
+
+		if (wi > last_wi)
+			break;
+
+		word = m->bits[wi];
+		if (i == 0)
+			word &= GENMASK_U64(63, start_bit);
+		if (!word)
+			continue;
+
+		found = (base + wi) * 64 + __builtin_ctzll(word);
+		if (found >= end)
+			return end;
+		return found;
+	}
+	return end;
+}
+
+static __always_inline u32 cmask_first_set(const struct scx_cmask __arena *m)
+{
+	return cmask_next_set(m, m->base);
+}
+
+#define cmask_for_each(cid, m)							\
+	for ((cid) = cmask_first_set(m);					\
+	     (cid) < (m)->base + (m)->nr_bits;					\
+	     (cid) = cmask_next_set((m), (cid) + 1))
+
+/*
+ * Population count over [base, base + nr_bits). Padding bits in the head/tail
+ * words are guaranteed zero by the mutating helpers, so a flat popcount over
+ * all words is correct.
+ */
+static __always_inline u32 cmask_weight(const struct scx_cmask __arena *m)
+{
+	u32 nr_words = CMASK_NR_WORDS(m->nr_bits), i;
+	u32 count = 0;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		if (i >= nr_words)
+			break;
+		count += __builtin_popcountll(m->bits[i]);
+	}
+	return count;
+}
+
+/*
+ * True if @a and @b share any set bit. Walk only the intersection of their
+ * ranges, matching the semantics of cmask_and().
+ */
+static __always_inline bool cmask_intersects(const struct scx_cmask __arena *a,
+					     const struct scx_cmask __arena *b)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 b_end = b->base + b->nr_bits;
+	u32 lo = a->base > b->base ? a->base : b->base;
+	u32 hi = a_end < b_end ? a_end : b_end;
+	u32 a_base = a->base / 64;
+	u32 b_base = b->base / 64;
+	u32 lo_word, hi_word, i;
+	u64 head_mask, tail_mask;
+
+	if (lo >= hi)
+		return false;
+
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+	head_mask = GENMASK_U64(63, lo & 63);
+	tail_mask = GENMASK_U64((hi - 1) & 63, 0);
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 w = lo_word + i;
+		u64 mask, av, bv;
+
+		if (w > hi_word)
+			break;
+
+		mask = GENMASK_U64(63, 0);
+		if (w == lo_word)
+			mask &= head_mask;
+		if (w == hi_word)
+			mask &= tail_mask;
+
+		av = a->bits[w - a_base] & mask;
+		bv = b->bits[w - b_base] & mask;
+		if (av & bv)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, bounded by the
+ * intersection of the two ranges. Return a->base + a->nr_bits if none found.
+ *
+ * Building block for cmask_next_and_set_wrap(). Callers that want a bounded
+ * scan without wrap call this directly.
+ */
+static __always_inline u32 cmask_next_and_set(const struct scx_cmask __arena *a,
+					      const struct scx_cmask __arena *b,
+					      u32 start)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 b_end = b->base + b->nr_bits;
+	u32 a_wbase = a->base / 64;
+	u32 b_wbase = b->base / 64;
+	u32 lo = a->base > b->base ? a->base : b->base;
+	u32 hi = a_end < b_end ? a_end : b_end;
+	u32 last_wi, start_wi, start_bit, i;
+
+	if (lo >= hi)
+		return a_end;
+	if (start < lo)
+		start = lo;
+	if (start >= hi)
+		return a_end;
+
+	last_wi = (hi - 1) / 64;
+	start_wi = start / 64;
+	start_bit = start & 63;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 abs_wi = start_wi + i;
+		u64 word;
+		u32 found;
+
+		if (abs_wi > last_wi)
+			break;
+
+		word = a->bits[abs_wi - a_wbase] & b->bits[abs_wi - b_wbase];
+		if (i == 0)
+			word &= GENMASK_U64(63, start_bit);
+		if (!word)
+			continue;
+
+		found = abs_wi * 64 + __builtin_ctzll(word);
+		if (found >= hi)
+			return a_end;
+		return found;
+	}
+	return a_end;
+}
+
+/*
+ * Find the next set cid in @m at or after @start, wrapping to @m->base if no
+ * set bit is found in [start, m->base + m->nr_bits). Return m->base +
+ * m->nr_bits if @m is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_set_wrap(const struct scx_cmask __arena *m,
+					       u32 start)
+{
+	u32 end = m->base + m->nr_bits;
+	u32 found;
+
+	found = cmask_next_set(m, start);
+	if (found < end || start <= m->base)
+		return found;
+
+	found = cmask_next_set(m, m->base);
+	return found < start ? found : end;
+}
+
+/*
+ * Find the next cid set in both @a and @b at or after @start, wrapping to
+ * @a->base if none found in the forward half. Return a->base + a->nr_bits
+ * if the intersection is empty.
+ *
+ * Callers do round-robin distribution by passing (last_cid + 1) as @start.
+ */
+static __always_inline u32 cmask_next_and_set_wrap(const struct scx_cmask __arena *a,
+						   const struct scx_cmask __arena *b,
+						   u32 start)
+{
+	u32 a_end = a->base + a->nr_bits;
+	u32 found;
+
+	found = cmask_next_and_set(a, b, start);
+	if (found < a_end || start <= a->base)
+		return found;
+
+	found = cmask_next_and_set(a, b, a->base);
+	return found < start ? found : a_end;
+}
+
+/**
+ * cmask_from_cpumask - translate a kernel cpumask to a cid-space cmask
+ * @m: cmask to fill. Zeroed first; only bits within [@m->base, @m->base +
+ *     @m->nr_bits) are updated - cpus mapping to cids outside that range
+ *     are ignored.
+ * @cpumask: kernel cpumask to translate
+ *
+ * For each cpu in @cpumask, set the cpu's cid in @m. Caller must ensure
+ * @cpumask stays stable across the call (e.g. RCU read lock for
+ * task->cpus_ptr).
+ */
+static __always_inline void cmask_from_cpumask(struct scx_cmask __arena *m,
+					       const struct cpumask *cpumask)
+{
+	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+	s32 cpu;
+
+	cmask_zero(m);
+	bpf_for(cpu, 0, nr_cpu_ids) {
+		s32 cid;
+
+		if (!bpf_cpumask_test_cpu(cpu, cpumask))
+			continue;
+		cid = scx_bpf_cpu_to_cid(cpu);
+		if (cid >= 0)
+			__cmask_set(m, cid);
+	}
+}
+
+/**
+ * cmask_copy_from_kernel - copy a kernel-memory scx_cmask into an arena cmask
+ * @dst: arena cmask to fill. Must be sized for at least @src's bit count.
+ * @src: kernel-memory cmask (e.g. the @cmask arg delivered to ops.set_cmask()).
+ *       Kernel guarantees @src->base == 0.
+ *
+ * Probe the kernel header for nr_bits, zero @dst, then copy @src->bits[]
+ * word by word via bpf_probe_read_kernel. Call scx_bpf_error() on any probe
+ * failure. Intended for set_cmask callbacks where @src is kernel memory that
+ * BPF cmask helpers (which expect __arena pointers) can't touch directly.
+ */
+static __always_inline void cmask_copy_from_kernel(struct scx_cmask __arena *dst,
+						   const struct scx_cmask *src)
+{
+	u32 nr_bits = 0, nr_words, dst_nr_words, wi;
+
+	if (bpf_probe_read_kernel(&nr_bits, sizeof(nr_bits), &src->nr_bits)) {
+		scx_bpf_error("probe-read cmask->nr_bits failed");
+		return;
+	}
+
+	nr_words = CMASK_NR_WORDS(nr_bits);
+	dst_nr_words = CMASK_NR_WORDS(dst->nr_bits);
+	if (nr_words > dst_nr_words) {
+		scx_bpf_error("src cmask nr_bits=%u exceeds dst capacity",
+			      nr_bits);
+		return;
+	}
+
+	cmask_zero(dst);
+	bpf_for(wi, 0, CMASK_MAX_WORDS) {
+		u64 word = 0;
+		if (wi >= nr_words)
+			break;
+		if (bpf_probe_read_kernel(&word, sizeof(u64), &src->bits[wi])) {
+			scx_bpf_error("probe-read cmask->bits[%u] failed", wi);
+			return;
+		}
+		dst->bits[wi] = word;
+	}
+}
+
+#endif /* __SCX_CID_BPF_H */
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -1055,5 +1055,6 @@ static inline u64 scx_clock_irq(u32 cpu)
 
 #include "compat.bpf.h"
 #include "enums.bpf.h"
+#include "cid.bpf.h"
 
 #endif	/* __SCX_COMMON_BPF_H */

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops
  2026-04-21 18:33   ` Tejun Heo
@ 2026-04-22  1:23     ` Cheng-Yang Chou
  0 siblings, 0 replies; 28+ messages in thread
From: Cheng-Yang Chou @ 2026-04-22  1:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: void, arighi, changwoo, sched-ext, emil, linux-kernel,
	Ching-Chun Huang, Chia-Ping Tsai

Hi Tejun,

On Tue, Apr 21, 2026 at 08:33:51AM -1000, Tejun Heo wrote:
> > 1. Regarding the ext_cid feature (same as ext_idle), is it feasible to
> >    implement this within the BPF arena instead of the current approach?
> 
> No - cid-form struct_ops translates cids <-> cpus at the ops boundary,
> so the kernel needs direct access to the mapping. Building it also
> pulls from cpu_to_node() / cacheinfo / sibling masks which aren't
> reachable from BPF.

I see. Thanks for the details. (I was wondering if there might be an
opportunity to contribute here, but it sounds non-trivial. I'll just
keep it in mind for now.)

> 
> > 2. I noticed rust/kernel/cpumask.rs is already in tree. ... would it be
> >    a good time to start adding Rust abstractions ...
> 
> I'd wait for a concrete in-tree consumer before growing Rust
> wrappers speculatively.

Agreed, it's definitely too early for that.

Should we also update the doc? Perhaps adding a dedicated 'Topological
CPU IDs (cids)' section would be better than just a one-liner. Also,
is there a need for a standalone selftest? I noticed scx_qmap already
utilizes ext_cid, so I wasn't sure if a separate test is required.

Happy to help with either, thanks!

-- 
Cheers,
Cheng-Yang


end of thread, other threads:[~2026-04-22  1:23 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-21  7:19 [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Tejun Heo
2026-04-21  7:19 ` [PATCH 01/16] sched_ext: Rename ops_cpu_valid() to scx_cpu_valid() and expose it Tejun Heo
2026-04-21 13:31   ` Cheng-Yang Chou
2026-04-21  7:19 ` [PATCH 02/16] sched_ext: Move scx_exit(), scx_error() and friends to ext_internal.h Tejun Heo
2026-04-21 13:36   ` Cheng-Yang Chou
2026-04-21  7:19 ` [PATCH 03/16] sched_ext: Shift scx_kick_cpu() validity check to scx_bpf_kick_cpu() Tejun Heo
2026-04-21 13:49   ` Cheng-Yang Chou
2026-04-21  7:19 ` [PATCH 04/16] sched_ext: Relocate cpu_acquire/cpu_release to end of struct sched_ext_ops Tejun Heo
2026-04-21 13:58   ` Cheng-Yang Chou
2026-04-21  7:19 ` [PATCH 05/16] sched_ext: Make scx_enable() take scx_enable_cmd Tejun Heo
2026-04-21 14:25   ` Cheng-Yang Chou
2026-04-21  7:19 ` [PATCH 06/16] sched_ext: Add topological CPU IDs (cids) Tejun Heo
2026-04-21 17:15   ` [PATCH v2 sched_ext/for-7.2] " Tejun Heo
2026-04-21  7:19 ` [PATCH 07/16] sched_ext: Add scx_bpf_cid_override() kfunc Tejun Heo
2026-04-21  7:19 ` [PATCH 08/16] tools/sched_ext: Add struct_size() helpers to common.bpf.h Tejun Heo
2026-04-21  7:19 ` [PATCH 09/16] sched_ext: Add cmask, a base-windowed bitmap over cid space Tejun Heo
2026-04-21 17:30   ` Cheng-Yang Chou
2026-04-21 23:21   ` [PATCH v2] " Tejun Heo
2026-04-21  7:19 ` [PATCH 10/16] sched_ext: Add cid-form kfunc wrappers alongside cpu-form Tejun Heo
2026-04-21  7:19 ` [PATCH 11/16] sched_ext: Add bpf_sched_ext_ops_cid struct_ops type Tejun Heo
2026-04-21  7:19 ` [PATCH 12/16] sched_ext: Forbid cpu-form kfuncs from cid-form schedulers Tejun Heo
2026-04-21  7:19 ` [PATCH 13/16] tools/sched_ext: scx_qmap: Restart on hotplug instead of cpu_online/offline Tejun Heo
2026-04-21  7:19 ` [PATCH 14/16] tools/sched_ext: scx_qmap: Add cmask-based idle tracking and cid-based idle pick Tejun Heo
2026-04-21  7:19 ` [PATCH 15/16] tools/sched_ext: scx_qmap: Port to cid-form struct_ops Tejun Heo
2026-04-21  7:19 ` [PATCH 16/16] sched_ext: Require cid-form struct_ops for sub-sched support Tejun Heo
2026-04-21 18:18 ` [PATCHSET sched_ext/for-7.2] sched_ext: Topological CPU IDs and cid-form struct_ops Cheng-Yang Chou
2026-04-21 18:33   ` Tejun Heo
2026-04-22  1:23     ` Cheng-Yang Chou
