[PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers

* [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers
@ 2026-07-03  8:01 Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched() Tejun Heo
                   ` (31 more replies)
  0 siblings, 32 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Hello,

The existing sub-scheduler support covers only part of scheduling. An
scx_sched can attach to a cgroup subtree under another scx_sched and detach,
and the dispatch path has programmatic delegation: dispatch runs as top-down
recursion via scx_bpf_sub_dispatch(), so a parent decides when a child gets
to pick tasks for the local cpu by calling into it.

Sub-sched support on the enqueue path is not implemented yet. Every scheduler
in the hierarchy can insert into every local DSQ, so a parent has no way to
partition CPUs among its children or protect its own share from them.

This patchset adds that control. Most importantly it enables sub-sched
support on the enqueue path, and it extends the same control to the other
paths that touch cpu access: occupancy, kicks and preemption, and idle
reporting.

The enqueue path cannot use the dispatch path's programmatic model.
Dispatch is top-down and the delegation is implicit in the call chain.
Enqueue starts at the other end: a leaf picks a cid for a task, and which
cids it may use is the cumulative result of its ancestors' delegation
decisions. Resolving that programmatically on every enqueue would mean a
cross-sched round-trip call chain, possibly retrying when a request
cannot be granted as-is. Delegation is therefore state, not calls:

- Ownership is tracked per (scheduler, cid) as a set of capabilities:

  - SCX_CAP_ENQ_IMMED: insert an IMMED task onto the cid's local DSQ.
    This is the baseline cap (SCX_CAP_BASE) required to make any use of
    a cpu.

  - SCX_CAP_ENQ: insert any task onto the cid's local DSQ.

  - SCX_CAP_PREEMPT: preempt a task outside the scheduler's own subtree.

  Higher caps imply the lower ones for the holder's own use.

- The root owns every cap on every cid. A parent delegates caps to a
  child with scx_bpf_sub_grant() and takes them back with
  scx_bpf_sub_revoke(). A child's cap set is always a subset of its
  parent's.

- The cap state is sharded. The cid space is split into topology-aligned
  shards, and each scheduler tracks its caps in per-shard cmasks with a
  per-shard lock. Operations are broken up on shard boundaries and different
  shards never contend. Shards are expected to serve as the locality unit
  when cids are handed out to schedulers, so lock granularity scales
  naturally with the allocation pattern.

- Hot paths check a per-cpu effective-caps copy with a single read.
  Grant/revoke rings a per-cpu doorbell and the owning cpu folds the new
  config in at the top of balance(). Cross-sched communication happens only
  when the delegation set changes.
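
As a rough illustration of the cap model above, the following userspace
sketch treats a cap set as a bitmask. The bit values, the TOY_ names and
the helper functions are hypothetical and are not taken from the patches.
The sketch also assumes the caps are ordered as listed, with PREEMPT above
ENQ above ENQ_IMMED.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding; the real SCX_CAP_* values are not shown here. */
enum {
	TOY_CAP_ENQ_IMMED = 1u << 0,	/* baseline cap */
	TOY_CAP_ENQ       = 1u << 1,
	TOY_CAP_PREEMPT   = 1u << 2,
};

/* Expand a held set with the lower caps it implies for the holder's own use. */
static uint32_t toy_caps_effective(uint32_t held)
{
	uint32_t eff = held;

	if (eff & TOY_CAP_PREEMPT)
		eff |= TOY_CAP_ENQ;
	if (eff & TOY_CAP_ENQ)
		eff |= TOY_CAP_ENQ_IMMED;
	return eff;
}

/* May a holder of @held perform an operation that needs @needed? */
static bool toy_caps_allow(uint32_t held, uint32_t needed)
{
	return (toy_caps_effective(held) & needed) == needed;
}

/*
 * A grant may only pass down caps the parent holds directly. Implied caps
 * are deliberately not consulted, which keeps the child a subset of the
 * parent's directly held set.
 */
static bool toy_grant_ok(uint32_t parent_held, uint32_t grant)
{
	return (grant & ~parent_held) == 0;
}
```

In this model a holder of only TOY_CAP_ENQ may insert IMMED tasks for
itself, but cannot hand TOY_CAP_ENQ_IMMED to a child.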

Enforcement:

- An insert that lacks the required cap is diverted to a kernel-internal
  per-rq reject DSQ and handed back to the BPF scheduler to re-decide,
  tagged with SCX_TASK_REENQ_CAP and the missing caps.

- A revoke evicts already-queued and running tasks through the same
  reenqueue path.

- CPU occupancy is tied to the task slice. Extending a slice requires
  baseline access on the cpu.

- Kicks are enforced at delivery. Any kick needs baseline access on the
  target cid, and SCX_KICK_PREEMPT degrades to a plain reschedule if the
  victim is outside the kicker's subtree and the kicker lacks
  SCX_CAP_PREEMPT.
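
The kick rule above can be read as a small decision function. The sketch
below is a userspace model only: the enum, the function name and its
boolean parameters are hypothetical, and treating a kick without baseline
access as dropped is an assumption, since the text only says that baseline
access is required.

```c
#include <stdbool.h>

enum toy_kick_result {
	TOY_KICK_DROPPED,	/* assumed outcome when baseline access is missing */
	TOY_KICK_RESCHED,	/* plain reschedule */
	TOY_KICK_PREEMPT,	/* preempting kick delivered as requested */
};

/*
 * Decide how a kick is delivered.
 * @has_base:          kicker holds the baseline cap on the target cid
 * @wants_preempt:     kicker asked for a preempting kick
 * @victim_in_subtree: the running task belongs to the kicker's subtree
 * @has_preempt_cap:   kicker holds the preempt cap on the target cid
 */
static enum toy_kick_result toy_resolve_kick(bool has_base, bool wants_preempt,
					     bool victim_in_subtree,
					     bool has_preempt_cap)
{
	if (!has_base)
		return TOY_KICK_DROPPED;
	if (!wants_preempt)
		return TOY_KICK_RESCHED;
	/* preempting inside the kicker's own subtree needs no extra cap */
	if (victim_in_subtree || has_preempt_cap)
		return TOY_KICK_PREEMPT;
	/* outside the subtree without the cap: degrade to a reschedule */
	return TOY_KICK_RESCHED;
}
```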

Notifications:

- ops.sub_caps_updated() reports config-level changes per shard with
  coalescing.

- ops.sub_ecaps_updated() reports per-cpu effective changes when they
  take effect.

- ops.update_idle() is routed to the cid's owner, with a re-notification
  when ownership changes while the cpu sits idle.

A parent can evict a misbehaving child with scx_bpf_sub_kill().

To make it concrete, consider a root scheduler R with a child A, which
in turn has a child A1:

- On enable, R owns every cap on every cid. A child starts with no caps
  and cannot use any cpu on its own, so a parent normally makes an
  initial grant when a child attaches.

- R grants cids 8-15 to A by calling scx_bpf_sub_grant() with A's cgroup
  id and a cmask. A grant is all-or-nothing per cid and can only hand
  down caps that R holds directly, not ones it holds only through
  implication.

- A learns of the grant through ops.sub_caps_updated(). As each cpu
  folds the change into its effective caps, A gets
  ops.sub_ecaps_updated() for that cid, and if the cpu is sitting idle,
  an ops.update_idle() re-notification so A can use it right away.

- Nesting chains through the same notifier: from its
  ops.sub_caps_updated(), A calls scx_bpf_sub_grant() to pass a subset, say
  cids 12-15, down to A1, whose own ops.sub_caps_updated() then fires. A1's
  cap set is a subset of A's, which is a subset of R's. Caps A holds only
  through implication cannot be re-delegated.

From A's point of view, using and then losing a cpu looks like this:

- Holding SCX_CAP_ENQ on cid 8, A inserts its tasks into cid 8's local
  DSQ from its enqueue and dispatch paths, extends their slices, and
  kicks the cpu as usual.

- R revokes cid 8. The revoke clears it across A's whole subtree, so A1
  loses whatever it held on the cid too.

- Once the revoke reaches the cpu's effective caps, A's tasks queued on
  cid 8 are handed back to A's ops.enqueue() tagged with
  SCX_TASK_REENQ_CAP and the missing caps, a running task is evicted by
  zeroing its slice, and new inserts are rejected and handed back the
  same way.

- A places the bounced tasks somewhere it still holds, and the notifiers
  tell it that its holdings shrank.
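
The R / A / A1 walkthrough can be modelled in a few lines of userspace C.
This is a toy under stated assumptions: it tracks a single cap per cid in
one 64-bit mask, ignores sharding, locking, doorbells and notifiers, and
all toy_ names are hypothetical rather than kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_CHILDREN 4

/* One scheduler in the hierarchy; bit N of @cids set means cid N is held. */
struct toy_sched {
	uint64_t cids;
	struct toy_sched *children[TOY_MAX_CHILDREN];
	size_t nr_children;
};

/*
 * Grant @cids from @parent to @child. Only cids the parent holds are handed
 * down, so the child always stays a subset. Returns the mask actually granted.
 */
static uint64_t toy_grant(const struct toy_sched *parent,
			  struct toy_sched *child, uint64_t cids)
{
	uint64_t granted = cids & parent->cids;

	child->cids |= granted;
	return granted;
}

/* Revoke @cids from @sch and, recursively, from its whole subtree. */
static void toy_revoke(struct toy_sched *sch, uint64_t cids)
{
	size_t i;

	sch->cids &= ~cids;
	for (i = 0; i < sch->nr_children; i++)
		toy_revoke(sch->children[i], cids);
}
```

With R holding everything, granting mask 0xff00 (cids 8-15) to A and then
0xf000 (cids 12-15) from A to A1 mirrors the example. Revoking a cid from A
clears it in A1 as well while R keeps it.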

The final two patches expand the scx_qmap hierarchical demo: a qmap
instance splits the cpus it fully owns among itself and child qmaps in
proportion to cpu.weight, time-shares the rounding leftovers through a
round-robin pool, and can fault-inject dispatches to unheld cids to
demonstrate the kernel-side enforcement.

Based on sched_ext/for-7.3 (daf8e166ba59).

This patchset contains the following 32 patches.

 0001 sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched()
 0002 tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight()
 0003 sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
 0004 tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types
 0005 sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers
 0006 sched_ext: Make scx_bpf_kick_cid() return void
 0007 sched_ext: Make the kick machinery per-sched
 0008 sched_ext: Add ops.init_cids() to finalize the cid layout before init
 0009 sched_ext: Add CID sharding
 0010 sched_ext: Add shard boundaries to scx_bpf_cid_override()
 0011 sched_ext: Defer scx_sched kobj sysfs add into the enable workfns
 0012 sched_ext: Add per-shard scx_sched storage scaffolding
 0013 sched_ext: Add scx_cmask_ref for validated arena cmask access
 0014 sched_ext: RCU-protect the sub-sched tree's children/sibling lists
 0015 sched_ext: Add scx_skip_subtree_pre()
 0016 sched_ext: Add per-shard cap delegation for sub-schedulers
 0017 sched_ext: Add coalescing sub_caps_updated() notifier for sub-schedulers
 0018 sched_ext: Maintain per-cpu effective cap copies for single-read checks
 0019 sched_ext: Add sub_ecaps_updated() effective-cap change notifier
 0020 sched_ext: Generalize local-DSQ handling to rq-owned DSQs
 0021 sched_ext: Add reject DSQ for cap-rejected dispatches
 0022 sched_ext: Add the SCX_CAP_ENQ_IMMED cap
 0023 sched_ext: Assign a unique id to each scheduler instance
 0024 sched_ext: Route task slice writes through set_task_slice()
 0025 sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice
 0026 sched_ext: Add the SCX_CAP_ENQ cap
 0027 sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
 0028 sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds
 0029 sched_ext: Replay ecaps notifications suppressed by bypass
 0030 sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler
 0031 tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling
 0032 tools/sched_ext: scx_qmap - Add sub-sched cap fault injection

The patches are organized as follows:

- 01-05: fixes and small prep.

- 06-15: plumbing. Per-sched kick machinery (06-07), ops.init_cids()
  (08), cid sharding (09-10), sysfs and per-shard storage prep (11-12),
  validated arena cmask access (13) and RCU-safe tree walking (14-15).

- 16-19: cap delegation. Per-shard caps with grant/revoke (16), the
  coalescing sub_caps_updated() notifier (17), per-cpu effective caps
  (18) and the sub_ecaps_updated() notifier (19).

- 20-27: enforcement. The reject DSQ (20-21), SCX_CAP_ENQ_IMMED (22),
  slice-based occupancy (23-25), SCX_CAP_ENQ (26) and preemption and
  kick gating (27).

- 28-30: idle routing and re-notification (28), bypass replay (29) and
  scx_bpf_sub_kill() (30).

- 31-32: scx_qmap hierarchical demo.

The patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-caps

diffstat follows. Thanks.

 include/linux/sched/ext.h                |   32 +-
 kernel/sched/ext/cid.c                   |  474 ++++++++++---
 kernel/sched/ext/cid.h                   |   15 +-
 kernel/sched/ext/ext.c                   |  792 +++++++++++++++++-----
 kernel/sched/ext/idle.c                  |   68 +-
 kernel/sched/ext/internal.h              |  303 ++++++++-
 kernel/sched/ext/sub.c                   | 1058 +++++++++++++++++++++++++++++-
 kernel/sched/ext/sub.h                   |  117 +++-
 kernel/sched/ext/types.h                 |   72 +-
 kernel/sched/sched.h                     |   14 +-
 tools/sched_ext/include/scx/cid.bpf.h    |   88 ++-
 tools/sched_ext/include/scx/common.bpf.h |   26 +-
 tools/sched_ext/include/scx/compat.bpf.h |   11 +-
 tools/sched_ext/scx_qmap.bpf.c           |  821 +++++++++++++++++++++--
 tools/sched_ext/scx_qmap.c               |  375 ++++++++++-
 tools/sched_ext/scx_qmap.h               |  147 ++++-
 16 files changed, 3979 insertions(+), 434 deletions(-)

--
tejun


* [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched()
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 02/32] tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight() Tejun Heo
                   ` (30 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

scx_alloc_and_add_sched() publishes @sch through ops->priv before allocating
the cgroup path. If that allocation fails, the unwind path clears ops->priv
and frees @sch immediately. scx_prog_sched() callers can dereference
ops->priv from RCU context the moment it is set, so freeing without a grace
period can use-after-free a concurrent kfunc caller.

Move the publication below the cgroup path allocation so that every failure
path after publication frees @sch through kobject_put(), whose release path
defers the freeing by a grace period.

Fixes: 105dcd005be2 ("sched_ext: Introduce scx_prog_sched()")
Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 1a0ec985da77..f4725698f5ef 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -6363,11 +6363,6 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 		sch->ops = *cmd->ops;
 	}
 
-	rcu_assign_pointer(ops->priv, sch);
-
-	sch->kobj.kset = scx_kset;
-	INIT_LIST_HEAD(&sch->all);
-
 #ifdef CONFIG_EXT_SUB_SCHED
 	char *buf = kzalloc(PATH_MAX, GFP_KERNEL);
 	if (!buf) {
@@ -6385,7 +6380,19 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 	sch->cgrp = cgrp;
 	INIT_LIST_HEAD(&sch->children);
 	INIT_LIST_HEAD(&sch->sibling);
+#endif	/* CONFIG_EXT_SUB_SCHED */
 
+	/*
+	 * Publishing makes @sch visible to scx_prog_sched() readers. Failure
+	 * paths after this point must free @sch through kobject_put() whose
+	 * release path defers the actual freeing by an RCU grace period.
+	 */
+	rcu_assign_pointer(ops->priv, sch);
+
+	sch->kobj.kset = scx_kset;
+	INIT_LIST_HEAD(&sch->all);
+
+#ifdef CONFIG_EXT_SUB_SCHED
 	if (parent) {
 		/*
 		 * Pin @parent for @sch's lifetime. The kobject hierarchy pins
@@ -6440,7 +6447,6 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 
 #ifdef CONFIG_EXT_SUB_SCHED
 err_free_lb_resched:
-	RCU_INIT_POINTER(ops->priv, NULL);
 	free_cpumask_var(sch->stall_cpus);
 #endif
 err_free_lb_resched_cpumask:
-- 
2.54.0



* [PATCH sched_ext/for-7.3 02/32] tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight()
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched() Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants Tejun Heo
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

cmask_equal(), cmask_weight() and cmask_subset() bounded their word walks
with CMASK_NR_WORDS(nr_cids), which pads by one word and can't tell the last
word in use without @base. The walks could thus cover a slack word past the
active range, which cmask_reframe() leaves non-zero: a stale bit there gave
cmask_equal() a spurious mismatch, cmask_weight() an inflated count, and
cmask_subset() a spurious violation. cmask_subset() could also read
@b->bits[] one word past its allocation (within the arena's fault-recovered
range, so harmless), and deviated from the kernel scx_cmask_subset() by
failing any @a range that doesn't nest inside @b's even when the overhanging
bits are all clear.

Bound the cmask_equal() and cmask_weight() walks by the words the range
actually spans, with early returns for empty ranges. Rewrite cmask_subset()
to match the kernel semantics: scan @a's overhangs for set bits with
cmask_next_set() and walk the words of the range intersection.
cmask_subset() moves below cmask_next_set(), which it now uses. Padding bits
don't need masking as every cmask helper keeps them clear.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
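
The word-count arithmetic in the commit message above can be checked in
isolation. The sketch below is userspace C with hypothetical names. The
padded macro is an assumption about how CMASK_NR_WORDS() is defined, based
only on the description that it pads by one word.

```c
#include <stdint.h>

/* Assumed shape of the padded count: enough words for @nr bits, plus one. */
#define TOY_NR_WORDS_PADDED(nr)	(((nr) + 63) / 64 + 1)

/*
 * Number of 64-bit words that the cid range [base, base + nr_cids) actually
 * touches. This needs @base: the same nr_cids can span a different number of
 * words depending on where the range starts within its first word.
 */
static uint32_t toy_words_spanned(uint32_t base, uint32_t nr_cids)
{
	if (!nr_cids)
		return 0;
	return (base + nr_cids - 1) / 64 - base / 64 + 1;
}
```

For a word-aligned range such as base 0 with 64 cids, the range spans one
word while the padded count is two, so a walk bounded by the padded count
would visit one slack word. For base 60 with 8 cids the range straddles a
word boundary and both counts are two.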
 tools/sched_ext/include/scx/cid.bpf.h | 88 +++++++++++++++++----------
 1 file changed, 55 insertions(+), 33 deletions(-)

diff --git a/tools/sched_ext/include/scx/cid.bpf.h b/tools/sched_ext/include/scx/cid.bpf.h
index db247e42fb45..6b0b4e41b288 100644
--- a/tools/sched_ext/include/scx/cid.bpf.h
+++ b/tools/sched_ext/include/scx/cid.bpf.h
@@ -391,7 +391,9 @@ static __always_inline bool cmask_equal(const struct scx_cmask __arena *a,
 
 	if (a->base != b->base || a->nr_cids != b->nr_cids)
 		return false;
-	nr_words = CMASK_NR_WORDS(a->nr_cids);
+	if (a->nr_cids == 0)
+		return true;
+	nr_words = (a->base + a->nr_cids - 1) / 64 - a->base / 64 + 1;
 
 	bpf_for(i, 0, CMASK_MAX_WORDS) {
 		if (i >= nr_words)
@@ -402,36 +404,6 @@ static __always_inline bool cmask_equal(const struct scx_cmask __arena *a,
 	return true;
 }
 
-/*
- * True iff every bit set in @a is also set in @b over the intersection of
- * their ranges. Bits of @a outside @b's range fail the test.
- */
-static __always_inline bool cmask_subset(const struct scx_cmask __arena *a,
-					 const struct scx_cmask __arena *b)
-{
-	u32 a_end = a->base + a->nr_cids;
-	u32 b_end = b->base + b->nr_cids;
-	u32 a_wbase = a->base / 64;
-	u32 b_wbase = b->base / 64;
-	u32 nr_words, i;
-
-	/* any bit of @a outside @b's range is a subset violation */
-	if (a->base < b->base || a_end > b_end)
-		return false;
-
-	nr_words = CMASK_NR_WORDS(a->nr_cids);
-	bpf_for(i, 0, CMASK_MAX_WORDS) {
-		u32 wi_b;
-
-		if (i >= nr_words)
-			break;
-		wi_b = a_wbase + i - b_wbase;
-		if (a->bits[i] & ~b->bits[wi_b])
-			return false;
-	}
-	return true;
-}
-
 /**
  * cmask_next_set - find the first set bit at or after @cid
  * @m: cmask to search
@@ -488,16 +460,66 @@ static __always_inline u32 cmask_first_set(const struct scx_cmask __arena *m)
 	     (cid) < (m)->base + (m)->nr_cids;					\
 	     (cid) = cmask_next_set((m), (cid) + 1))
 
+/*
+ * True iff every bit set in @a is also set in @b. Matches the kernel-side
+ * scx_cmask_subset(): ranges don't need to nest, and set bits of @a outside
+ * @b's range fail the test.
+ */
+static __always_inline bool cmask_subset(const struct scx_cmask __arena *a,
+					 const struct scx_cmask __arena *b)
+{
+	u32 a_end = a->base + a->nr_cids;
+	u32 b_end = b->base + b->nr_cids;
+	u32 a_wbase = a->base / 64;
+	u32 b_wbase = b->base / 64;
+	u32 lo = a->base > b->base ? a->base : b->base;
+	u32 hi = a_end < b_end ? a_end : b_end;
+	u32 lo_word, hi_word, i;
+
+	/* set bits of @a outside @b's range can't be in @b */
+	if (a->base < b->base &&
+	    cmask_next_set(a, a->base) < (b->base < a_end ? b->base : a_end))
+		return false;
+	if (a_end > b_end &&
+	    cmask_next_set(a, a->base > b_end ? a->base : b_end) < a_end)
+		return false;
+
+	if (lo >= hi)
+		return true;
+
+	/*
+	 * Walk the words the range intersection spans. Plain word tests
+	 * suffice: the scans above guarantee @a has no set bit outside @b's
+	 * range and padding bits are kept clear by all cmask helpers.
+	 */
+	lo_word = lo / 64;
+	hi_word = (hi - 1) / 64;
+
+	bpf_for(i, 0, CMASK_MAX_WORDS) {
+		u32 w = lo_word + i;
+
+		if (w > hi_word)
+			break;
+		if (a->bits[w - a_wbase] & ~b->bits[w - b_wbase])
+			return false;
+	}
+	return true;
+}
+
 /*
  * Population count over [base, base + nr_cids). Padding bits in the head/tail
  * words are guaranteed zero by the mutating helpers, so a flat popcount over
- * all words is correct.
+ * the words the range spans is correct.
  */
 static __always_inline u32 cmask_weight(const struct scx_cmask __arena *m)
 {
-	u32 nr_words = CMASK_NR_WORDS(m->nr_cids), i;
+	u32 nr_words, i;
 	u32 count = 0;
 
+	if (!m->nr_cids)
+		return 0;
+	nr_words = (m->base + m->nr_cids - 1) / 64 - m->base / 64 + 1;
+
 	bpf_for(i, 0, CMASK_MAX_WORDS) {
 		if (i >= nr_words)
 			break;
-- 
2.54.0



* [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched() Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 02/32] tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight() Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:33   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 04/32] tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types Tejun Heo
                   ` (28 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The cmask ops can operate on BPF-arena cmasks which BPF programs may read
and write concurrently. The _RACY op variants existed to make such lockless
reads explicit but this turned out to be too restrictive. Mark the word
accesses in all the two-cmask ops with READ_ONCE/WRITE_ONCE instead and drop
the _RACY variants.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/cid.c | 53 +++++++-----------------------------------
 kernel/sched/ext/cid.h |  2 --
 2 files changed, 9 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index af83084ec740..4b7279ca740f 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -395,17 +395,15 @@ __bpf_kfunc s32 scx_bpf_cpu_to_cid(s32 cpu, const struct bpf_prog_aux *aux)
  * bits outside stay untouched. In particular, scx_cmask_copy() does NOT zero
  * @dst bits that lie outside @src's range.
  *
- * The _RACY variants are otherwise identical to their non-racy counterpart but
- * read @src word-by-word via data_race(). Memory ordering with concurrent
- * writers is the caller's responsibility.
+ * Word accesses use READ_ONCE/WRITE_ONCE so a caller may read @src
+ * locklessly. Memory ordering against concurrent writers is the caller's
+ * responsibility.
  */
 enum cmask_op2 {
 	/* mutating */
 	CMASK_OP2_AND,
 	CMASK_OP2_OR,
-	CMASK_OP2_OR_RACY,
 	CMASK_OP2_COPY,
-	CMASK_OP2_COPY_RACY,
 	CMASK_OP2_ANDNOT,
 	/* predicates - short-circuit when the per-word result is true */
 	CMASK_OP2_SUBSET,
@@ -422,28 +420,22 @@ static __always_inline bool cmask_word_op2(u64 *av, const u64 *bp, u64 mask,
 {
 	switch (op) {
 	case CMASK_OP2_AND:
-		*av &= ~mask | *bp;
+		WRITE_ONCE(*av, *av & (~mask | READ_ONCE(*bp)));
 		return false;
 	case CMASK_OP2_OR:
-		*av |= *bp & mask;
-		return false;
-	case CMASK_OP2_OR_RACY:
-		*av |= data_race(*bp) & mask;
+		WRITE_ONCE(*av, *av | (READ_ONCE(*bp) & mask));
 		return false;
 	case CMASK_OP2_COPY:
-		*av = (*av & ~mask) | (*bp & mask);
-		return false;
-	case CMASK_OP2_COPY_RACY:
-		*av = (*av & ~mask) | (data_race(*bp) & mask);
+		WRITE_ONCE(*av, (*av & ~mask) | (READ_ONCE(*bp) & mask));
 		return false;
 	case CMASK_OP2_ANDNOT:
-		*av &= ~(*bp & mask);
+		WRITE_ONCE(*av, *av & ~(READ_ONCE(*bp) & mask));
 		return false;
 	case CMASK_OP2_SUBSET:
 		/* stop on the first bit in @sub not set in @super */
-		return (*bp & ~*av) & mask;
+		return (READ_ONCE(*bp) & ~READ_ONCE(*av)) & mask;
 	case CMASK_OP2_INTERSECTS:
-		return (*av & *bp) & mask;
+		return (READ_ONCE(*av) & READ_ONCE(*bp)) & mask;
 	}
 	unreachable();
 }
@@ -556,39 +548,12 @@ void scx_cmask_or(struct scx_cmask *dst, const struct scx_cmask *src)
 		       src->bits, src->base, src->nr_cids, CMASK_OP2_OR);
 }
 
-/**
- * scx_cmask_or_racy - OR @src into @dst, reading @src without locking
- *
- * @src is read word-by-word through data_race(). Same per-bit independence
- * rationale as scx_cmask_copy_racy(). Memory ordering with writers is the
- * caller's responsibility.
- */
-void scx_cmask_or_racy(struct scx_cmask *dst, const struct scx_cmask *src)
-{
-	cmask_walk_op2(dst->bits, dst->base, dst->nr_cids,
-		       src->bits, src->base, src->nr_cids, CMASK_OP2_OR_RACY);
-}
-
 void scx_cmask_copy(struct scx_cmask *dst, const struct scx_cmask *src)
 {
 	cmask_walk_op2(dst->bits, dst->base, dst->nr_cids,
 		       src->bits, src->base, src->nr_cids, CMASK_OP2_COPY);
 }
 
-/**
- * scx_cmask_copy_racy - Snapshot @src into @dst without locking
- *
- * @src is read word-by-word through data_race(). Head/tail masking matches
- * scx_cmask_copy(). Each bit in a cmask is independent, so partial updates
- * just leave some bits fresher than others. Memory ordering with writers is
- * the caller's responsibility.
- */
-void scx_cmask_copy_racy(struct scx_cmask *dst, const struct scx_cmask *src)
-{
-	cmask_walk_op2(dst->bits, dst->base, dst->nr_cids,
-		       src->bits, src->base, src->nr_cids, CMASK_OP2_COPY_RACY);
-}
-
 void scx_cmask_andnot(struct scx_cmask *dst, const struct scx_cmask *src)
 {
 	cmask_walk_op2(dst->bits, dst->base, dst->nr_cids,
diff --git a/kernel/sched/ext/cid.h b/kernel/sched/ext/cid.h
index 9c4f4b907f12..54b10df32fd5 100644
--- a/kernel/sched/ext/cid.h
+++ b/kernel/sched/ext/cid.h
@@ -57,9 +57,7 @@ void scx_cmask_clear(struct scx_cmask *m);
 void scx_cmask_fill(struct scx_cmask *m);
 void scx_cmask_and(struct scx_cmask *dst, const struct scx_cmask *src);
 void scx_cmask_or(struct scx_cmask *dst, const struct scx_cmask *src);
-void scx_cmask_or_racy(struct scx_cmask *dst, const struct scx_cmask *src);
 void scx_cmask_copy(struct scx_cmask *dst, const struct scx_cmask *src);
-void scx_cmask_copy_racy(struct scx_cmask *dst, const struct scx_cmask *src);
 void scx_cmask_andnot(struct scx_cmask *dst, const struct scx_cmask *src);
 bool scx_cmask_subset(const struct scx_cmask *sub, const struct scx_cmask *super);
 bool scx_cmask_intersects(const struct scx_cmask *a, const struct scx_cmask *b);
-- 
2.54.0



* [PATCH sched_ext/for-7.3 04/32] tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (2 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 05/32] sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers Tejun Heo
                   ` (27 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

scx_qmap.c and the shared scx_qmap.h mixed __u64/__u32/__s32 with the bare
typedefs that scx/common.h provides. Convert the remaining __-prefixed
integer types to the bare forms for consistency. The struct fields become
bare u64 (uint64_t), so the stats printfs that fed them to %llu now cast to
unsigned long long. No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.c | 21 +++++++++++----------
 tools/sched_ext/scx_qmap.h | 34 +++++++++++++++++-----------------
 2 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 67ddd483a4c7..f1eaebcab5dc 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -67,8 +67,8 @@ int main(int argc, char **argv)
 	struct scx_qmap *skel;
 	struct bpf_link *link;
 	struct qmap_arena *qa;
-	__u32 test_error_cnt = 0;
-	__u64 ecode;
+	u32 test_error_cnt = 0;
+	u64 ecode;
 	int opt;
 
 	libbpf_set_print(libbpf_print_fn);
@@ -202,15 +202,16 @@ int main(int argc, char **argv)
 
 		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq/cid0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
-		       qa->nr_reenqueued, qa->nr_reenqueued_cid0,
-		       qa->nr_dequeued,
-		       qa->nr_core_sched_execed,
-		       qa->nr_ddsp_from_enq);
+		       (unsigned long long)qa->nr_reenqueued,
+		       (unsigned long long)qa->nr_reenqueued_cid0,
+		       (unsigned long long)qa->nr_dequeued,
+		       (unsigned long long)qa->nr_core_sched_execed,
+		       (unsigned long long)qa->nr_ddsp_from_enq);
 		printf("         exp_local=%llu exp_remote=%llu exp_timer=%llu exp_lost=%llu\n",
-		       qa->nr_expedited_local,
-		       qa->nr_expedited_remote,
-		       qa->nr_expedited_from_timer,
-		       qa->nr_expedited_lost);
+		       (unsigned long long)qa->nr_expedited_local,
+		       (unsigned long long)qa->nr_expedited_remote,
+		       (unsigned long long)qa->nr_expedited_from_timer,
+		       (unsigned long long)qa->nr_expedited_lost);
 		if (__COMPAT_has_ksym("scx_bpf_cidperf_cur"))
 			printf("cpuperf: cur min/avg/max=%u/%u/%u target min/avg/max=%u/%u/%u\n",
 			       qa->cpuperf_min,
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index d15a705d5ac5..808237540f5c 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -28,10 +28,10 @@
 #define SCX_QMAP_MAX_CPUS	1024
 
 struct cpu_ctx {
-	__u64 dsp_idx;		/* dispatch index */
-	__u64 dsp_cnt;		/* remaining count */
-	__u32 avg_weight;
-	__u32 cpuperf_target;
+	u64 dsp_idx;		/* dispatch index */
+	u64 dsp_cnt;		/* remaining count */
+	u32 avg_weight;
+	u32 cpuperf_target;
 };
 
 /* Opaque to userspace; defined in scx_qmap.bpf.c. */
@@ -40,25 +40,25 @@ struct task_ctx;
 struct qmap_fifo {
 	struct task_ctx __arena *head;
 	struct task_ctx __arena *tail;
-	__s32 idx;
+	s32 idx;
 };
 
 struct qmap_arena {
 	/* userspace-visible stats */
-	__u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cid0;
-	__u64 nr_dequeued, nr_ddsp_from_enq;
-	__u64 nr_core_sched_execed;
-	__u64 nr_expedited_local, nr_expedited_remote;
-	__u64 nr_expedited_lost, nr_expedited_from_timer;
-	__u64 nr_highpri_queued;
-	__u32 test_error_cnt;
-	__u32 cpuperf_min, cpuperf_avg, cpuperf_max;
-	__u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
+	u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cid0;
+	u64 nr_dequeued, nr_ddsp_from_enq;
+	u64 nr_core_sched_execed;
+	u64 nr_expedited_local, nr_expedited_remote;
+	u64 nr_expedited_lost, nr_expedited_from_timer;
+	u64 nr_highpri_queued;
+	u32 test_error_cnt;
+	u32 cpuperf_min, cpuperf_avg, cpuperf_max;
+	u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
 	/* kernel-side runtime state */
-	__u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS];
-	__u64 core_sched_head_seqs[5];
-	__u64 core_sched_tail_seqs[5];
+	u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS];
+	u64 core_sched_head_seqs[5];
+	u64 core_sched_tail_seqs[5];
 
 	struct cpu_ctx cpu_ctxs[SCX_QMAP_MAX_CPUS];
 
-- 
2.54.0



* [PATCH sched_ext/for-7.3 05/32] sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (3 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 04/32] tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 06/32] sched_ext: Make scx_bpf_kick_cid() return void Tejun Heo
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Direct writes to p->scx.slice and p->scx.dsq_vtime bypass
scx_bpf_task_set_slice/dsq_vtime() and the authority checks they carry.
Those checks exist for sub-schedulers, which attach only through the
cid-form struct_ops, so the direct writes only need to be closed there.

Give sched_ext_ops_cid its own verifier ops that reject the two fields.
cid-form is a new interface with no legacy users, so there is no
compatibility to keep. The cpu-form keeps direct writes, and the deprecation
warning they carried is dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
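
The two access policies described in the commit message above reduce to a
"write lies entirely inside one field" test. The following userspace sketch
shows that test on a stand-in struct. struct toy_scx, its layout and all
toy_ helpers are hypothetical and only mimic the shape of the verifier
callbacks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the relevant task fields; not the kernel layout. */
struct toy_scx {
	uint64_t slice;
	uint64_t dsq_vtime;
	bool disallow;
};

/* Offset of the first byte past @member, mirroring the kernel's offsetofend(). */
#define toy_offsetofend(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* True iff the write [off, off + size) lies entirely within [start, end). */
static bool toy_write_within(size_t off, size_t size, size_t start, size_t end)
{
	return off >= start && off + size <= end;
}

/* cid-form policy: only the disallow field is writable. */
static bool toy_cid_form_write_ok(size_t off, size_t size)
{
	return toy_write_within(off, size,
				offsetof(struct toy_scx, disallow),
				toy_offsetofend(struct toy_scx, disallow));
}

/* cpu-form policy: slice and dsq_vtime are writable in addition. */
static bool toy_cpu_form_write_ok(size_t off, size_t size)
{
	if (toy_write_within(off, size,
			     offsetof(struct toy_scx, slice),
			     toy_offsetofend(struct toy_scx, slice)) ||
	    toy_write_within(off, size,
			     offsetof(struct toy_scx, dsq_vtime),
			     toy_offsetofend(struct toy_scx, dsq_vtime)))
		return true;
	return toy_cid_form_write_ok(off, size);
}
```

A write that straddles two fields fails both range tests, so it is rejected
under either policy.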
 kernel/sched/ext/ext.c | 44 ++++++++++++++++++++++++++++++------------
 1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index f4725698f5ef..4a93ed72e5aa 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -7003,6 +7003,21 @@ static bool bpf_scx_is_valid_access(int off, int size,
 	return btf_ctx_access(off, size, type, prog, info);
 }
 
+/* common to both forms: only scx.disallow is writable */
+static int bpf_scx_btf_struct_access_common(const struct bpf_reg_state *reg,
+					    int off, int size)
+{
+	const struct btf_type *t;
+
+	t = btf_type_by_id(reg->btf, reg->btf_id);
+	if (t == task_struct_type &&
+	    off >= offsetof(struct task_struct, scx.disallow) &&
+	    off + size <= offsetofend(struct task_struct, scx.disallow))
+		return SCALAR_VALUE;
+
+	return -EACCES;
+}
+
 static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
 				     const struct bpf_reg_state *reg, int off,
 				     int size)
@@ -7011,23 +7026,22 @@ static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
 
 	t = btf_type_by_id(reg->btf, reg->btf_id);
 	if (t == task_struct_type) {
-		/*
-		 * COMPAT: Will be removed in v6.23.
-		 */
 		if ((off >= offsetof(struct task_struct, scx.slice) &&
 		     off + size <= offsetofend(struct task_struct, scx.slice)) ||
 		    (off >= offsetof(struct task_struct, scx.dsq_vtime) &&
-		     off + size <= offsetofend(struct task_struct, scx.dsq_vtime))) {
-			pr_warn_ratelimited("sched_ext: Writing directly to p->scx.slice/dsq_vtime is deprecated, use scx_bpf_task_set_slice/dsq_vtime()\n");
-			return SCALAR_VALUE;
-		}
-
-		if (off >= offsetof(struct task_struct, scx.disallow) &&
-		    off + size <= offsetofend(struct task_struct, scx.disallow))
+		     off + size <= offsetofend(struct task_struct, scx.dsq_vtime)))
 			return SCALAR_VALUE;
 	}
 
-	return -EACCES;
+	return bpf_scx_btf_struct_access_common(reg, off, size);
+}
+
+/* cid-form rejects direct slice and dsq_vtime writes in favor of the kfuncs */
+static int bpf_scx_cid_btf_struct_access(struct bpf_verifier_log *log,
+					 const struct bpf_reg_state *reg, int off,
+					 int size)
+{
+	return bpf_scx_btf_struct_access_common(reg, off, size);
 }
 
 static const struct bpf_verifier_ops bpf_scx_verifier_ops = {
@@ -7036,6 +7050,12 @@ static const struct bpf_verifier_ops bpf_scx_verifier_ops = {
 	.btf_struct_access = bpf_scx_btf_struct_access,
 };
 
+static const struct bpf_verifier_ops bpf_scx_cid_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.is_valid_access = bpf_scx_is_valid_access,
+	.btf_struct_access = bpf_scx_cid_btf_struct_access,
+};
+
 static int bpf_scx_init_member(const struct btf_type *t,
 			       const struct btf_member *member,
 			       void *kdata, const void *udata)
@@ -7376,7 +7396,7 @@ static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
  * verified to match by the BUILD_BUG_ON checks in scx_init().
  */
 static struct bpf_struct_ops bpf_sched_ext_ops_cid = {
-	.verifier_ops = &bpf_scx_verifier_ops,
+	.verifier_ops = &bpf_scx_cid_verifier_ops,
 	.reg = bpf_scx_reg_cid,
 	.unreg = bpf_scx_unreg,
 	.check_member = bpf_scx_check_member,
-- 
2.54.0


* [PATCH sched_ext/for-7.3 06/32] sched_ext: Make scx_bpf_kick_cid() return void
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (4 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 05/32] sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched Tejun Heo
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

scx_bpf_kick_cid() returns an error code, but the value conveys nothing
actionable and no caller consumes it. The kick is asynchronous, so a
successful return only means it was queued. An invalid @cid is already
reported through scx_error() by scx_cid_to_cpu(), and a missing scheduler
leaves nothing to kick.

Make scx_bpf_kick_cid() return void to match scx_bpf_kick_cpu(). The
cid-form kfuncs are not in practical use yet, so the ABI change is safe.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
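Note, not part of the commit: a rough userspace model of why a return value
adds nothing here. The kick is only recorded for later delivery, and an invalid
cid latches an abort instead of being reported to the caller. struct toy_sched,
the toy_* helpers and the identity cid-to-cpu mapping are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CIDS 4

/* Hypothetical scheduler state: an abort latch plus per-cpu pending kicks. */
struct toy_sched {
	bool aborted;			/* stands in for scx_error() */
	bool kick_pending[TOY_NR_CIDS];
};

/* Returns the cpu for @cid, or -1 after latching the abort. */
static int toy_cid_to_cpu(struct toy_sched *sch, int cid)
{
	if (cid < 0 || cid >= TOY_NR_CIDS) {
		sch->aborted = true;
		return -1;
	}
	return cid;	/* identity mapping, for illustration only */
}

/* Nothing to return: the kick is only queued and errors are latched. */
static void toy_kick_cid(struct toy_sched *sch, int cid)
{
	int cpu;

	if (!sch)
		return;
	cpu = toy_cid_to_cpu(sch, cid);
	if (cpu < 0)
		return;
	assert(cpu < TOY_NR_CIDS);
	sch->kick_pending[cpu] = true;
}
```

A caller learns about a bad cid from the latched abort, not from the call site,
which mirrors the reasoning in the commit message.
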
 kernel/sched/ext/ext.c                   | 11 +++++------
 tools/sched_ext/include/scx/common.bpf.h |  2 +-
 2 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4a93ed72e5aa..13af43bca850 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -8541,10 +8541,10 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
  * @flags: %SCX_KICK_* flags
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
- * cid-addressed equivalent of scx_bpf_kick_cpu(). Return 0 on success,
- * -errno otherwise.
+ * cid-addressed equivalent of scx_bpf_kick_cpu(). An invalid @cid aborts the
+ * scheduler via scx_cid_to_cpu().
  */
-__bpf_kfunc s32 scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
+__bpf_kfunc void scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
 {
 	struct scx_sched *sch;
 	s32 cpu;
@@ -8552,12 +8552,11 @@ __bpf_kfunc s32 scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *
 	guard(rcu)();
 	sch = scx_prog_sched(aux);
 	if (unlikely(!sch))
-		return -ENODEV;
+		return;
 	cpu = scx_cid_to_cpu(sch, cid);
 	if (cpu < 0)
-		return cpu;
+		return;
 	scx_kick_cpu(sch, cpu, flags);
-	return 0;
 }
 
 /**
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index bd51986c4c42..e7b3ba491c5e 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -104,7 +104,7 @@ void scx_bpf_events(struct scx_event_stats *events, size_t events__sz) __ksym __
 s32 scx_bpf_cpu_to_cid(s32 cpu) __ksym __weak;
 s32 scx_bpf_cid_to_cpu(s32 cid) __ksym __weak;
 void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out) __ksym __weak;
-s32 scx_bpf_kick_cid(s32 cid, u64 flags) __ksym __weak;
+void scx_bpf_kick_cid(s32 cid, u64 flags) __ksym __weak;
 s32 scx_bpf_task_cid(const struct task_struct *p) __ksym __weak;
 s32 scx_bpf_this_cid(void) __ksym __weak;
 struct task_struct *scx_bpf_cid_curr(s32 cid) __ksym __weak;
-- 
2.54.0


* [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (5 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 06/32] sched_ext: Make scx_bpf_kick_cid() return void Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  9:02   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 08/32] sched_ext: Add ops.init_cids() to finalize the cid layout before init Tejun Heo
                   ` (24 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The kick machinery keeps its targets in rq->scx, which is shared by every
sched on the cpu, so a preempt kick carries no record of which scheduler
requested it.

A later patch needs preempt kicks scoped to the requesting scheduler so a
sub-scheduler can preempt only tasks in its own subtree. Move the kick masks
into the per-sched per-cpu scx_sched_pcpu and have scx_kick_cpu() link the
sched onto a per-cpu list (rq->scx.sched_pcpus_to_kick). The cpu's single
kick irq_work walks that list and kicks each sched's targets on its behalf,
so a kick stays attributed to its scheduler.

The SCX_KICK_WAIT sync set (cpus_to_sync, the kick_sync snapshot and the
balance-callback trigger) stays in rq->scx: the waiter is the cpu, not the
scheduler, and its only writers, the kick irq_work and the wait balance
callback, are cpu-local.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
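Note, not part of the commit: the per-sched masks plus the per-cpu pending list
can be sketched in userspace as below. This is a simplified model, assuming
plain bitmasks in place of cpumasks and a singly linked list in place of
list_head. All toy_* names are hypothetical, and locking and irq context are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LOG 16

/* Hypothetical per-sched, per-cpu kick state (cf. scx_sched_pcpu). */
struct toy_pcpu {
	int sched_id;
	unsigned long cpus_to_kick;	/* bit n set: kick cpu n */
	unsigned long cpus_to_preempt;	/* subset of cpus_to_kick */
	bool queued;			/* already on the pending list? */
	struct toy_pcpu *next;
};

/* Hypothetical per-cpu side: scheds with pending kicks on this cpu. */
struct toy_rq {
	struct toy_pcpu *head;
	struct toy_pcpu **tail;
};

struct toy_kick { int sched_id; int cpu; bool preempt; };

static void toy_rq_init(struct toy_rq *rq)
{
	rq->head = NULL;
	rq->tail = &rq->head;
}

/* Record the target in the sched's own masks and queue the sched once. */
static void toy_kick_cpu(struct toy_rq *rq, struct toy_pcpu *pcpu,
			 int cpu, bool preempt)
{
	assert(cpu >= 0 && cpu < (int)(8 * sizeof(unsigned long)));
	pcpu->cpus_to_kick |= 1UL << cpu;
	if (preempt)
		pcpu->cpus_to_preempt |= 1UL << cpu;
	if (!pcpu->queued) {
		pcpu->queued = true;
		pcpu->next = NULL;
		*rq->tail = pcpu;
		rq->tail = &pcpu->next;
	}
}

/* Drain the list; each logged kick carries the requesting sched's id. */
static int toy_kick_work(struct toy_rq *rq, struct toy_kick *log)
{
	struct toy_pcpu *pcpu = rq->head;
	int n = 0;

	toy_rq_init(rq);
	while (pcpu) {
		struct toy_pcpu *next = pcpu->next;

		pcpu->queued = false;
		pcpu->next = NULL;
		for (int cpu = 0; pcpu->cpus_to_kick && n < TOY_MAX_LOG; cpu++) {
			unsigned long bit = 1UL << cpu;

			if (!(pcpu->cpus_to_kick & bit))
				continue;
			log[n++] = (struct toy_kick){ pcpu->sched_id, cpu,
					(pcpu->cpus_to_preempt & bit) != 0 };
			pcpu->cpus_to_kick &= ~bit;
			pcpu->cpus_to_preempt &= ~bit;
		}
		pcpu = next;
	}
	return n;
}
```

Two scheds kicking the same cpu yield two log entries with distinct sched_id
values, which is the attribution a shared per-rq mask cannot provide.
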
 kernel/sched/ext/ext.c      | 106 +++++++++++++++++++++++++-----------
 kernel/sched/ext/internal.h |  13 +++++
 kernel/sched/sched.h        |   6 +-
 3 files changed, 90 insertions(+), 35 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 13af43bca850..8445e34e205f 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -4632,6 +4632,17 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		 */
 		WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
 
+		/*
+		 * Bypass blocks new kicks. Flush the kick irq_work so this
+		 * pcpu's to_kick_node is off the list before it is freed.
+		 */
+		irq_work_sync(&cpu_rq(cpu)->scx.kick_cpus_irq_work);
+		WARN_ON_ONCE(!list_empty(&pcpu->to_kick_node));
+		free_cpumask_var(pcpu->cpus_to_kick);
+		free_cpumask_var(pcpu->cpus_to_kick_if_idle);
+		free_cpumask_var(pcpu->cpus_to_preempt);
+		free_cpumask_var(pcpu->cpus_to_wait);
+
 		exit_dsq(scx_bypass_dsq(sch, cpu));
 	}
 
@@ -5975,6 +5986,7 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
 			 bool dump_all_tasks)
 {
 	struct rq *rq = cpu_rq(cpu);
+	struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
 	struct rq_flags rf;
 	struct task_struct *p;
 	struct seq_buf ns;
@@ -6007,18 +6019,18 @@ static void scx_dump_cpu(struct scx_sched *sch, struct seq_buf *s,
 	dump_line(&ns, "          curr=%s[%d] class=%ps",
 		  rq->curr->comm, rq->curr->pid,
 		  rq->curr->sched_class);
-	if (!cpumask_empty(rq->scx.cpus_to_kick))
+	if (!cpumask_empty(pcpu->cpus_to_kick))
 		dump_line(&ns, "  cpus_to_kick   : %*pb",
-			  cpumask_pr_args(rq->scx.cpus_to_kick));
-	if (!cpumask_empty(rq->scx.cpus_to_kick_if_idle))
+			  cpumask_pr_args(pcpu->cpus_to_kick));
+	if (!cpumask_empty(pcpu->cpus_to_kick_if_idle))
 		dump_line(&ns, "  idle_to_kick   : %*pb",
-			  cpumask_pr_args(rq->scx.cpus_to_kick_if_idle));
-	if (!cpumask_empty(rq->scx.cpus_to_preempt))
+			  cpumask_pr_args(pcpu->cpus_to_kick_if_idle));
+	if (!cpumask_empty(pcpu->cpus_to_preempt))
 		dump_line(&ns, "  cpus_to_preempt: %*pb",
-			  cpumask_pr_args(rq->scx.cpus_to_preempt));
-	if (!cpumask_empty(rq->scx.cpus_to_wait))
+			  cpumask_pr_args(pcpu->cpus_to_preempt));
+	if (!cpumask_empty(pcpu->cpus_to_wait))
 		dump_line(&ns, "  cpus_to_wait   : %*pb",
-			  cpumask_pr_args(rq->scx.cpus_to_wait));
+			  cpumask_pr_args(pcpu->cpus_to_wait));
 	if (!cpumask_empty(rq->scx.cpus_to_sync))
 		dump_line(&ns, "  cpus_to_sync   : %*pb",
 			  cpumask_pr_args(rq->scx.cpus_to_sync));
@@ -6310,8 +6322,17 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 	for_each_possible_cpu(cpu) {
 		struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
 
+		node = cpu_to_node(cpu);
 		pcpu->sch = sch;
 		INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node);
+		INIT_LIST_HEAD(&pcpu->to_kick_node);
+		if (!zalloc_cpumask_var_node(&pcpu->cpus_to_kick, GFP_KERNEL, node) ||
+		    !zalloc_cpumask_var_node(&pcpu->cpus_to_kick_if_idle, GFP_KERNEL, node) ||
+		    !zalloc_cpumask_var_node(&pcpu->cpus_to_preempt, GFP_KERNEL, node) ||
+		    !zalloc_cpumask_var_node(&pcpu->cpus_to_wait, GFP_KERNEL, node)) {
+			ret = -ENOMEM;
+			goto err_free_pcpu;
+		}
 	}
 
 	sch->helper = kthread_run_worker(0, "sched_ext_helper");
@@ -6456,6 +6477,14 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 err_stop_helper:
 	kthread_destroy_worker(sch->helper);
 err_free_pcpu:
+	for_each_possible_cpu(cpu) {
+		struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+		free_cpumask_var(pcpu->cpus_to_kick);
+		free_cpumask_var(pcpu->cpus_to_kick_if_idle);
+		free_cpumask_var(pcpu->cpus_to_preempt);
+		free_cpumask_var(pcpu->cpus_to_wait);
+	}
 	for_each_possible_cpu(cpu) {
 		if (cpu == bypass_fail_cpu)
 			break;
@@ -7469,7 +7498,8 @@ static bool can_skip_idle_kick(struct rq *rq)
 	return !is_idle_task(rq->curr) && !(rq->scx.flags & SCX_RQ_IN_BALANCE);
 }
 
-static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
+static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_rq,
+			 unsigned long *ksyncs)
 {
 	struct rq *rq = cpu_rq(cpu);
 	struct scx_rq *this_scx = &this_rq->scx;
@@ -7488,25 +7518,25 @@ static bool kick_one_cpu(s32 cpu, struct rq *this_rq, unsigned long *ksyncs)
 	 */
 	if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
 	    !sched_class_above(cur_class, &ext_sched_class)) {
-		if (cpumask_test_cpu(cpu, this_scx->cpus_to_preempt)) {
+		if (cpumask_test_cpu(cpu, pcpu->cpus_to_preempt)) {
 			if (cur_class == &ext_sched_class)
 				rq->curr->scx.slice = 0;
-			cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
+			cpumask_clear_cpu(cpu, pcpu->cpus_to_preempt);
 		}
 
-		if (cpumask_test_cpu(cpu, this_scx->cpus_to_wait)) {
+		if (cpumask_test_cpu(cpu, pcpu->cpus_to_wait)) {
 			if (cur_class == &ext_sched_class) {
 				cpumask_set_cpu(cpu, this_scx->cpus_to_sync);
 				ksyncs[cpu] = rq->scx.kick_sync;
 				should_wait = true;
 			}
-			cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+			cpumask_clear_cpu(cpu, pcpu->cpus_to_wait);
 		}
 
 		resched_curr(rq);
 	} else {
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_preempt);
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_wait);
+		cpumask_clear_cpu(cpu, pcpu->cpus_to_preempt);
+		cpumask_clear_cpu(cpu, pcpu->cpus_to_wait);
 	}
 
 	raw_spin_rq_unlock_irqrestore(rq, flags);
@@ -7533,6 +7563,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 	struct rq *this_rq = this_rq();
 	struct scx_rq *this_scx = &this_rq->scx;
 	struct scx_kick_syncs __rcu *ksyncs_pcpu = __this_cpu_read(scx_kick_syncs);
+	struct scx_sched_pcpu *pcpu, *tmp;
 	bool should_wait = false;
 	unsigned long *ksyncs;
 	s32 cpu;
@@ -7543,15 +7574,24 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 
 	ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs;
 
-	for_each_cpu(cpu, this_scx->cpus_to_kick) {
-		should_wait |= kick_one_cpu(cpu, this_rq, ksyncs);
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick);
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
-	}
+	/*
+	 * Walk scheds with pending kicks on this cpu. scx_kick_cpu() adds to
+	 * the list under local_irq_save() and only this irq_work consumes it.
+	 * A plain list without locking is sufficient.
+	 */
+	list_for_each_entry_safe(pcpu, tmp, &this_scx->sched_pcpus_to_kick, to_kick_node) {
+		list_del_init(&pcpu->to_kick_node);
 
-	for_each_cpu(cpu, this_scx->cpus_to_kick_if_idle) {
-		kick_one_cpu_if_idle(cpu, this_rq);
-		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
+		for_each_cpu(cpu, pcpu->cpus_to_kick) {
+			should_wait |= kick_one_cpu(cpu, pcpu, this_rq, ksyncs);
+			cpumask_clear_cpu(cpu, pcpu->cpus_to_kick);
+			cpumask_clear_cpu(cpu, pcpu->cpus_to_kick_if_idle);
+		}
+
+		for_each_cpu(cpu, pcpu->cpus_to_kick_if_idle) {
+			kick_one_cpu_if_idle(cpu, this_rq);
+			cpumask_clear_cpu(cpu, pcpu->cpus_to_kick_if_idle);
+		}
 	}
 
 	/*
@@ -7676,11 +7716,8 @@ void __init init_sched_ext_class(void)
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
 		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
 
-		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick, GFP_KERNEL, n));
-		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_kick_if_idle, GFP_KERNEL, n));
-		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_preempt, GFP_KERNEL, n));
-		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_wait, GFP_KERNEL, n));
 		BUG_ON(!zalloc_cpumask_var_node(&rq->scx.cpus_to_sync, GFP_KERNEL, n));
+		INIT_LIST_HEAD(&rq->scx.sched_pcpus_to_kick);
 		raw_spin_lock_init(&rq->scx.deferred_reenq_lock);
 		INIT_LIST_HEAD(&rq->scx.deferred_reenq_locals);
 		INIT_LIST_HEAD(&rq->scx.deferred_reenq_users);
@@ -8466,12 +8503,14 @@ __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime,
 
 void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 {
+	struct scx_sched_pcpu *pcpu;
 	struct rq *this_rq;
 	unsigned long irq_flags;
 
 	local_irq_save(irq_flags);
 
 	this_rq = this_rq();
+	pcpu = this_cpu_ptr(sch->pcpu);
 
 	/*
 	 * While bypassing for PM ops, IRQ handling may not be online which can
@@ -8485,6 +8524,9 @@ void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 	 * Actual kicking is bounced to kick_cpus_irq_workfn() to avoid nesting
 	 * rq locks. We can probably be smarter and avoid bouncing if called
 	 * from ops which don't hold a rq lock.
+	 *
+	 * The kick masks are owned by @sch->pcpu, so that a preempt kick can be
+	 * attributed to @sch.
 	 */
 	if (flags & SCX_KICK_IDLE) {
 		struct rq *target_rq = cpu_rq(cpu);
@@ -8499,16 +8541,18 @@ void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 			}
 			raw_spin_rq_unlock(target_rq);
 		}
-		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle);
+		cpumask_set_cpu(cpu, pcpu->cpus_to_kick_if_idle);
 	} else {
-		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick);
+		cpumask_set_cpu(cpu, pcpu->cpus_to_kick);
 
 		if (flags & SCX_KICK_PREEMPT)
-			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
+			cpumask_set_cpu(cpu, pcpu->cpus_to_preempt);
 		if (flags & SCX_KICK_WAIT)
-			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait);
+			cpumask_set_cpu(cpu, pcpu->cpus_to_wait);
 	}
 
+	if (list_empty(&pcpu->to_kick_node))
+		list_add_tail(&pcpu->to_kick_node, &this_rq->scx.sched_pcpus_to_kick);
 	irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
 out:
 	local_irq_restore(irq_flags);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index f9fe7c6ebc4b..35ba79bba597 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1126,6 +1126,19 @@ struct scx_sched_pcpu {
 	struct scx_sched	*sch;
 	u64			flags;	/* protected by rq lock */
 
+	/*
+	 * Kick state owned by this cpu for this sched. scx_kick_cpu() records
+	 * targets here and links @to_kick_node onto the cpu's
+	 * rq->scx.sched_pcpus_to_kick. The cpu's single kick irq_work walks
+	 * that list and kicks each sched's targets on its behalf. Per-sched so
+	 * a kick stays attributed to its scheduler.
+	 */
+	cpumask_var_t		cpus_to_kick;
+	cpumask_var_t		cpus_to_kick_if_idle;
+	cpumask_var_t		cpus_to_preempt;
+	cpumask_var_t		cpus_to_wait;
+	struct list_head	to_kick_node;
+
 	/*
 	 * The event counters are in a per-CPU variable to minimize the
 	 * accounting overhead. A system-wide view on the event counter is
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 56acf502ba26..7da25f918382 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -805,14 +805,12 @@ struct scx_rq {
 	u32			flags;
 	u32			nr_immed;		/* ENQ_IMMED tasks on local_dsq */
 	u64			clock;			/* current per-rq clock -- see scx_bpf_now() */
-	cpumask_var_t		cpus_to_kick;
-	cpumask_var_t		cpus_to_kick_if_idle;
-	cpumask_var_t		cpus_to_preempt;
-	cpumask_var_t		cpus_to_wait;
 	cpumask_var_t		cpus_to_sync;
 	bool			kick_sync_pending;
 	unsigned long		kick_sync;
 
+	struct list_head	sched_pcpus_to_kick;	/* see kick_cpus_irq_workfn() */
+
 	struct task_struct	*sub_dispatch_prev;
 
 	raw_spinlock_t		deferred_reenq_lock;
-- 
2.54.0


* [PATCH sched_ext/for-7.3 08/32] sched_ext: Add ops.init_cids() to finalize the cid layout before init
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (6 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 09/32] sched_ext: Add CID sharding Tejun Heo
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A cid-form scheduler that calls scx_bpf_cid_override() to install a custom
cid layout can only do so from ops.init(). Enable-path setup that depends on
the cid layout thus has to run after ops.init(), and ops.init() itself can't
use anything derived from the final layout, which turned out to be too
restrictive.

Add an ops.init_cids() callback dedicated to finalizing the cid layout. It
runs before the rest of the enable-path setup, so the final layout is in
effect for everything that follows including ops.init(), which now runs
after the arena pool and cmask scratch allocations.

scx_bpf_cid_override() is restricted to ops.init_cids() at load time. It
sits in a kfunc set gated by SCX_KF_ALLOW_INIT_CIDS, a flag set only on the
init_cids op, so the verifier rejects a call from any other context. The
runtime root-only check is dropped as ops.init_cids() only runs during root
enable.

The qmap demo moves its override into a dedicated qmap_init_cids() and,
while at it, introduces an enum for the cid override modes instead of
hard-coded integers.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
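Note, not part of the commit: the load-time gating described above reduces to
intersecting the contexts an op grants with the context a kfunc requires. The
sketch below models that in userspace. The toy_* enums and tables are
hypothetical, each kfunc belongs to a single set here, and -EACCES as the
rejection value is an assumption.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical op indices and kfunc ids, for illustration only. */
enum toy_op { TOY_OP_INIT_CIDS, TOY_OP_INIT, TOY_OP_EXIT, TOY_NR_OPS };
enum toy_kfunc { TOY_KF_CID_OVERRIDE, TOY_KF_CREATE_DSQ, TOY_NR_KFUNCS };

enum toy_allow_flags {
	TOY_ALLOW_UNLOCKED	= 1 << 0,
	TOY_ALLOW_INIT_CIDS	= 1 << 1,
};

/* Contexts each op grants. Only init_cids grants INIT_CIDS. */
static const unsigned int toy_op_allow[TOY_NR_OPS] = {
	[TOY_OP_INIT_CIDS]	= TOY_ALLOW_UNLOCKED | TOY_ALLOW_INIT_CIDS,
	[TOY_OP_INIT]		= TOY_ALLOW_UNLOCKED,
	[TOY_OP_EXIT]		= TOY_ALLOW_UNLOCKED,
};

/* Context each kfunc requires. */
static const unsigned int toy_kfunc_needs[TOY_NR_KFUNCS] = {
	[TOY_KF_CID_OVERRIDE]	= TOY_ALLOW_INIT_CIDS,
	[TOY_KF_CREATE_DSQ]	= TOY_ALLOW_UNLOCKED,
};

/* Load-time check: 0 allows the call, -EACCES rejects the program. */
static int toy_context_filter(enum toy_op op, enum toy_kfunc kfunc)
{
	assert((int)op < TOY_NR_OPS && (int)kfunc < TOY_NR_KFUNCS);
	return (toy_op_allow[op] & toy_kfunc_needs[kfunc]) ? 0 : -EACCES;
}
```

In this model the override kfunc passes only from TOY_OP_INIT_CIDS, so a call
from the init op is rejected before the program ever runs.
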
 kernel/sched/ext/cid.c         | 25 ++++++++-----------
 kernel/sched/ext/cid.h         |  2 +-
 kernel/sched/ext/ext.c         | 41 +++++++++++++++++++++++--------
 kernel/sched/ext/internal.h    | 16 ++++++++++---
 tools/sched_ext/scx_qmap.bpf.c | 44 +++++++++++++++++++++-------------
 tools/sched_ext/scx_qmap.c     | 12 +++++-----
 tools/sched_ext/scx_qmap.h     |  8 +++++++
 7 files changed, 96 insertions(+), 52 deletions(-)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index 4b7279ca740f..654bd0f2af81 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -13,10 +13,10 @@
 /*
  * cid tables.
  *
- * Pointers are published once on first enable and never revoked. The default
- * mapping is populated before ops.init() runs; scx_bpf_cid_override() commits
- * before it returns. As long as the BPF scheduler only uses the tables from
- * those points onward, it sees a consistent view.
+ * Pointers are allocated on first enable and never freed. During root enable,
+ * the default mapping is populated and then ops.init_cids() is called which can
+ * use scx_bpf_cid_override() to change the mapping. The mapping stays stable
+ * until the root is disabled.
  */
 s16 *scx_cid_to_cpu_tbl;
 s16 *scx_cpu_to_cid_tbl;
@@ -282,7 +282,7 @@ __bpf_kfunc_start_defs();
  * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
- * May only be called from ops.init() of the root scheduler. Replace the
+ * May only be called from ops.init_cids() of the root scheduler. Replace the
  * topology-probed cid mapping with the caller-provided one. Each possible cpu
  * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
  * On invalid input, trigger scx_error() to abort the scheduler.
@@ -309,11 +309,6 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
 		return;
 	}
 
-	if (scx_parent(sch)) {
-		scx_error(sch, "scx_bpf_cid_override() only allowed from root sched");
-		return;
-	}
-
 	if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
 		scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
 			  nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
@@ -645,13 +640,13 @@ __bpf_kfunc void scx_bpf_cid_topo(s32 cid, struct scx_cid_topo *out__uninit,
 
 __bpf_kfunc_end_defs();
 
-BTF_KFUNCS_START(scx_kfunc_ids_init)
+BTF_KFUNCS_START(scx_kfunc_ids_init_cids)
 BTF_ID_FLAGS(func, scx_bpf_cid_override, KF_IMPLICIT_ARGS | KF_SLEEPABLE)
-BTF_KFUNCS_END(scx_kfunc_ids_init)
+BTF_KFUNCS_END(scx_kfunc_ids_init_cids)
 
-static const struct btf_kfunc_id_set scx_kfunc_set_init = {
+static const struct btf_kfunc_id_set scx_kfunc_set_init_cids = {
 	.owner	= THIS_MODULE,
-	.set	= &scx_kfunc_ids_init,
+	.set	= &scx_kfunc_ids_init_cids,
 	.filter	= scx_kfunc_context_filter,
 };
 
@@ -668,7 +663,7 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
 
 int scx_cid_kfunc_init(void)
 {
-	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init) ?:
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init_cids) ?:
 		register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_cid) ?:
 		register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &scx_kfunc_set_cid) ?:
 		register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL, &scx_kfunc_set_cid);
diff --git a/kernel/sched/ext/cid.h b/kernel/sched/ext/cid.h
index 54b10df32fd5..cd0d4b9f1088 100644
--- a/kernel/sched/ext/cid.h
+++ b/kernel/sched/ext/cid.h
@@ -51,7 +51,7 @@ struct scx_sched;
 extern s16 *scx_cid_to_cpu_tbl;
 extern s16 *scx_cpu_to_cid_tbl;
 extern struct scx_cid_topo *scx_cid_topo;
-extern struct btf_id_set8 scx_kfunc_ids_init;
+extern struct btf_id_set8 scx_kfunc_ids_init_cids;
 
 void scx_cmask_clear(struct scx_cmask *m);
 void scx_cmask_fill(struct scx_cmask *m);
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 8445e34e205f..29bddfb52243 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -6692,15 +6692,19 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 	scx_idle_enable(ops);
 
-	if (sch->ops.init) {
-		ret = SCX_CALL_OP_RET(sch, init, NULL);
+	/*
+	 * A cid-form scheduler finalizes its cid layout in ops.init_cids(),
+	 * which may call scx_bpf_cid_override(). Run it before ops.init() so
+	 * the final layout is in effect.
+	 */
+	if (sch->is_cid_type && sch->ops_cid.init_cids) {
+		ret = SCX_CALL_OP_RET(sch, init_cids, NULL);
 		if (ret) {
-			ret = scx_ops_sanitize_err(sch, "init", ret);
+			ret = scx_ops_sanitize_err(sch, "init_cids", ret);
 			cpus_read_unlock();
-			scx_error(sch, "ops.init() failed (%d)", ret);
+			scx_error(sch, "ops.init_cids() failed (%d)", ret);
 			goto err_disable;
 		}
-		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
 	}
 
 	ret = scx_arena_pool_init(sch);
@@ -6715,6 +6719,17 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		goto err_disable;
 	}
 
+	if (sch->ops.init) {
+		ret = SCX_CALL_OP_RET(sch, init, NULL);
+		if (ret) {
+			ret = scx_ops_sanitize_err(sch, "init", ret);
+			cpus_read_unlock();
+			scx_error(sch, "ops.init() failed (%d)", ret);
+			goto err_disable;
+		}
+		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
+	}
+
 	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
 		if (((void (**)(void))ops)[i])
 			set_bit(i, sch->has_op);
@@ -7151,6 +7166,7 @@ static int bpf_scx_check_member(const struct btf_type *t,
 #endif
 	case offsetof(struct sched_ext_ops, cpu_online):
 	case offsetof(struct sched_ext_ops, cpu_offline):
+	case offsetof(struct sched_ext_ops, init_cids):
 	case offsetof(struct sched_ext_ops, init):
 	case offsetof(struct sched_ext_ops, exit):
 	case offsetof(struct sched_ext_ops, sub_attach):
@@ -7309,6 +7325,7 @@ static s32 sched_ext_ops__sub_attach(struct scx_sub_attach_args *args) { return
 static void sched_ext_ops__sub_detach(struct scx_sub_detach_args *args) {}
 static void sched_ext_ops__cpu_online(s32 cpu) {}
 static void sched_ext_ops__cpu_offline(s32 cpu) {}
+static s32 sched_ext_ops__init_cids(void) { return -EINVAL; }
 static s32 sched_ext_ops__init(void) { return -EINVAL; }
 static void sched_ext_ops__exit(struct scx_exit_info *info) {}
 static void sched_ext_ops__dump(struct scx_dump_ctx *ctx) {}
@@ -7350,6 +7367,7 @@ static struct sched_ext_ops __bpf_ops_sched_ext_ops = {
 	.sub_detach		= sched_ext_ops__sub_detach,
 	.cpu_online		= sched_ext_ops__cpu_online,
 	.cpu_offline		= sched_ext_ops__cpu_offline,
+	.init_cids		= sched_ext_ops__init_cids,
 	.init			= sched_ext_ops__init,
 	.exit			= sched_ext_ops__exit,
 	.dump			= sched_ext_ops__dump,
@@ -7412,6 +7430,7 @@ static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
 	.sub_detach		= sched_ext_ops__sub_detach,
 	.cid_online		= sched_ext_ops__cpu_online,
 	.cid_offline		= sched_ext_ops__cpu_offline,
+	.init_cids		= sched_ext_ops__init_cids,
 	.init			= sched_ext_ops__init,
 	.exit			= sched_ext_ops__exit,
 	.dump			= sched_ext_ops__dump,
@@ -9661,7 +9680,7 @@ BTF_KFUNCS_END(scx_kfunc_ids_cpu_only)
  */
 enum scx_kf_allow_flags {
 	SCX_KF_ALLOW_UNLOCKED		= 1 << 0,
-	SCX_KF_ALLOW_INIT		= 1 << 1,
+	SCX_KF_ALLOW_INIT_CIDS		= 1 << 1,
 	SCX_KF_ALLOW_CPU_RELEASE	= 1 << 2,
 	SCX_KF_ALLOW_DISPATCH		= 1 << 3,
 	SCX_KF_ALLOW_ENQUEUE		= 1 << 4,
@@ -9693,7 +9712,8 @@ static const u32 scx_kf_allow_flags[] = {
 	[SCX_OP_IDX(sub_detach)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(cpu_online)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(cpu_offline)]	= SCX_KF_ALLOW_UNLOCKED,
-	[SCX_OP_IDX(init)]		= SCX_KF_ALLOW_UNLOCKED | SCX_KF_ALLOW_INIT,
+	[SCX_OP_IDX(init_cids)]		= SCX_KF_ALLOW_UNLOCKED | SCX_KF_ALLOW_INIT_CIDS,
+	[SCX_OP_IDX(init)]		= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(exit)]		= SCX_KF_ALLOW_UNLOCKED,
 };
 
@@ -9708,7 +9728,7 @@ static const u32 scx_kf_allow_flags[] = {
 int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 {
 	bool in_unlocked = btf_id_set8_contains(&scx_kfunc_ids_unlocked, kfunc_id);
-	bool in_init = btf_id_set8_contains(&scx_kfunc_ids_init, kfunc_id);
+	bool in_init_cids = btf_id_set8_contains(&scx_kfunc_ids_init_cids, kfunc_id);
 	bool in_select_cpu = btf_id_set8_contains(&scx_kfunc_ids_select_cpu, kfunc_id);
 	bool in_enqueue = btf_id_set8_contains(&scx_kfunc_ids_enqueue_dispatch, kfunc_id);
 	bool in_dispatch = btf_id_set8_contains(&scx_kfunc_ids_dispatch, kfunc_id);
@@ -9719,7 +9739,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	u32 moff, flags;
 
 	/* Not an SCX kfunc - allow. */
-	if (!(in_unlocked || in_init || in_select_cpu || in_enqueue || in_dispatch ||
+	if (!(in_unlocked || in_init_cids || in_select_cpu || in_enqueue || in_dispatch ||
 	      in_cpu_release || in_idle || in_any))
 		return 0;
 
@@ -9771,7 +9791,7 @@ int scx_kfunc_context_filter(const struct bpf_prog *prog, u32 kfunc_id)
 
 	if ((flags & SCX_KF_ALLOW_UNLOCKED) && in_unlocked)
 		return 0;
-	if ((flags & SCX_KF_ALLOW_INIT) && in_init)
+	if ((flags & SCX_KF_ALLOW_INIT_CIDS) && in_init_cids)
 		return 0;
 	if ((flags & SCX_KF_ALLOW_CPU_RELEASE) && in_cpu_release)
 		return 0;
@@ -9827,6 +9847,7 @@ static int __init scx_init(void)
 	CID_OFFSET_MATCH(dump_task, dump_task);
 	CID_OFFSET_MATCH(sub_attach, sub_attach);
 	CID_OFFSET_MATCH(sub_detach, sub_detach);
+	CID_OFFSET_MATCH(init_cids, init_cids);
 	CID_OFFSET_MATCH(init, init);
 	CID_OFFSET_MATCH(exit, exit);
 #ifdef CONFIG_EXT_GROUP_SCHED
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 35ba79bba597..c8c3c6cb647d 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -780,9 +780,18 @@ struct sched_ext_ops {
 	void (*cpu_offline)(s32 cpu);
 
 	/*
-	 * All CPU hotplug ops must come before ops.init().
+	 * All CPU hotplug ops must come before ops.init_cids().
 	 */
 
+	/**
+	 * @init_cids: Finalize the cid layout (cid-form only)
+	 *
+	 * Runs after the default cid layout is built, before ops.init(). A
+	 * cid-form scheduler may call scx_bpf_cid_override() here for a custom
+	 * layout. Ignored for cpu-form schedulers.
+	 */
+	s32 (*init_cids)(void);
+
 	/**
 	 * @init: Initialize the BPF scheduler
 	 */
@@ -958,6 +967,7 @@ struct sched_ext_ops_cid {
 	void (*sub_detach)(struct scx_sub_detach_args *args);
 	void (*cid_online)(s32 cid);
 	void (*cid_offline)(s32 cid);
+	s32 (*init_cids)(void);
 	s32 (*init)(void);
 	void (*exit)(struct scx_exit_info *info);
 
@@ -982,8 +992,8 @@ enum scx_opi {
 	SCX_OPI_NORMAL_BEGIN		= 0,
 	SCX_OPI_NORMAL_END		= SCX_OP_IDX(cpu_online),
 	SCX_OPI_CPU_HOTPLUG_BEGIN	= SCX_OP_IDX(cpu_online),
-	SCX_OPI_CPU_HOTPLUG_END		= SCX_OP_IDX(init),
-	SCX_OPI_END			= SCX_OP_IDX(init),
+	SCX_OPI_CPU_HOTPLUG_END		= SCX_OP_IDX(init_cids),
+	SCX_OPI_END			= SCX_OP_IDX(init_cids),
 };
 
 /*
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index fd9a82a67627..2df7c53992dc 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -54,13 +54,9 @@ const volatile u32 max_tasks;
 
 /*
  * Optional cid-override test harness. When cid_override_mode is non-zero,
- * qmap_init() calls scx_bpf_cid_override() with the caller-supplied
- * cpu_to_cid array to exercise the kfunc's acceptance and error paths.
- *
- *   0 = disabled
- *   1 = valid reverse mapping
- *   2 = invalid: duplicate cid assignment
- *   3 = invalid: out-of-range cid
+ * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied
+ * cpu_to_cid array to exercise the kfunc's acceptance and error paths. See enum
+ * qmap_cid_override for the modes.
  */
 const volatile u32 cid_override_mode;
 /*
@@ -1067,6 +1063,29 @@ static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer)
 	return 0;
 }
 
+/*
+ * Custom cid layout for the cid-override test. On invalid input the kfunc
+ * scx_error()s and aborts the scheduler.
+ */
+s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init_cids)
+{
+	u32 nr_cpu_ids = scx_bpf_nr_cpu_ids();
+
+	if (!cid_override_mode)
+		return 0;
+
+	/* bound the count so the verifier accepts cpu_to_cid's mem/len pair */
+	if (nr_cpu_ids > SCX_QMAP_MAX_CPUS) {
+		scx_bpf_error("nr_cpu_ids=%u exceeds SCX_QMAP_MAX_CPUS=%d",
+			      nr_cpu_ids, SCX_QMAP_MAX_CPUS);
+		return -EINVAL;
+	}
+
+	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
+			     nr_cpu_ids * sizeof(s32));
+	return 0;
+}
+
 s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 {
 	u8 __arena *slab;
@@ -1089,16 +1108,6 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 		return -EINVAL;
 	}
 
-	/*
-	 * cid-override test hook. Must run before anything that reads the
-	 * cid space (scx_bpf_nr_cids, cmask_init, etc.). On invalid input,
-	 * the kfunc calls scx_error() which aborts the scheduler.
-	 */
-	if (cid_override_mode) {
-		scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
-				     nr_cpu_ids * sizeof(s32));
-	}
-
 	/*
 	 * Allocate the task_ctx slab in arena and thread the entire slab onto
 	 * the free list. max_tasks is set by userspace before load. Each entry
@@ -1235,6 +1244,7 @@ SCX_OPS_CID_DEFINE(qmap_ops,
 	       .cgroup_set_bandwidth	= (void *)qmap_cgroup_set_bandwidth,
 	       .sub_attach		= (void *)qmap_sub_attach,
 	       .sub_detach		= (void *)qmap_sub_detach,
+	       .init_cids		= (void *)qmap_init_cids,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
 	       .timeout_ms		= 5000U,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index f1eaebcab5dc..c0b5cab579d6 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -157,11 +157,11 @@ int main(int argc, char **argv)
 			u32 mode, i;
 
 			if (!strcmp(optarg, "shuffle"))
-				mode = 1;
+				mode = QMAP_CID_OVR_SHUFFLE;
 			else if (!strcmp(optarg, "bad-dup"))
-				mode = 2;
+				mode = QMAP_CID_OVR_BAD_DUP;
 			else if (!strcmp(optarg, "bad-range"))
-				mode = 3;
+				mode = QMAP_CID_OVR_BAD_RANGE;
 			else {
 				fprintf(stderr, "unknown cid-override mode '%s'\n", optarg);
 				return 1;
@@ -170,14 +170,14 @@ int main(int argc, char **argv)
 
 			/* shuffle: reversed cpu_to_cid, bad-dup: dup cid 0, bad-range: identity */
 			for (i = 0; i < nr_cpus; i++) {
-				if (mode == 1)
+				if (mode == QMAP_CID_OVR_SHUFFLE)
 					skel->bss->cid_override_cpu_to_cid[i] = nr_cpus - 1 - i;
 				else
 					skel->bss->cid_override_cpu_to_cid[i] = i;
 			}
-			if (mode == 2 && nr_cpus >= 2)
+			if (mode == QMAP_CID_OVR_BAD_DUP && nr_cpus >= 2)
 				skel->bss->cid_override_cpu_to_cid[1] = 0;
-			if (mode == 3)
+			if (mode == QMAP_CID_OVR_BAD_RANGE)
 				skel->bss->cid_override_cpu_to_cid[0] = (s32)nr_cpus;
 			break;
 		}
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 808237540f5c..3bcc3579839d 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -27,6 +27,14 @@
  */
 #define SCX_QMAP_MAX_CPUS	1024
 
+/* -C cid-override test modes. Selects cid_override_mode in scx_qmap.bpf.c. */
+enum qmap_cid_override {
+	QMAP_CID_OVR_OFF	= 0,	/* disabled */
+	QMAP_CID_OVR_SHUFFLE	= 1,	/* valid reversed cpu->cid mapping */
+	QMAP_CID_OVR_BAD_DUP	= 2,	/* invalid: duplicate cid assignment */
+	QMAP_CID_OVR_BAD_RANGE	= 3,	/* invalid: out-of-range cid */
+};
+
 struct cpu_ctx {
 	u64 dsp_idx;		/* dispatch index */
 	u64 dsp_cnt;		/* remaining count */
-- 
2.54.0



* [PATCH sched_ext/for-7.3 09/32] sched_ext: Add CID sharding
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (7 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 08/32] sched_ext: Add ops.init_cids() to finalize the cid layout before init Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override() Tejun Heo
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Sub-sched operations need a scalable locking / work domain smaller than the
whole cid space. Carve the cid space into topology-respecting shards: each
shard is a contiguous cid range that stays within one LLC, and LLCs larger
than the per-shard cap (default 24 cids, configurable via
ops.cid_shard_size) split into enough shards to fit. A hard cap of
SCX_CID_SHARD_MAX_CPUS (512) bounds shard size even when a scheduler
requests a larger ops.cid_shard_size.
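
As a rough illustration of the per-LLC arithmetic described above, here
is a userspace sketch that mirrors what calc_shard_layout() computes. The
names llc_shard_layout, struct shard_layout and SHARD_MAX_CPUS are made
up for this example and are not part of the patch; the kernel code takes
a cpumask and counts cpus and cores itself.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative copy of SCX_CID_SHARD_MAX_CPUS from the patch. */
#define SHARD_MAX_CPUS 512u

/* Hypothetical result type, not a kernel structure. */
struct shard_layout {
	uint32_t nr_shards;       /* shards carved out of this LLC */
	uint32_t cores_per_shard; /* cores in each "small" shard */
	uint32_t nr_large;        /* leading shards that get one extra core */
};

static uint32_t div_round_up(uint32_t n, uint32_t d)
{
	return (n + d - 1) / d;
}

/* Split one LLC with @nr_cpus cpus on @nr_cores cores into shards. */
static struct shard_layout llc_shard_layout(uint32_t nr_cpus, uint32_t nr_cores,
					    uint32_t shard_size)
{
	struct shard_layout l;
	uint32_t by_size, by_cap;

	assert(nr_cpus > 0 && nr_cores > 0 && shard_size > 0);

	/* enough shards to honor both the target size and the hard cap */
	by_size = div_round_up(nr_cpus, shard_size);
	by_cap = div_round_up(nr_cpus, SHARD_MAX_CPUS);
	l.nr_shards = by_size > by_cap ? by_size : by_cap;

	/* spread cores evenly; the remainder goes to the first shards */
	l.cores_per_shard = nr_cores / l.nr_shards;
	l.nr_large = nr_cores % l.nr_shards;
	return l;
}
```

For an LLC with 64 cpus on 32 two-thread cores and the default size of
24, llc_shard_layout(64, 32, 24) yields 3 shards: two with 11 cores and
one with 10, i.e. 22, 22 and 20 cids.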

No-topo cids pack into their own shards so every cid has a shard assignment.

Also build scx_cid_shard_ranges[] for O(1) shard-to-cid-range lookup and
scx_shard_node[] so callers can size or place work by NUMA without walking
cids. Auto-built shards inherit their LLC's node. No-topo shards carry
NUMA_NO_NODE.
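
The range table can be derived in one pass over the cid-to-shard map
because shards are contiguous by construction. The following is a
userspace sketch of that pass under that assumption; struct shard_range
and build_shard_ranges are illustrative names standing in for struct
scx_cid_shard and the loop at the end of scx_cid_init().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for struct scx_cid_shard. */
struct shard_range {
	int32_t base_cid; /* first cid of the shard */
	int32_t nr_cids;  /* number of cids in the shard */
};

/*
 * Fill @ranges[0..nr_shards) from @cid_to_shard[0..nr_cids). Assumes every
 * shard is a contiguous run of cids, so the first cid seen for a shard is
 * its base.
 */
static void build_shard_ranges(const int32_t *cid_to_shard, int32_t nr_cids,
			       struct shard_range *ranges, int32_t nr_shards)
{
	int32_t cid;

	memset(ranges, 0, (size_t)nr_shards * sizeof(*ranges));

	for (cid = 0; cid < nr_cids; cid++) {
		int32_t sidx = cid_to_shard[cid];

		assert(sidx >= 0 && sidx < nr_shards);
		if (ranges[sidx].nr_cids == 0)
			ranges[sidx].base_cid = cid;
		ranges[sidx].nr_cids++;
	}
}
```

After the pass, looking up the cids of shard N is a single array read:
they are [ranges[N].base_cid, ranges[N].base_cid + ranges[N].nr_cids).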

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/cid.c      | 132 ++++++++++++++++++++++++++++++++++--
 kernel/sched/ext/cid.h      |   4 ++
 kernel/sched/ext/ext.c      |   4 ++
 kernel/sched/ext/internal.h |  13 ++++
 kernel/sched/ext/types.h    |  35 ++++++++--
 5 files changed, 176 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index 654bd0f2af81..9d75b9311978 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -18,13 +18,17 @@
  * use scx_bpf_cid_override() to change the mapping. The mapping stays stable
  * until the root is disabled.
  */
+u32 scx_nr_cid_shards;
 s16 *scx_cid_to_cpu_tbl;
 s16 *scx_cpu_to_cid_tbl;
+s32 *scx_cid_to_shard;
+s32 *scx_shard_node;
+struct scx_cid_shard *scx_cid_shard_ranges;
 struct scx_cid_topo *scx_cid_topo;
 
 #define SCX_CID_TOPO_NEG	(struct scx_cid_topo) {				\
 	.core_cid = -1, .core_idx = -1, .llc_cid = -1, .llc_idx = -1,		\
-	.node_cid = -1, .node_idx = -1,						\
+	.node_cid = -1, .node_idx = -1, .shard_cid = -1, .shard_idx = -1,	\
 }
 
 /*
@@ -43,11 +47,40 @@ static const struct cpumask *cpu_llc_mask(int cpu, struct cpumask *fallbacks)
 	return &ci->info_list[ci->num_leaves - 1].shared_cpu_map;
 }
 
+/*
+ * Compute per-LLC shard layout. Each shard holds at most @shard_size cids, and
+ * in any case no more than SCX_CID_SHARD_MAX_CPUS. Cores are spread as evenly
+ * as possible across shards so cpu count is balanced: the first *@nr_large_p
+ * shards get (*@cores_per_shard_p + 1) cores, the rest get *@cores_per_shard_p.
+ */
+static void calc_shard_layout(const struct cpumask *llc_cpus, u32 shard_size,
+			      u32 *cores_per_shard_p, u32 *nr_large_p)
+{
+	u32 nr_cores = 0, nr_cpus = 0, nr_shards;
+	int cpu;
+
+	for_each_cpu(cpu, llc_cpus) {
+		nr_cpus++;
+		if (cpumask_first(topology_sibling_cpumask(cpu)) == cpu)
+			nr_cores++;
+	}
+
+	nr_shards = max_t(u32, 1, DIV_ROUND_UP(nr_cpus, shard_size));
+	nr_shards = max_t(u32, nr_shards,
+			  DIV_ROUND_UP(nr_cpus, SCX_CID_SHARD_MAX_CPUS));
+
+	*cores_per_shard_p = nr_cores / nr_shards;
+	*nr_large_p = nr_cores % nr_shards;
+}
+
 /* Allocate the cid tables once on first enable; never freed. */
 static s32 scx_cid_arrays_alloc(void)
 {
 	u32 npossible = num_possible_cpus();
 	s16 *cid_to_cpu, *cpu_to_cid;
+	s32 *cid_to_shard;
+	s32 *shard_node;
+	struct scx_cid_shard *cid_shard_ranges;
 	struct scx_cid_topo *cid_topo;
 
 	if (scx_cid_to_cpu_tbl)
@@ -55,17 +88,27 @@ static s32 scx_cid_arrays_alloc(void)
 
 	cid_to_cpu = kzalloc_objs(*scx_cid_to_cpu_tbl, npossible, GFP_KERNEL);
 	cpu_to_cid = kzalloc_objs(*scx_cpu_to_cid_tbl, nr_cpu_ids, GFP_KERNEL);
+	cid_to_shard = kzalloc_objs(*scx_cid_to_shard, npossible, GFP_KERNEL);
+	shard_node = kmalloc_objs(*scx_shard_node, npossible, GFP_KERNEL);
+	cid_shard_ranges = kzalloc_objs(*scx_cid_shard_ranges, npossible, GFP_KERNEL);
 	cid_topo = kmalloc_objs(*scx_cid_topo, npossible, GFP_KERNEL);
 
-	if (!cid_to_cpu || !cpu_to_cid || !cid_topo) {
+	if (!cid_to_cpu || !cpu_to_cid || !cid_to_shard || !shard_node ||
+	    !cid_shard_ranges || !cid_topo) {
 		kfree(cid_to_cpu);
 		kfree(cpu_to_cid);
+		kfree(cid_to_shard);
+		kfree(shard_node);
+		kfree(cid_shard_ranges);
 		kfree(cid_topo);
 		return -ENOMEM;
 	}
 
 	WRITE_ONCE(scx_cid_to_cpu_tbl, cid_to_cpu);
 	WRITE_ONCE(scx_cpu_to_cid_tbl, cpu_to_cid);
+	WRITE_ONCE(scx_cid_to_shard, cid_to_shard);
+	WRITE_ONCE(scx_shard_node, shard_node);
+	WRITE_ONCE(scx_cid_shard_ranges, cid_shard_ranges);
 	WRITE_ONCE(scx_cid_topo, cid_topo);
 	return 0;
 }
@@ -90,17 +133,29 @@ s32 scx_cid_init(struct scx_sched *sch)
 	cpumask_var_t online_no_topo __free(free_cpumask_var) = CPUMASK_VAR_NULL;
 	u32 next_cid = 0;
 	s32 next_node_idx = 0, next_llc_idx = 0, next_core_idx = 0;
-	s32 cpu, ret;
+	s32 next_shard_idx = 0;
+	u32 shard_size, max_cids;
+	u32 notopo_in_shard;
+	s32 notopo_shard_cid, notopo_shard_idx;
+	s32 cpu, cid, si, ret;
 
 	/* CMASK_MAX_WORDS in cid.bpf.h covers NR_CPUS up to 8192 */
 	BUILD_BUG_ON(NR_CPUS > 8192);
 
 	lockdep_assert_cpus_held();
 
+	shard_size = sch->ops.cid_shard_size ?: SCX_CID_SHARD_SIZE_DFL;
+	max_cids = min_t(u32, shard_size, SCX_CID_SHARD_MAX_CPUS);
+
 	ret = scx_cid_arrays_alloc();
 	if (ret)
 		return ret;
 
+	/* clear shard ranges and reset shard_node for repopulate */
+	memset(scx_cid_shard_ranges, 0, num_possible_cpus() * sizeof(*scx_cid_shard_ranges));
+	for (si = 0; si < num_possible_cpus(); si++)
+		scx_shard_node[si] = NUMA_NO_NODE;
+
 	if (!zalloc_cpumask_var(&to_walk, GFP_KERNEL) ||
 	    !zalloc_cpumask_var(&node_scratch, GFP_KERNEL) ||
 	    !zalloc_cpumask_var(&llc_scratch, GFP_KERNEL) ||
@@ -142,29 +197,60 @@ s32 scx_cid_init(struct scx_sched *sch)
 			const struct cpumask *llc_mask = cpu_llc_mask(ncpu, llc_fallback);
 			s32 llc_cid = next_cid;
 			s32 llc_idx = next_llc_idx++;
+			u32 cores_per_shard, nr_large;
+			u32 shard_local = 0, cores_in_shard = 0, cids_in_shard = 0;
+			s32 shard_cid, shard_idx;
 
 			/* llc_scratch = node_scratch & this llc */
 			cpumask_and(llc_scratch, node_scratch, llc_mask);
 			if (WARN_ON_ONCE(!cpumask_test_cpu(ncpu, llc_scratch)))
 				return -EINVAL;
 
+			calc_shard_layout(llc_scratch, shard_size, &cores_per_shard, &nr_large);
+			shard_cid = next_cid;
+			shard_idx = next_shard_idx++;
+			scx_shard_node[shard_idx] = nid;
+
 			while (!cpumask_empty(llc_scratch)) {
 				s32 lcpu = cpumask_first(llc_scratch);
 				const struct cpumask *sib = topology_sibling_cpumask(lcpu);
 				s32 core_cid = next_cid;
 				s32 core_idx = next_core_idx++;
 				s32 ccpu;
+				u32 max_cores, cids_in_core;
 
 				/* core_scratch = llc_scratch & this core */
 				cpumask_and(core_scratch, llc_scratch, sib);
 				if (WARN_ON_ONCE(!cpumask_test_cpu(lcpu, core_scratch)))
 					return -EINVAL;
 
+				/*
+				 * Advance to a new shard when either core or
+				 * cid count reaches max. The latter bounds
+				 * shard sizes under uneven SMT. Never start an
+				 * empty shard.
+				 */
+				cids_in_core = cpumask_weight(core_scratch);
+				max_cores = cores_per_shard + (shard_local < nr_large ? 1 : 0);
+				if (cores_in_shard &&
+				    (cores_in_shard >= max_cores ||
+				     cids_in_shard + cids_in_core > max_cids)) {
+					shard_local++;
+					cores_in_shard = 0;
+					cids_in_shard = 0;
+					shard_cid = next_cid;
+					shard_idx = next_shard_idx++;
+					scx_shard_node[shard_idx] = nid;
+				}
+				cores_in_shard++;
+				cids_in_shard += cids_in_core;
+
 				for_each_cpu(ccpu, core_scratch) {
 					s32 cid = next_cid++;
 
 					scx_cid_to_cpu_tbl[cid] = ccpu;
 					scx_cpu_to_cid_tbl[ccpu] = cid;
+					scx_cid_to_shard[cid] = shard_idx;
 					scx_cid_topo[cid] = (struct scx_cid_topo){
 						.core_cid = core_cid,
 						.core_idx = core_idx,
@@ -172,6 +258,8 @@ s32 scx_cid_init(struct scx_sched *sch)
 						.llc_idx = llc_idx,
 						.node_cid = node_cid,
 						.node_idx = node_idx,
+						.shard_cid = shard_cid,
+						.shard_idx = shard_idx,
 					};
 
 					cpumask_clear_cpu(ccpu, llc_scratch);
@@ -184,12 +272,17 @@ s32 scx_cid_init(struct scx_sched *sch)
 
 	/*
 	 * No-topo section: any possible cpu without a cid - normally just the
-	 * not-online ones. Collect any currently-online cpus that land here in
-	 * @online_no_topo so we can warn about them at the end.
+	 * not-online ones. Pack into shards of up to min(@shard_size,
+	 * SCX_CID_SHARD_MAX_CPUS) cids so that every cid has a valid shard
+	 * assignment and the hard cap holds even with a large @shard_size.
+	 * Collect any currently-online cpus that land here in @online_no_topo
+	 * so we can warn about them at the end.
 	 */
-	for_each_cpu(cpu, cpu_possible_mask) {
-		s32 cid;
+	notopo_in_shard = min_t(u32, shard_size, SCX_CID_SHARD_MAX_CPUS);
+	notopo_shard_cid = -1;
+	notopo_shard_idx = -1;
 
+	for_each_cpu(cpu, cpu_possible_mask) {
 		if (__scx_cpu_to_cid(cpu) != -1)
 			continue;
 		if (cpu_online(cpu))
@@ -198,7 +291,18 @@ s32 scx_cid_init(struct scx_sched *sch)
 		cid = next_cid++;
 		scx_cid_to_cpu_tbl[cid] = cpu;
 		scx_cpu_to_cid_tbl[cpu] = cid;
+
+		if (notopo_in_shard >= min_t(u32, shard_size, SCX_CID_SHARD_MAX_CPUS)) {
+			notopo_shard_cid = cid;
+			notopo_shard_idx = next_shard_idx++;
+			notopo_in_shard = 0;
+		}
+		notopo_in_shard++;
+
+		scx_cid_to_shard[cid] = notopo_shard_idx;
 		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+		scx_cid_topo[cid].shard_cid = notopo_shard_cid;
+		scx_cid_topo[cid].shard_idx = notopo_shard_idx;
 	}
 
 	if (!cpumask_empty(llc_fallback))
@@ -208,6 +312,20 @@ s32 scx_cid_init(struct scx_sched *sch)
 		pr_warn("scx_cid: online cpus with no usable topology: %*pbl\n",
 			cpumask_pr_args(online_no_topo));
 
+	/*
+	 * Fill cid_shard_ranges[] from cid_to_shard[]. Shards are contiguous
+	 * cid ranges by construction: base_cid is the first cid landing in a
+	 * shard, nr_cids is the count.
+	 */
+	for (cid = 0; cid < next_cid; cid++) {
+		s32 sidx = scx_cid_to_shard[cid];
+
+		if (scx_cid_shard_ranges[sidx].nr_cids == 0)
+			scx_cid_shard_ranges[sidx].base_cid = cid;
+		scx_cid_shard_ranges[sidx].nr_cids++;
+	}
+
+	scx_nr_cid_shards = next_shard_idx;
 	return 0;
 }
 
diff --git a/kernel/sched/ext/cid.h b/kernel/sched/ext/cid.h
index cd0d4b9f1088..cdc18a7a48f5 100644
--- a/kernel/sched/ext/cid.h
+++ b/kernel/sched/ext/cid.h
@@ -48,8 +48,12 @@ struct scx_sched;
  * See the comment above the table definitions in cid.c for the
  * memory-ordering and visibility contract.
  */
+extern u32 scx_nr_cid_shards;
 extern s16 *scx_cid_to_cpu_tbl;
 extern s16 *scx_cpu_to_cid_tbl;
+extern s32 *scx_cid_to_shard;
+extern s32 *scx_shard_node;
+extern struct scx_cid_shard *scx_cid_shard_ranges;
 extern struct scx_cid_topo *scx_cid_topo;
 extern struct btf_id_set8 scx_kfunc_ids_init_cids;
 
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 29bddfb52243..87a3fb9bb446 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -7141,6 +7141,9 @@ static int bpf_scx_init_member(const struct btf_type *t,
 	case offsetof(struct sched_ext_ops, hotplug_seq):
 		ops->hotplug_seq = *(u64 *)(udata + moff);
 		return 1;
+	case offsetof(struct sched_ext_ops, cid_shard_size):
+		ops->cid_shard_size = *(u32 *)(udata + moff);
+		return 1;
 #ifdef CONFIG_EXT_SUB_SCHED
 	case offsetof(struct sched_ext_ops, sub_cgroup_id):
 		ops->sub_cgroup_id = *(u64 *)(udata + moff);
@@ -9825,6 +9828,7 @@ static int __init scx_init(void)
 	CID_OFFSET_MATCH(timeout_ms, timeout_ms);
 	CID_OFFSET_MATCH(exit_dump_len, exit_dump_len);
 	CID_OFFSET_MATCH(hotplug_seq, hotplug_seq);
+	CID_OFFSET_MATCH(cid_shard_size, cid_shard_size);
 	CID_OFFSET_MATCH(sub_cgroup_id, sub_cgroup_id);
 	/* shared callbacks: the union view requires byte-for-byte offset match */
 	CID_OFFSET_MATCH(enqueue, enqueue);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index c8c3c6cb647d..ba5e9be0e3c3 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -845,6 +845,18 @@ struct sched_ext_ops {
 	 */
 	u64 hotplug_seq;
 
+	/**
+	 * @cid_shard_size: Target number of CIDs per shard
+	 *
+	 * Shards are contiguous CID ranges used as operation and locking
+	 * domains for sub-scheduling. Each LLC is divided into ceil(nr_cpus /
+	 * @cid_shard_size) shards, then cores are distributed across them
+	 * evenly. If one core has more logical CPUs than @cid_shard_size, its
+	 * shard will become larger than @cid_shard_size. Values above
+	 * SCX_CID_SHARD_MAX_CPUS are capped. 0 means use the default (24).
+	 */
+	u32 cid_shard_size;
+
 	/**
 	 * @cgroup_id: When >1, attach the scheduler as a sub-scheduler on the
 	 * specified cgroup.
@@ -977,6 +989,7 @@ struct sched_ext_ops_cid {
 	u32 timeout_ms;
 	u32 exit_dump_len;
 	u64 hotplug_seq;
+	u32 cid_shard_size;
 	u64 sub_cgroup_id;
 	char name[SCX_OPS_NAME_LEN];
 
diff --git a/kernel/sched/ext/types.h b/kernel/sched/ext/types.h
index bc74eafd43f1..b31d12931999 100644
--- a/kernel/sched/ext/types.h
+++ b/kernel/sched/ext/types.h
@@ -47,11 +47,16 @@ enum scx_consts {
 };
 
 /*
- * Per-cid topology info. For each topology level (core, LLC, node), records
- * the first cid in the unit and its global index. Global indices are
- * consecutive integers assigned in cid-walk order, so e.g. core_idx ranges
- * over [0, nr_cores_at_init) with no gaps. No-topo cids have all fields set
- * to -1.
+ * Per-cid topology info. For each topology level (core, LLC, node) and shard,
+ * records the first cid in the unit and its global index. Global indices are
+ * consecutive integers assigned in cid-walk order, so e.g. core_idx ranges over
+ * [0, nr_cores_at_init) with no gaps. No-topo cids have core/LLC/node fields
+ * set to -1 but always have valid shard assignments.
+ *
+ * Shards are contiguous CID ranges used as scalable locking/work domains for
+ * sub-scheduler operations. By default each LLC becomes one shard, split into
+ * smaller shards if the LLC exceeds the target size. No-topo cids are packed
+ * into their own max-sized shards.
  *
  * @core_cid: first cid of this cid's core (smt-sibling group)
  * @core_idx: global index of that core, in [0, nr_cores_at_init)
@@ -59,6 +64,8 @@ enum scx_consts {
  * @llc_idx: global index of that LLC, in [0, nr_llcs_at_init)
  * @node_cid: first cid of this cid's NUMA node
  * @node_idx: global index of that node, in [0, nr_nodes_at_init)
+ * @shard_cid: first cid of this cid's shard
+ * @shard_idx: global index of that shard, in [0, scx_nr_cid_shards)
  */
 struct scx_cid_topo {
 	s32 core_cid;
@@ -67,6 +74,24 @@ struct scx_cid_topo {
 	s32 llc_idx;
 	s32 node_cid;
 	s32 node_idx;
+	s32 shard_cid;
+	s32 shard_idx;
+};
+
+enum scx_cid_consts {
+	SCX_CID_SHARD_SIZE_DFL		= 24,
+	SCX_CID_SHARD_MAX_CPUS		= 512,
+};
+
+/*
+ * Per-shard metadata for O(1) shard->cid-range lookup.
+ *
+ * @base_cid: first cid of the shard
+ * @nr_cids: number of cids in the shard
+ */
+struct scx_cid_shard {
+	s32			base_cid;
+	s32			nr_cids;
 };
 
 /*
-- 
2.54.0



* [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override()
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (8 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 09/32] sched_ext: Add CID sharding Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  9:51   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 11/32] sched_ext: Defer scx_sched kobj sysfs add into the enable workfns Tejun Heo
                   ` (21 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

An overridden cid mapping invalidates the auto-generated shard layout, so
the override call has to provide both. Extend scx_bpf_cid_override() with a
shard_start[] array that lists the first cid of each shard (starting at 0,
strictly increasing, last shard implicitly extends to num_possible_cpus()).
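
The constraints on shard_start[] can be summarized with a small
userspace checker. This is only a sketch of the rules listed above, not
the kfunc itself: shard_start_valid and SHARD_MAX_CPUS are illustrative
names, and the real code reports each failure through scx_error()
instead of returning a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative copy of SCX_CID_SHARD_MAX_CPUS. */
#define SHARD_MAX_CPUS 512

/*
 * Return true if @shard_start[0..nr_shards) is a valid layout over
 * @nr_cids cids: starts at 0, strictly increasing, every start in range,
 * and no shard (including the implicit last one) wider than the cap.
 */
static bool shard_start_valid(const int32_t *shard_start, size_t nr_shards,
			      int32_t nr_cids)
{
	size_t i;

	assert(shard_start != NULL);

	if (nr_shards == 0 || shard_start[0] != 0)
		return false;

	for (i = 1; i < nr_shards; i++) {
		if (shard_start[i] <= shard_start[i - 1])
			return false;	/* not strictly increasing */
		if (shard_start[i] >= nr_cids)
			return false;	/* start beyond the cid space */
		if (shard_start[i] - shard_start[i - 1] > SHARD_MAX_CPUS)
			return false;	/* previous shard too wide */
	}

	/* the last shard implicitly runs to the end of the cid space */
	return nr_cids - shard_start[nr_shards - 1] <= SHARD_MAX_CPUS;
}
```

With 12 cids, { 0, 4, 8 } passes while { 0, 8, 4 } fails the
monotonicity check, which is the case the bad-mono test mode exercises.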

A scheduler that wants only custom shards with the auto-generated cid
mapping can read the current mapping and pass it back unchanged.

Overridden shards can span NUMA nodes, so scx_shard_node[] is rebuilt by
majority count: each shard is assigned to the node that owns the most cpus
in it. Ties resolve to the lowest node id, and a shard with no cpu on a
valid node gets NUMA_NO_NODE.
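
The majority rule can be sketched in userspace as follows. The names
shard_majority_node, NO_NODE and MAX_NODES are illustrative only, and the
per-cid node array stands in for cpu_to_node() applied to the cid's cpu;
the kernel version counts into a kcalloc'd array of nr_node_ids entries
rather than a fixed-size one.

```c
#include <assert.h>
#include <stdint.h>

#define NO_NODE   (-1)	/* stands in for NUMA_NO_NODE */
#define MAX_NODES 64	/* arbitrary bound for this sketch */

/*
 * Return the node that owns the most cids in [begin, end), using
 * @cid_node[cid] as the node of each cid. Ties resolve to the lowest node
 * id; NO_NODE is returned if no cid in the range has a valid node.
 */
static int32_t shard_majority_node(const int32_t *cid_node, int32_t begin,
				   int32_t end)
{
	uint32_t counts[MAX_NODES] = { 0 };
	uint32_t best_count = 0;
	int32_t best = NO_NODE;
	int32_t cid, node;

	assert(cid_node && begin <= end);

	for (cid = begin; cid < end; cid++) {
		node = cid_node[cid];
		if (node >= 0 && node < MAX_NODES)
			counts[node]++;
	}

	/* strict '>' keeps the lowest node id on ties */
	for (node = 0; node < MAX_NODES; node++) {
		if (counts[node] > best_count) {
			best_count = counts[node];
			best = node;
		}
	}
	return best;
}
```

For a shard whose cids sit on nodes { 0, 0, 1, 1, 1 }, node 1 wins with
three cids. Dropping the last cid leaves a 2:2 tie, which goes to node 0.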

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/cid.c                   | 136 +++++++++++++++++++++--
 tools/sched_ext/include/scx/compat.bpf.h |  11 +-
 tools/sched_ext/scx_qmap.bpf.c           |  16 ++-
 tools/sched_ext/scx_qmap.c               |  34 +++++-
 tools/sched_ext/scx_qmap.h               |   1 +
 5 files changed, 174 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index 9d75b9311978..bd0467e8a8d2 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -392,29 +392,58 @@ void scx_cpumask_to_cmask(const struct cpumask *src, struct scx_cmask *dst)
 	}
 }
 
+/*
+ * Return the index of the largest entry in @counts, or NUMA_NO_NODE if all
+ * entries are zero. Ties resolve to the lowest index.
+ */
+static s32 pick_max_node(const u32 *counts, u32 n)
+{
+	s32 best = NUMA_NO_NODE;
+	u32 best_count = 0, i;
+
+	for (i = 0; i < n; i++) {
+		if (counts[i] > best_count) {
+			best_count = counts[i];
+			best = i;
+		}
+	}
+	return best;
+}
+
 __bpf_kfunc_start_defs();
 
 /**
- * scx_bpf_cid_override - Install an explicit cpu->cid mapping
+ * scx_bpf_cid_override - Install an explicit cpu->cid mapping with shard info
  * @cpu_to_cid: array of nr_cpu_ids s32 entries (cid for each cpu)
  * @cpu_to_cid__sz: must be nr_cpu_ids * sizeof(s32) bytes
+ * @shard_start: array of first-cid-of-each-shard, strictly increasing from 0
+ * @shard_start__sz: nr_shards * sizeof(s32) bytes
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
  * May only be called from ops.init_cids() of the root scheduler. Replace the
- * topology-probed cid mapping with the caller-provided one. Each possible cpu
- * must map to a unique cid in [0, num_possible_cpus()). Topo info is cleared.
- * On invalid input, trigger scx_error() to abort the scheduler.
+ * topology-probed cid mapping and shard layout with caller-provided ones. Each
+ * possible cpu must map to a unique cid in [0, num_possible_cpus()).
+ * @shard_start must be strictly increasing with shard_start[0] == 0 and all
+ * values < num_possible_cpus(). The last shard extends to num_possible_cpus()
+ * and no shard may span more than SCX_CID_SHARD_MAX_CPUS cids. Topo info
+ * (core/LLC/node) is cleared and shard info is set from @shard_start. On
+ * invalid input, abort the scheduler.
  */
 __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
-				      const struct bpf_prog_aux *aux)
+				       const s32 *shard_start, u32 shard_start__sz,
+				       const struct bpf_prog_aux *aux)
 {
 	cpumask_var_t seen __free(free_cpumask_var) = CPUMASK_VAR_NULL;
+	u32 *node_counts __free(kfree) = NULL;
+	u32 npossible = num_possible_cpus();
 	struct scx_sched *sch;
+	u32 nr_shards;
 	bool alloced;
-	s32 cpu, cid;
+	s32 cpu, cid, si;
 
-	/* GFP_KERNEL alloc must happen before the rcu read section */
+	/* GFP_KERNEL allocs must happen before the rcu read section */
 	alloced = zalloc_cpumask_var(&seen, GFP_KERNEL);
+	node_counts = kcalloc(nr_node_ids, sizeof(*node_counts), GFP_KERNEL);
 
 	guard(rcu)();
 
@@ -422,17 +451,57 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
 	if (unlikely(!sch))
 		return;
 
-	if (!alloced) {
-		scx_error(sch, "scx_bpf_cid_override: failed to allocate cpumask");
+	if (!alloced || !node_counts) {
+		scx_error(sch, "scx_bpf_cid_override: allocation failed");
 		return;
 	}
 
 	if (cpu_to_cid__sz != nr_cpu_ids * sizeof(s32)) {
-		scx_error(sch, "scx_bpf_cid_override: expected %zu bytes, got %u",
+		scx_error(sch, "scx_bpf_cid_override: cpu_to_cid expected %zu bytes, got %u",
 			  nr_cpu_ids * sizeof(s32), cpu_to_cid__sz);
 		return;
 	}
 
+	if (!shard_start__sz || shard_start__sz % sizeof(s32)) {
+		scx_error(sch, "scx_bpf_cid_override: invalid shard_start size %u",
+			  shard_start__sz);
+		return;
+	}
+
+	nr_shards = shard_start__sz / sizeof(s32);
+
+	/* validate shard_start[]: starts at 0, strictly increasing, in range */
+	if (shard_start[0] != 0) {
+		scx_error(sch, "scx_bpf_cid_override: shard_start[0] must be 0, got %d",
+			  shard_start[0]);
+		return;
+	}
+	for (si = 1; si < nr_shards; si++) {
+		if (shard_start[si] <= shard_start[si - 1]) {
+			scx_error(sch, "scx_bpf_cid_override: shard_start not increasing at [%d]",
+				  si);
+			return;
+		}
+		if (shard_start[si] >= npossible) {
+			scx_error(sch, "scx_bpf_cid_override: shard_start[%d]=%d >= %u",
+				  si, shard_start[si], npossible);
+			return;
+		}
+		if (shard_start[si] - shard_start[si - 1] > SCX_CID_SHARD_MAX_CPUS) {
+			scx_error(sch, "scx_bpf_cid_override: shard[%d] span %d exceeds max %d",
+				  si - 1, shard_start[si] - shard_start[si - 1],
+				  SCX_CID_SHARD_MAX_CPUS);
+			return;
+		}
+	}
+	if (npossible - shard_start[nr_shards - 1] > SCX_CID_SHARD_MAX_CPUS) {
+		scx_error(sch, "scx_bpf_cid_override: shard[%d] span %d exceeds max %d",
+			  nr_shards - 1, npossible - shard_start[nr_shards - 1],
+			  SCX_CID_SHARD_MAX_CPUS);
+		return;
+	}
+
+	/* Validate first so that invalid input leaves globals untouched. */
 	for_each_possible_cpu(cpu) {
 		s32 c = cpu_to_cid[cpu];
 
@@ -442,13 +511,56 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
 			scx_error(sch, "cid %d assigned to multiple cpus", c);
 			return;
 		}
+	}
+
+	for_each_possible_cpu(cpu) {
+		s32 c = cpu_to_cid[cpu];
+
 		scx_cpu_to_cid_tbl[cpu] = c;
 		scx_cid_to_cpu_tbl[c] = cpu;
 	}
 
-	/* Invalidate stale topo info - the override carries no topology. */
-	for (cid = 0; cid < num_possible_cpus(); cid++)
+	/*
+	 * Derive scx_shard_node[] by majority count: an overridden shard may
+	 * span NUMA nodes, so assign each to the node that owns the most cpus.
+	 */
+	for (si = 0; si < nr_shards; si++) {
+		u32 end = (si + 1 < nr_shards) ? shard_start[si + 1] : npossible;
+
+		memset(node_counts, 0, nr_node_ids * sizeof(*node_counts));
+		for (cid = shard_start[si]; cid < end; cid++) {
+			s32 node = cpu_to_node(scx_cid_to_cpu_tbl[cid]);
+
+			if (numa_valid_node(node))
+				node_counts[node]++;
+		}
+		scx_shard_node[si] = pick_max_node(node_counts, nr_node_ids);
+	}
+
+	/*
+	 * Invalidate stale topo info and install shard layout from
+	 * @shard_start. Walk shards to derive shard_cid/shard_idx for each cid.
+	 */
+	si = 0;
+	for (cid = 0; cid < npossible; cid++) {
+		if (si + 1 < nr_shards && cid >= shard_start[si + 1])
+			si++;
+		scx_cid_to_shard[cid] = si;
 		scx_cid_topo[cid] = SCX_CID_TOPO_NEG;
+		scx_cid_topo[cid].shard_cid = shard_start[si];
+		scx_cid_topo[cid].shard_idx = si;
+	}
+
+	/* Rebuild scx_cid_shard_ranges[] for the new layout. */
+	memset(scx_cid_shard_ranges, 0, npossible * sizeof(*scx_cid_shard_ranges));
+	for (si = 0; si < nr_shards; si++) {
+		u32 end = (si + 1 < nr_shards) ? shard_start[si + 1] : npossible;
+
+		scx_cid_shard_ranges[si].base_cid = shard_start[si];
+		scx_cid_shard_ranges[si].nr_cids = end - shard_start[si];
+	}
+
+	scx_nr_cid_shards = nr_shards;
 }
 
 /**
diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
index 133058578668..cf469d5ff9ca 100644
--- a/tools/sched_ext/include/scx/compat.bpf.h
+++ b/tools/sched_ext/include/scx/compat.bpf.h
@@ -122,15 +122,18 @@ static inline bool scx_bpf_sub_dispatch(u64 cgroup_id)
 }
 
 /*
- * v7.2: scx_bpf_cid_override() for explicit cpu->cid mapping. Ignore if
+ * v7.3: scx_bpf_cid_override() for explicit cid and shard mapping. Ignore if
  * missing.
  */
-void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz) __ksym __weak;
+void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+				    const s32 *shard_start, u32 shard_start__sz) __ksym __weak;
 
-static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
+static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
+					 const s32 *shard_start, u32 shard_start__sz)
 {
 	if (bpf_ksym_exists(scx_bpf_cid_override___compat))
-		return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz);
+		scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz,
+					      shard_start, shard_start__sz);
 }
 
 /**
diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 2df7c53992dc..f6cfe63425d3 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -54,18 +54,20 @@ const volatile u32 max_tasks;
 
 /*
  * Optional cid-override test harness. When cid_override_mode is non-zero,
- * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied
- * cpu_to_cid array to exercise the kfunc's acceptance and error paths. See enum
+ * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied arrays
+ * to exercise the kfunc's acceptance and error paths. See enum
  * qmap_cid_override for the modes.
  */
 const volatile u32 cid_override_mode;
+const volatile u32 cid_override_nr_shards;
 /*
- * Array lives in bss (writable) because scx_bpf_cid_override()'s BPF
- * verifier signature treats its len-paired pointer as read/write - rodata
+ * Arrays live in bss (writable) because scx_bpf_cid_override()'s BPF
+ * verifier signature treats its len-paired pointers as read/write - rodata
  * fails verification with "write into map forbidden". Userspace populates
- * it before SCX_OPS_LOAD, same as rodata, and nothing writes it after.
+ * them before SCX_OPS_LOAD, same as rodata, and nothing writes them after.
  */
 s32 cid_override_cpu_to_cid[SCX_QMAP_MAX_CPUS];
+s32 cid_override_shard_start[SCX_QMAP_MAX_CPUS];
 
 UEI_DEFINE(uei);
 
@@ -1082,7 +1084,9 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init_cids)
 	}
 
 	scx_bpf_cid_override((const s32 *)cid_override_cpu_to_cid,
-			     nr_cpu_ids * sizeof(s32));
+			     nr_cpu_ids * sizeof(s32),
+			     (const s32 *)cid_override_shard_start,
+			     cid_override_nr_shards * sizeof(s32));
 	return 0;
 }
 
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index c0b5cab579d6..9124183bffec 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -43,7 +43,7 @@ const char help_fmt[] =
 "  -p            Switch only tasks on SCHED_EXT policy instead of all\n"
 "  -I            Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n"
 "  -F COUNT      IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n"
-"  -C MODE       cid-override test (shuffle|bad-dup|bad-range)\n"
+"  -C MODE       cid-override test (shuffle|bad-dup|bad-range|bad-mono)\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 
@@ -155,6 +155,7 @@ int main(int argc, char **argv)
 		case 'C': {
 			u32 nr_cpus = libbpf_num_possible_cpus();
 			u32 mode, i;
+			s32 shard_sz = 4;
 
 			if (!strcmp(optarg, "shuffle"))
 				mode = QMAP_CID_OVR_SHUFFLE;
@@ -162,13 +163,15 @@ int main(int argc, char **argv)
 				mode = QMAP_CID_OVR_BAD_DUP;
 			else if (!strcmp(optarg, "bad-range"))
 				mode = QMAP_CID_OVR_BAD_RANGE;
+			else if (!strcmp(optarg, "bad-mono"))
+				mode = QMAP_CID_OVR_BAD_MONO;
 			else {
 				fprintf(stderr, "unknown cid-override mode '%s'\n", optarg);
 				return 1;
 			}
 			skel->rodata->cid_override_mode = mode;
 
-			/* shuffle: reversed cpu_to_cid, bad-dup: dup cid 0, bad-range: identity */
+			/* shuffle: reversed cpu_to_cid; others: identity */
 			for (i = 0; i < nr_cpus; i++) {
 				if (mode == QMAP_CID_OVR_SHUFFLE)
 					skel->bss->cid_override_cpu_to_cid[i] = nr_cpus - 1 - i;
@@ -179,6 +182,33 @@ int main(int argc, char **argv)
 				skel->bss->cid_override_cpu_to_cid[1] = 0;
 			if (mode == QMAP_CID_OVR_BAD_RANGE)
 				skel->bss->cid_override_cpu_to_cid[0] = (s32)nr_cpus;
+
+			/*
+			 * bad-mono needs >= 3 shards to build a 0-based but
+			 * non-monotonic shard_start. Size shards at nr_cpus / 3
+			 * so the test runs on any machine with >= 3 cpus.
+			 */
+			if (mode == QMAP_CID_OVR_BAD_MONO) {
+				if (nr_cpus < 3) {
+					fprintf(stderr, "bad-mono needs >= 3 cpus (have %u)\n",
+						nr_cpus);
+					return 1;
+				}
+				shard_sz = nr_cpus / 3;
+			}
+
+			/* shards of shard_sz each */
+			skel->rodata->cid_override_nr_shards = (nr_cpus + shard_sz - 1) / shard_sz;
+			for (i = 0; i < skel->rodata->cid_override_nr_shards; i++)
+				skel->bss->cid_override_shard_start[i] = i * shard_sz;
+
+			if (mode == QMAP_CID_OVR_BAD_MONO) {
+				/* swap [1] and [2] to break monotonicity */
+				s32 tmp = skel->bss->cid_override_shard_start[1];
+				skel->bss->cid_override_shard_start[1] =
+					skel->bss->cid_override_shard_start[2];
+				skel->bss->cid_override_shard_start[2] = tmp;
+			}
 			break;
 		}
 		case 'v':
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 3bcc3579839d..6c3ea1fc74ed 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -33,6 +33,7 @@ enum qmap_cid_override {
 	QMAP_CID_OVR_SHUFFLE	= 1,	/* valid reversed cpu->cid mapping */
 	QMAP_CID_OVR_BAD_DUP	= 2,	/* invalid: duplicate cid assignment */
 	QMAP_CID_OVR_BAD_RANGE	= 3,	/* invalid: out-of-range cid */
+	QMAP_CID_OVR_BAD_MONO	= 4,	/* invalid: non-monotonic shard_start */
 };
 
 struct cpu_ctx {
-- 
2.54.0



* [PATCH sched_ext/for-7.3 11/32] sched_ext: Defer scx_sched kobj sysfs add into the enable workfns
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (9 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override() Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 12/32] sched_ext: Add per-shard scx_sched storage scaffolding Tejun Heo
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Split kobject_init_and_add() in scx_alloc_and_add_sched() so that only
kobject_init() runs there. A new scx_sched_sysfs_add() helper does the
kobject_add() (and creates sub_kset when the scheduler implements
ops.sub_attach). Both enable workfns call it once @sch is linked and its
sysfs-visible state is initialized. This is preparation so that a future
caps attribute can rely on @sch being fully built by the time it is
sysfs-visible. The add still runs early enough in enable that a later stall
leaves sysfs inspectable.
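
For readers unfamiliar with the init/add split, here is a minimal userspace
sketch of the resulting lifecycle. It is not kernel code: toy_kobj, toy_sched
and the toy_*() helpers are hypothetical stand-ins, and only the
state_in_sysfs check mirrors what the patch tests on the disable paths.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct kobject; models only the bit the patch tests. */
struct toy_kobj {
	bool initialized;
	bool state_in_sysfs;
};

struct toy_sched {
	struct toy_kobj kobj;
	int del_calls;	/* counts toy "kobject_del()" invocations */
};

/* Allocation time: init only, nothing becomes visible yet. */
static void toy_alloc_sched(struct toy_sched *sch)
{
	sch->kobj.initialized = true;
	sch->kobj.state_in_sysfs = false;
	sch->del_calls = 0;
}

/* Enable time: publish once the rest of @sch is set up. */
static int toy_sysfs_add(struct toy_sched *sch)
{
	if (!sch->kobj.initialized)
		return -1;
	sch->kobj.state_in_sysfs = true;
	return 0;
}

/* Disable: enable may have failed before the add, so check first. */
static void toy_disable(struct toy_sched *sch)
{
	if (sch->kobj.state_in_sysfs) {
		sch->kobj.state_in_sysfs = false;
		sch->del_calls++;
	}
}
```

In this model, calling toy_disable() on a sched that never reached
toy_sysfs_add() performs no delete, which is the case the new
state_in_sysfs guards are meant to cover.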

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      | 73 +++++++++++++++++++++++--------------
 kernel/sched/ext/internal.h |  1 +
 kernel/sched/ext/sub.c      |  8 +++-
 3 files changed, 53 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 87a3fb9bb446..fcb8bf0d2422 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -5704,7 +5704,9 @@ static void scx_root_disable(struct scx_sched *sch)
 	if (sch->sub_kset)
 		kobject_del(&sch->sub_kset->kobj);
 #endif
-	kobject_del(&sch->kobj);
+	/* not added if enable failed before scx_sched_sysfs_add() */
+	if (sch->kobj.state_in_sysfs)
+		kobject_del(&sch->kobj);
 
 	free_kick_syncs();
 
@@ -6421,36 +6423,15 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 		 * disable. Released in scx_sched_free_rcu_work().
 		 */
 		kobject_get(&parent->kobj);
-		ret = kobject_init_and_add(&sch->kobj, &scx_ktype,
-					   &parent->sub_kset->kobj,
-					   "sub-%llu", cgroup_id(cgrp));
-	} else {
-		ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");
-	}
-
-	if (ret < 0) {
-		RCU_INIT_POINTER(ops->priv, NULL);
-		kobject_put(&sch->kobj);
-		return ERR_PTR(ret);
-	}
-
-	if (ops->sub_attach) {
-		sch->sub_kset = kset_create_and_add("sub", NULL, &sch->kobj);
-		if (!sch->sub_kset) {
-			RCU_INIT_POINTER(ops->priv, NULL);
-			kobject_put(&sch->kobj);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
-#else	/* CONFIG_EXT_SUB_SCHED */
-	ret = kobject_init_and_add(&sch->kobj, &scx_ktype, NULL, "root");
-	if (ret < 0) {
-		RCU_INIT_POINTER(ops->priv, NULL);
-		kobject_put(&sch->kobj);
-		return ERR_PTR(ret);
 	}
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
+	/*
+	 * Init the kobj but don't add to sysfs yet. The enable path calls
+	 * scx_sched_sysfs_add() once @sch's sysfs-visible state is initialized.
+	 */
+	kobject_init(&sch->kobj, &scx_ktype);
+
 	/*
 	 * Consume the arena_map ref bpf_scx_reg_cid() took. Defer to here so
 	 * earlier failure paths leave cmd->arena_map set and bpf_scx_reg_cid
@@ -6508,6 +6489,36 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 	return ERR_PTR(ret);
 }
 
+/*
+ * Add @sch's kobject to sysfs, and create its sub_kset if the scheduler
+ * implements ops.sub_attach. Called by the enable workfns once @sch's
+ * sysfs-visible state is initialized.
+ */
+int scx_sched_sysfs_add(struct scx_sched *sch)
+{
+#ifdef CONFIG_EXT_SUB_SCHED
+	struct scx_sched *parent = scx_parent(sch);
+	int ret;
+
+	if (parent)
+		ret = kobject_add(&sch->kobj, &parent->sub_kset->kobj,
+				  "sub-%llu", cgroup_id(sch_cgroup(sch)));
+	else
+		ret = kobject_add(&sch->kobj, NULL, "root");
+	if (ret < 0)
+		return ret;
+
+	if (sch->ops.sub_attach) {
+		sch->sub_kset = kset_create_and_add("sub", NULL, &sch->kobj);
+		if (!sch->sub_kset)
+			return -ENOMEM;
+	}
+	return 0;
+#else
+	return kobject_add(&sch->kobj, NULL, "root");
+#endif
+}
+
 static int check_hotplug_seq(struct scx_sched *sch,
 			      const struct sched_ext_ops *ops)
 {
@@ -6730,6 +6741,12 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
 	}
 
+	ret = scx_sched_sysfs_add(sch);
+	if (ret) {
+		cpus_read_unlock();
+		goto err_disable;
+	}
+
 	for (i = SCX_OPI_CPU_HOTPLUG_BEGIN; i < SCX_OPI_CPU_HOTPLUG_END; i++)
 		if (((void (**)(void))ops)[i])
 			set_bit(i, sch->has_op);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index ba5e9be0e3c3..7c6f4ed10cde 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1689,6 +1689,7 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 					  struct cgroup *cgrp,
 					  struct scx_sched *parent);
 int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
+int scx_sched_sysfs_add(struct scx_sched *sch);
 
 extern raw_spinlock_t scx_sched_lock;
 extern struct mutex scx_enable_mutex;
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 050420427273..e94a415ee10a 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -270,7 +270,9 @@ void scx_sub_disable(struct scx_sched *sch)
 		SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
 	if (sch->sub_kset)
 		kobject_del(&sch->sub_kset->kobj);
-	kobject_del(&sch->kobj);
+	/* not added if enable failed before scx_sched_sysfs_add() */
+	if (sch->kobj.state_in_sysfs)
+		kobject_del(&sch->kobj);
 }
 
 /* verify that a scheduler can be attached to @cgrp and return the parent */
@@ -363,6 +365,10 @@ void scx_sub_enable_workfn(struct kthread_work *work)
 	if (ret)
 		goto err_disable;
 
+	ret = scx_sched_sysfs_add(sch);
+	if (ret)
+		goto err_disable;
+
 	if (sch->level >= SCX_SUB_MAX_DEPTH) {
 		scx_error(sch, "max nesting depth %d violated",
 			  SCX_SUB_MAX_DEPTH);
-- 
2.54.0



* [PATCH sched_ext/for-7.3 12/32] sched_ext: Add per-shard scx_sched storage scaffolding
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (10 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 11/32] sched_ext: Defer scx_sched kobj sysfs add into the enable workfns Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 13/32] sched_ext: Add scx_cmask_ref for validated arena cmask access Tejun Heo
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Add struct scx_pshard and sch->pshard[] indexed by shard_idx, each entry
allocated on its shard's NUMA node from scx_shard_node[si]. The struct
starts empty (one dummy field). Follow-up patches will grow it as
shard-local state lands. Only cid-type schedulers with an arena pool get
pshards.

Allocation happens after ops.init_cids() returns so any
scx_bpf_cid_override() it issues has finalized scx_nr_cid_shards and
scx_shard_node[]. sch->nr_pshards records the array size for the async RCU
free path, which may run after a later scheduler's scx_cid_init() has
rewritten scx_nr_cid_shards.
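
The reason for capturing the size can be shown with a small userspace
sketch. All names here (toy_nr_shards, shard_owner, the toy_*() helpers) are
hypothetical, calloc()/free() stand in for the kernel allocators, and the
free counter exists only to make the behavior observable.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a global shard count that later loads may rewrite. */
static unsigned int toy_nr_shards;
static unsigned int toy_shard_frees;	/* instrumentation for the sketch */

struct shard_owner {
	int **pshard;
	unsigned int nr_pshards;	/* array size captured at alloc time */
};

static int toy_alloc_pshards(struct shard_owner *o)
{
	unsigned int si;

	o->pshard = calloc(toy_nr_shards, sizeof(o->pshard[0]));
	if (!o->pshard)
		return -1;
	o->nr_pshards = toy_nr_shards;	/* capture before the global can change */

	for (si = 0; si < o->nr_pshards; si++) {
		o->pshard[si] = calloc(1, sizeof(int));
		if (!o->pshard[si])
			return -1;	/* caller frees; free(NULL) is a no-op */
	}
	return 0;
}

static void toy_free_pshards(struct shard_owner *o)
{
	unsigned int si;

	if (!o->pshard)
		return;
	/* Walk the captured size, not toy_nr_shards. */
	for (si = 0; si < o->nr_pshards; si++) {
		free(o->pshard[si]);
		toy_shard_frees++;
	}
	free(o->pshard);
	o->pshard = NULL;
}
```

If toy_nr_shards changes between the alloc and the free, the free still
walks exactly the entries that were allocated.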

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      |  8 +++++++
 kernel/sched/ext/internal.h | 18 ++++++++++++++++
 kernel/sched/ext/sub.c      | 42 +++++++++++++++++++++++++++++++++++++
 kernel/sched/ext/sub.h      |  4 ++++
 4 files changed, 72 insertions(+)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index fcb8bf0d2422..c0a3a1ead283 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -4652,6 +4652,8 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		free_pnode(sch->pnode[node]);
 	kfree(sch->pnode);
 
+	scx_free_pshards(sch);
+
 	rhashtable_walk_enter(&sch->dsq_hash, &rht_iter);
 	do {
 		rhashtable_walk_start(&rht_iter);
@@ -6730,6 +6732,12 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		goto err_disable;
 	}
 
+	ret = scx_alloc_pshards(sch);
+	if (ret) {
+		cpus_read_unlock();
+		goto err_disable;
+	}
+
 	if (sch->ops.init) {
 		ret = SCX_CALL_OP_RET(sch, init, NULL);
 		if (ret) {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 7c6f4ed10cde..e79175fab862 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1183,6 +1183,12 @@ struct scx_sched_pnode {
 	struct scx_dispatch_q	global_dsq;
 };
 
+#ifdef CONFIG_EXT_SUB_SCHED
+struct scx_pshard {
+	int			_dummy;		/* until the first real field lands */
+};
+#endif
+
 struct scx_sched {
 	/*
 	 * cpu-form and cid-form ops share field offsets up to .priv (verified
@@ -1230,6 +1236,9 @@ struct scx_sched {
 	 */
 	struct rhashtable	dsq_hash;
 	struct scx_sched_pnode	**pnode;
+#ifdef CONFIG_EXT_SUB_SCHED
+	struct scx_pshard	**pshard;	/* indexed by shard_idx */
+#endif
 	struct scx_sched_pcpu __percpu *pcpu;
 
 	u64			slice_dfl;
@@ -1245,6 +1254,15 @@ struct scx_sched {
 	u32			dsp_max_batch;
 	s32			level;
 
+#ifdef CONFIG_EXT_SUB_SCHED
+	/*
+	 * pshard[] size captured at enable for the async RCU free path -
+	 * scx_nr_cid_shards may be rewritten by a later scx_cid_init() before
+	 * free runs. While sch is active, use the global.
+	 */
+	u32			nr_pshards;
+#endif
+
 	/*
 	 * Updates to the following warned bitfields can race causing RMW issues
 	 * but it doesn't really matter.
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index e94a415ee10a..c87650f26b30 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -82,6 +82,48 @@ void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
 		rcu_assign_pointer(pos->scx_sched, sch);
 }
 
+static void free_pshard(struct scx_pshard *pshard)
+{
+	kfree(pshard);
+}
+
+void scx_free_pshards(struct scx_sched *sch)
+{
+	s32 si;
+
+	if (!sch->pshard)
+		return;
+	for (si = 0; si < sch->nr_pshards; si++)
+		free_pshard(sch->pshard[si]);
+	kfree(sch->pshard);
+}
+
+static struct scx_pshard *alloc_pshard(struct scx_sched *sch, s32 shard_idx, s32 node)
+{
+	return kzalloc_node(sizeof(struct scx_pshard), GFP_KERNEL, node);
+}
+
+s32 scx_alloc_pshards(struct scx_sched *sch)
+{
+	s32 si;
+
+	if (!sch->is_cid_type || !sch->arena_pool)
+		return 0;
+
+	sch->pshard = kzalloc_objs(sch->pshard[0], scx_nr_cid_shards, GFP_KERNEL);
+	if (!sch->pshard)
+		return -ENOMEM;
+
+	sch->nr_pshards = scx_nr_cid_shards;
+
+	for (si = 0; si < scx_nr_cid_shards; si++) {
+		sch->pshard[si] = alloc_pshard(sch, si, scx_shard_node[si]);
+		if (!sch->pshard[si])
+			return -ENOMEM;
+	}
+	return 0;
+}
+
 static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
 
 void drain_descendants(struct scx_sched *sch)
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 460a9fd196dc..9fa6b5c8be23 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -24,6 +24,8 @@ void drain_descendants(struct scx_sched *sch);
 void scx_sub_disable(struct scx_sched *sch);
 void scx_sub_enable_workfn(struct kthread_work *work);
 bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
+void scx_free_pshards(struct scx_sched *sch);
+s32 scx_alloc_pshards(struct scx_sched *sch);
 
 #else	/* CONFIG_EXT_SUB_SCHED */
 
@@ -33,6 +35,8 @@ static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
 static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
 static inline void drain_descendants(struct scx_sched *sch) { }
 static inline void scx_sub_disable(struct scx_sched *sch) { }
+static inline void scx_free_pshards(struct scx_sched *sch) {}
+static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
-- 
2.54.0



* [PATCH sched_ext/for-7.3 13/32] sched_ext: Add scx_cmask_ref for validated arena cmask access
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (11 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 12/32] sched_ext: Add per-shard scx_sched storage scaffolding Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists Tejun Heo
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

kfuncs taking struct scx_cmask * from BPF arena memory have two problems.
The pointer can be any value the BPF prog hands in, and the header (@base
and @nr_cids) can be mutated by the prog concurrently with kernel access.

Add scx_cmask_ref, a validated handle. _init() normalizes the input pointer
into the arena's kern_vm range via scx_arena_to_kaddr() and snapshots
@base/@nr_cids; downstream sizing uses the snapshot, not the live header.
_shard() reads slices, _or() / _copy() write back; all bounded by the
snapshot. No callers yet.
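
The header check and the per-shard word arithmetic can be tried out in
userspace. The sketch below copies the expressions from the patch into
hypothetical toy_*() helpers; it assumes 64-bit words and a non-empty
intersection (isect_base < isect_end) for the word count and masks, and it
expands GENMASK_U64() by hand.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Same test as scx_cmask_ref_init(), inverted to return "valid". Written as
 * a subtraction so that base + nr_cids can never wrap.
 */
static bool toy_hdr_valid(uint32_t base, uint32_t nr_cids, uint32_t npossible)
{
	return base < npossible && nr_cids <= npossible - base;
}

/* Number of 64-bit words touched by the cid range [isect_base, isect_end). */
static uint32_t toy_isect_words(uint32_t isect_base, uint32_t isect_end)
{
	return ((isect_end - 1) / 64) - (isect_base / 64) + 1;
}

/* GENMASK_U64(63, isect_base & 63): keeps bits at or above the first cid. */
static uint64_t toy_head_mask(uint32_t isect_base)
{
	return ~UINT64_C(0) << (isect_base & 63);
}

/* GENMASK_U64((isect_end - 1) & 63, 0): keeps bits at or below the last cid. */
static uint64_t toy_tail_mask(uint32_t isect_end)
{
	return ~UINT64_C(0) >> (63 - ((isect_end - 1) & 63));
}
```

For example, the range [60, 70) spans two words; the head mask keeps bits
60..63 of the first word and the tail mask keeps bits 0..5 of the second.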

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/cid.c   | 130 +++++++++++++++++++++++++++++++++++++++
 kernel/sched/ext/cid.h   |   7 +++
 kernel/sched/ext/types.h |  37 +++++++++++
 3 files changed, 174 insertions(+)

diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
index bd0467e8a8d2..7325ad04c386 100644
--- a/kernel/sched/ext/cid.c
+++ b/kernel/sched/ext/cid.c
@@ -633,6 +633,12 @@ enum cmask_op2 {
 	/* predicates - short-circuit when the per-word result is true */
 	CMASK_OP2_SUBSET,
 	CMASK_OP2_INTERSECTS,
+	/*
+	 * @a is a BPF-arena cmask. Words on @a use READ_ONCE/WRITE_ONCE since
+	 * BPF may read/write concurrently. See scx_cmask_ref_or() / _copy().
+	 */
+	CMASK_OP2_REF_OR,
+	CMASK_OP2_REF_COPY,
 };
 
 static __always_inline bool cmask_op2_is_pred(const enum cmask_op2 op)
@@ -661,6 +667,12 @@ static __always_inline bool cmask_word_op2(u64 *av, const u64 *bp, u64 mask,
 		return (READ_ONCE(*bp) & ~READ_ONCE(*av)) & mask;
 	case CMASK_OP2_INTERSECTS:
 		return (READ_ONCE(*av) & READ_ONCE(*bp)) & mask;
+	case CMASK_OP2_REF_OR:
+		WRITE_ONCE(*av, READ_ONCE(*av) | (READ_ONCE(*bp) & mask));
+		return false;
+	case CMASK_OP2_REF_COPY:
+		WRITE_ONCE(*av, (READ_ONCE(*av) & ~mask) | (READ_ONCE(*bp) & mask));
+		return false;
 	}
 	unreachable();
 }
@@ -891,6 +903,124 @@ static const struct btf_kfunc_id_set scx_kfunc_set_cid = {
 	.set	= &scx_kfunc_ids_cid,
 };
 
+/**
+ * scx_cmask_ref_init - Bind a scx_cmask_ref to a BPF-arena cmask
+ * @sch: scheduler whose arena hosts @src
+ * @src: BPF-supplied cmask pointer
+ * @ref: output ref
+ *
+ * Snapshot @src's @base and @nr_cids. The snapshot is necessary because BPF may
+ * mutate the live header asynchronously.
+ *
+ * Return 0 on success, -EINVAL if the snapshotted header is malformed.
+ */
+int scx_cmask_ref_init(struct scx_sched *sch, const struct scx_cmask *src,
+		       struct scx_cmask_ref *ref)
+{
+	struct scx_cmask *kern_src = scx_arena_to_kaddr(sch, src);
+	u32 base, nr_cids, npossible = num_possible_cpus();
+
+	base = READ_ONCE(kern_src->base);
+	nr_cids = READ_ONCE(kern_src->nr_cids);
+
+	if (unlikely(base >= npossible || nr_cids > npossible - base))
+		return -EINVAL;
+
+	ref->sch = sch;
+	ref->src = kern_src;
+	ref->base = base;
+	ref->nr_cids = nr_cids;
+
+	ref->shard_first = scx_cid_to_shard[base];
+	if (likely(nr_cids))
+		ref->shard_end = scx_cid_to_shard[base + nr_cids - 1] + 1;
+	else
+		ref->shard_end = ref->shard_first;
+
+	return 0;
+}
+
+/**
+ * scx_cmask_ref_shard - Read one shard from @ref into @out
+ * @ref: validated ref
+ * @shard_idx: target shard, in [@ref->shard_first, @ref->shard_end)
+ * @out: output cmask whose @out->alloc_words must hold the shard
+ *
+ * Set @out to the intersection of @ref's range with @shard_idx's cid range,
+ * with bits[] read from @ref->src via READ_ONCE. Empty intersection sets
+ * @out->nr_cids to 0. scx_error()s on @ref's sched if @out can't hold the
+ * shard.
+ */
+void scx_cmask_ref_shard(const struct scx_cmask_ref *ref, s32 shard_idx,
+			 struct scx_cmask *out)
+{
+	const struct scx_cid_shard *shard = &scx_cid_shard_ranges[shard_idx];
+	u32 shard_base = shard->base_cid;
+	u32 shard_end = shard_base + shard->nr_cids;
+	u32 isect_base, isect_end, nr_words, src_off, wi;
+	u64 head_mask, tail_mask;
+
+	isect_base = max(ref->base, shard_base);
+	isect_end = min(ref->base + ref->nr_cids, shard_end);
+
+	if (isect_base >= isect_end) {
+		out->base = shard_base;
+		out->nr_cids = 0;
+		return;
+	}
+
+	nr_words = ((isect_end - 1) / 64) - (isect_base / 64) + 1;
+	if (nr_words > out->alloc_words) {
+		scx_error(ref->sch, "scx_cmask_ref_shard: out alloc_words=%u < %u for shard %d",
+			  out->alloc_words, nr_words, shard_idx);
+		out->base = shard_base;
+		out->nr_cids = 0;
+		return;
+	}
+
+	out->base = isect_base;
+	out->nr_cids = isect_end - isect_base;
+	src_off = (isect_base / 64) - (ref->base / 64);
+
+	for (wi = 0; wi < nr_words; wi++)
+		out->bits[wi] = READ_ONCE(ref->src->bits[src_off + wi]);
+
+	head_mask = GENMASK_U64(63, isect_base & 63);
+	out->bits[0] &= head_mask;
+	tail_mask = GENMASK_U64((isect_end - 1) & 63, 0);
+	out->bits[nr_words - 1] &= tail_mask;
+}
+
+/**
+ * scx_cmask_ref_or - OR @src into the arena cmask referenced by @ref
+ * @ref: validated ref
+ * @src: stable kernel cmask
+ *
+ * Bits inside the intersection of @ref's snapshotted range with @src's range
+ * are OR'd into @ref->src and bits outside are left unchanged. Stores on
+ * @ref->src use WRITE_ONCE since BPF may read/write concurrently.
+ */
+void scx_cmask_ref_or(const struct scx_cmask_ref *ref, const struct scx_cmask *src)
+{
+	cmask_walk_op2(ref->src->bits, ref->base, ref->nr_cids,
+		       src->bits, src->base, src->nr_cids, CMASK_OP2_REF_OR);
+}
+
+/**
+ * scx_cmask_ref_copy - Copy @src into the arena cmask referenced by @ref
+ * @ref: validated ref
+ * @src: stable kernel cmask
+ *
+ * Bits inside the intersection of @ref's snapshotted range with @src's range
+ * take @src's values and bits outside are left unchanged. Stores on @ref->src
+ * use WRITE_ONCE since BPF may read/write concurrently.
+ */
+void scx_cmask_ref_copy(const struct scx_cmask_ref *ref, const struct scx_cmask *src)
+{
+	cmask_walk_op2(ref->src->bits, ref->base, ref->nr_cids,
+		       src->bits, src->base, src->nr_cids, CMASK_OP2_REF_COPY);
+}
+
 int scx_cid_kfunc_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, &scx_kfunc_set_init_cids) ?:
diff --git a/kernel/sched/ext/cid.h b/kernel/sched/ext/cid.h
index cdc18a7a48f5..70d97acd0ac4 100644
--- a/kernel/sched/ext/cid.h
+++ b/kernel/sched/ext/cid.h
@@ -293,4 +293,11 @@ static inline s32 scx_cpu_ret(struct scx_sched *sch, s32 cpu_or_cid)
 	return scx_cid_to_cpu(sch, cpu_or_cid);
 }
 
+int scx_cmask_ref_init(struct scx_sched *sch, const struct scx_cmask *src,
+		       struct scx_cmask_ref *ref);
+void scx_cmask_ref_shard(const struct scx_cmask_ref *ref, s32 shard_idx,
+			 struct scx_cmask *out);
+void scx_cmask_ref_or(const struct scx_cmask_ref *ref, const struct scx_cmask *src);
+void scx_cmask_ref_copy(const struct scx_cmask_ref *ref, const struct scx_cmask *src);
+
 #endif /* _KERNEL_SCHED_EXT_CID_H */
diff --git a/kernel/sched/ext/types.h b/kernel/sched/ext/types.h
index b31d12931999..98a6e072c33e 100644
--- a/kernel/sched/ext/types.h
+++ b/kernel/sched/ext/types.h
@@ -172,4 +172,41 @@ struct scx_cmask {
 #define SCX_CMASK_DEFINE_SHARD(NAME, BASE, NR_CIDS)				\
 	__SCX_CMASK_DEFINE(NAME, BASE, NR_CIDS, SCX_CID_SHARD_MAX_CPUS)
 
+/*
+ * scx_cmask_ref: validated reference to a BPF-arena cmask.
+ *
+ * scx_cmask_ref_init() normalizes the pointer into the arena and snapshots
+ * @base/@nr_cids. The snapshot is what downstream code uses for sizing - the
+ * live header can be mutated concurrently by BPF.
+ *
+ * scx_cmask_ref_shard() reads one shard into a cmask. scx_cmask_ref_or() and
+ * scx_cmask_ref_copy() write back into the referenced arena cmask, bounded by
+ * the snapshot.
+ *
+ * Typical input use:
+ *
+ *	struct scx_cmask_ref ref;
+ *	SCX_CMASK_DEFINE(shard, 0, SCX_CID_SHARD_MAX_CPUS);
+ *	s32 idx, ret;
+ *
+ *	ret = scx_cmask_ref_init(sch, src, &ref);
+ *	if (ret < 0)
+ *		return ret;
+ *
+ *	for (idx = ref.shard_first; idx < ref.shard_end; idx++) {
+ *		scx_cmask_ref_shard(&ref, idx, shard);
+ *		if (!shard->nr_cids)
+ *			continue;
+ *		... use idx and shard ...
+ *	}
+ */
+struct scx_cmask_ref {
+	struct scx_sched	*sch;
+	struct scx_cmask	*src;
+	u32			base;
+	u32			nr_cids;
+	s32			shard_first;
+	s32			shard_end;
+};
+
 #endif /* _KERNEL_SCHED_EXT_TYPES_H */
-- 
2.54.0



* [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (12 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 13/32] sched_ext: Add scx_cmask_ref for validated arena cmask access Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 10:49   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 15/32] sched_ext: Add scx_skip_subtree_pre() Tejun Heo
                   ` (17 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Future kfuncs need to walk descendants without scx_sched_lock. Make the
walker RCU-safe so that they can. A sub-sched's fields are initialized
before it is linked, so a walk that observes a linked node also observes its
setup. In-place changes after linking carry their own ordering.

Switch the children/sibling list ops to RCU and expand the descendant walker
to accept rcu_read_lock as a valid read-side context. Walkers that mutate
keep scx_sched_lock.

A sub-sched can be linked while an ancestor is bypassing, after the bypass
walk that propagates the depth has passed its parent. Take scx_bypass_lock
across linking and inherit the parent's bypass_depth so the new sched starts
out with the ancestor bypass state.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c | 10 +++++++---
 kernel/sched/ext/sub.c | 11 +++++++----
 kernel/sched/ext/sub.h |  4 ++--
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index c0a3a1ead283..1e38aaad4332 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -5468,7 +5468,8 @@ s32 scx_link_sched(struct scx_sched *sch)
 	const char *err_msg = "";
 	s32 ret = 0;
 
-	scoped_guard(raw_spinlock_irq, &scx_sched_lock) {
+	scoped_guard(raw_spinlock_irqsave, &scx_bypass_lock)	/* for bypass inheritance */
+	scoped_guard(raw_spinlock, &scx_sched_lock) {
 #ifdef CONFIG_EXT_SUB_SCHED
 		struct scx_sched *parent = scx_parent(sch);
 
@@ -5492,7 +5493,10 @@ s32 scx_link_sched(struct scx_sched *sch)
 				break;
 			}
 
-			list_add_tail(&sch->sibling, &parent->children);
+			list_add_tail_rcu(&sch->sibling, &parent->children);
+
+			/* inherit the ancestor bypass state */
+			WRITE_ONCE(sch->bypass_depth, READ_ONCE(parent->bypass_depth));
 		}
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
@@ -5519,7 +5523,7 @@ void scx_unlink_sched(struct scx_sched *sch)
 		if (scx_parent(sch)) {
 			rhashtable_remove_fast(&scx_sched_hash, &sch->hash_node,
 					       scx_sched_hash_params);
-			list_del_init(&sch->sibling);
+			list_del_rcu(&sch->sibling);
 		}
 #endif	/* CONFIG_EXT_SUB_SCHED */
 		list_del_rcu(&sch->all);
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index c87650f26b30..066fad0a60b4 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -35,21 +35,24 @@ struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sche
 	struct scx_sched *next;
 
 	lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
-		       lockdep_is_held(&scx_sched_lock));
+		       lockdep_is_held(&scx_sched_lock) ||
+		       rcu_read_lock_any_held());
 
 	/* if first iteration, visit @root */
 	if (!pos)
 		return root;
 
 	/* visit the first child if exists */
-	next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
+	next = list_first_or_null_rcu(&pos->children, struct scx_sched, sibling);
 	if (next)
 		return next;
 
 	/* no child, visit my or the closest ancestor's next sibling */
 	while (pos != root) {
-		if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
-			return list_next_entry(pos, sibling);
+		next = list_next_or_null_rcu(&scx_parent(pos)->children, &pos->sibling,
+					     struct scx_sched, sibling);
+		if (next)
+			return next;
 		pos = scx_parent(pos);
 	}
 
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 9fa6b5c8be23..e936867bc5c5 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -46,8 +46,8 @@ static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
  * @root: sched to walk the descendants of
  *
  * Walk @root's descendants. @root is included in the iteration and the first
- * node to be visited. Must be called with either scx_enable_mutex or
- * scx_sched_lock held.
+ * node to be visited. Must be called with scx_enable_mutex, scx_sched_lock, or
+ * the RCU read lock held.
  */
 #define scx_for_each_descendant_pre(pos, root)					\
 	for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos);		\
-- 
2.54.0



* [PATCH sched_ext/for-7.3 15/32] sched_ext: Add scx_skip_subtree_pre()
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (13 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers Tejun Heo
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Factor the sibling/ancestor portion of scx_next_descendant_pre() out as
scx_skip_subtree_pre(), a pre-order walk primitive that skips @pos's
subtree, and call it from scx_next_descendant_pre(). Same locking rules as
the existing primitive.

A follow-up patch uses it to skip subtrees that have nothing to do during a
descendant walk.
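
The relationship between the two primitives can be sketched in userspace.
This is an illustration only: toy_node uses plain parent / first_child /
next_sibling pointers instead of the kernel's RCU lists, there is no
locking, and all toy_*() names are hypothetical.

```c
#include <stddef.h>

struct toy_node {
	struct toy_node *parent;
	struct toy_node *first_child;
	struct toy_node *next_sibling;
};

/* Continue a pre-order walk past @pos's whole subtree. */
static struct toy_node *toy_skip_subtree_pre(struct toy_node *pos,
					     struct toy_node *root)
{
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;	/* climb and retry at the ancestor */
	}
	return NULL;	/* @pos's subtree was the last one under @root */
}

/* Pre-order step: @root first, then children, then siblings/ancestors. */
static struct toy_node *toy_next_descendant_pre(struct toy_node *pos,
						struct toy_node *root)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;
	return toy_skip_subtree_pre(pos, root);
}
```

A walker that decides a node's subtree is uninteresting would call
toy_skip_subtree_pre() in place of toy_next_descendant_pre() for that step,
so none of the node's descendants are visited.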

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/sub.c | 37 ++++++++++++++++++++++++++++---------
 kernel/sched/ext/sub.h |  2 ++
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 066fad0a60b4..1e84f4620176 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -21,6 +21,33 @@
 
 #ifdef CONFIG_EXT_SUB_SCHED
 
+/**
+ * scx_skip_subtree_pre - Skip @pos's subtree in a pre-order walk
+ * @pos: current position
+ * @root: walk root
+ *
+ * In a walk started by scx_next_descendant_pre(), continue past @pos's subtree:
+ * return @pos's next sibling, or the closest ancestor's next sibling, or NULL
+ * if @pos's subtree is the last under @root. Same locking rules.
+ */
+struct scx_sched *scx_skip_subtree_pre(struct scx_sched *pos, struct scx_sched *root)
+{
+	struct scx_sched *next;
+
+	lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
+		       lockdep_is_held(&scx_sched_lock) ||
+		       rcu_read_lock_any_held());
+
+	while (pos != root) {
+		next = list_next_or_null_rcu(&scx_parent(pos)->children, &pos->sibling,
+					     struct scx_sched, sibling);
+		if (next)
+			return next;
+		pos = scx_parent(pos);
+	}
+	return NULL;
+}
+
 /**
  * scx_next_descendant_pre - find the next descendant for pre-order walk
  * @pos: the current position (%NULL to initiate traversal)
@@ -48,15 +75,7 @@ struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sche
 		return next;
 
 	/* no child, visit my or the closest ancestor's next sibling */
-	while (pos != root) {
-		next = list_next_or_null_rcu(&scx_parent(pos)->children, &pos->sibling,
-					     struct scx_sched, sibling);
-		if (next)
-			return next;
-		pos = scx_parent(pos);
-	}
-
-	return NULL;
+	return scx_skip_subtree_pre(pos, root);
 }
 
 static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index e936867bc5c5..3d5ad9c36d64 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -15,6 +15,7 @@
 
 #ifdef CONFIG_EXT_SUB_SCHED
 
+struct scx_sched *scx_skip_subtree_pre(struct scx_sched *pos, struct scx_sched *root);
 struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root);
 void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch);
 struct cgroup *sch_cgroup(struct scx_sched *sch);
@@ -30,6 +31,7 @@ s32 scx_alloc_pshards(struct scx_sched *sch);
 #else	/* CONFIG_EXT_SUB_SCHED */
 
 static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
+static inline struct scx_sched *scx_skip_subtree_pre(struct scx_sched *pos, struct scx_sched *root) { return NULL; }
 static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
 static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
 static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
-- 
2.54.0



* [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (14 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 15/32] sched_ext: Add scx_skip_subtree_pre() Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 11:17   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 17/32] sched_ext: Add coalescing sub_caps_updated() notifier " Tejun Heo
                   ` (15 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Caps are per-cid permissions parents delegate to direct children via
scx_bpf_sub_grant() / scx_bpf_sub_revoke(). A child's cap set is always a
subset of its parent's. Sub-scheds check their caps locally, and cross-sched
communication is needed only when the delegation set itself changes.
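
The grant rule can be modelled in user space as below. This is only a
sketch, not the kernel code: it assumes a cid space that fits in one
64-bit word, two hypothetical caps (this patch defines only a dummy
cap), and no locking. model_grant() and struct sched_caps are names
made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CAPS 2	/* hypothetical; the patch itself has one dummy cap */

struct sched_caps {
	uint64_t cids[NR_CAPS];	/* bit N set: the cap is held on cid N */
};

/*
 * Grant @caps on @cmask from @parent to @child. A cid is granted only if
 * @parent holds every requested cap on it, so @child can never end up with
 * a cap its parent lacks. Returns the refused cids.
 */
static uint64_t model_grant(const struct sched_caps *parent,
			    struct sched_caps *child,
			    uint64_t caps, uint64_t cmask)
{
	uint64_t granted = cmask;
	int cap;

	/* narrow to cids the parent holds all requested caps on */
	for (cap = 0; cap < NR_CAPS; cap++)
		if (caps & (1ULL << cap))
			granted &= parent->cids[cap];

	/* fold the surviving cids into the child, per requested cap */
	for (cap = 0; cap < NR_CAPS; cap++)
		if (caps & (1ULL << cap))
			child->cids[cap] |= granted;

	return cmask & ~granted;
}
```

With a parent holding cap 0 on cids 0-3 and cap 1 on cids 0-1, asking
for both caps on cids 1-2 grants cid 1 and reports cid 2 as refused.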

Caps will be used to implement sub-sched scheduling on the enqueue path.
Picking a cid for a task at a leaf depends on which cids the leaf is allowed
to use, and resolving that programmatically on every enqueue would mean a
cross-sched round-trip call chain, possibly retrying if the request can't be
granted as-is. The dispatch path is different - it runs as top-down
recursion via scx_bpf_sub_dispatch().

Locking is per shard. cid space is split into shards, and each sub-sched has
its own pshard->lock for each shard. Operations are broken up on shard
boundaries. Different shards never contend. Shards are expected to be
topology-aligned and likely to serve as the locality unit when cids are
allocated to schedulers, so per-shard lock granularity scales naturally with
the allocation pattern.
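
As a rough illustration of breaking an operation up on shard
boundaries, the sketch below splits a cid range into per-shard slices.
It assumes fixed-width shards, which is a simplification: the patch
reads each shard's range from scx_cid_shard_ranges[]. SHARD_CIDS,
struct shard_slice and split_on_shards() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SHARD_CIDS	8	/* hypothetical fixed shard width */

struct shard_slice {
	uint32_t shard;		/* shard index */
	uint32_t base;		/* first cid of the slice */
	uint32_t nr;		/* number of cids in the slice */
};

/*
 * Split [base, base + nr) on shard boundaries. Fills @out, which the caller
 * must size for every shard the range touches, and returns the number of
 * slices. Each slice lies within one shard, so a caller would only need that
 * shard's lock while working on it.
 */
static uint32_t split_on_shards(uint32_t base, uint32_t nr,
				struct shard_slice *out)
{
	uint32_t end = base + nr, n = 0;

	while (base < end) {
		uint32_t shard = base / SHARD_CIDS;
		uint32_t shard_end = (shard + 1) * SHARD_CIDS;
		uint32_t slice_end = end < shard_end ? end : shard_end;

		out[n].shard = shard;
		out[n].base = base;
		out[n].nr = slice_end - base;
		n++;
		base = slice_end;
	}
	return n;
}
```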

This patch adds the framework with a single dummy cap. Real caps land in
later patches.

The enable path is reordered for pshards. scx_arena_pool_init() and
scx_alloc_pshards() now run ahead of scx_link_sched() so the pshards are
allocated before the sched becomes reachable. The arena pool comes first
because scx_alloc_pshards() skips allocation when the pool isn't
initialized. A failing sub-enable also records an scx_error() now, so an
errno-only failure leaves a recorded reason for the disable work.

- scx_bpf_sub_grant(): Per-cid all-or-nothing grant to a direct child.
- scx_bpf_sub_revoke(): Clear caps on @cmask across @child and its subtree.
- scx_bpf_sub_caps(): Lockless snapshot of caps on a cid range.
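
The subtree walk behind revoke can be sketched in user space as below.
This is a simplified model under stated assumptions: one cap, a cid
space of one 64-bit word, plain pointers instead of RCU lists, and no
locking. struct node, model_revoke() and the visited flag are made up
for the illustration. The two walk helpers mirror the shape of
scx_next_descendant_pre() and scx_skip_subtree_pre().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct node {
	struct node *parent, *first_child, *next_sibling;
	uint64_t cids;		/* cids this sched holds the cap on */
	int visited;		/* demo only: set when revoke looks at the node */
};

/* continue a pre-order walk past @pos's subtree */
static struct node *skip_subtree_pre(struct node *pos, struct node *root)
{
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		return root;
	if (pos->first_child)
		return pos->first_child;
	return skip_subtree_pre(pos, root);
}

/*
 * Clear @cmask from @child and its subtree. Caps are subsets down the tree,
 * so a node that held nothing on @cmask has no descendant that does, and
 * its subtree is skipped.
 */
static void model_revoke(struct node *child, uint64_t cmask)
{
	struct node *pos = next_descendant_pre(NULL, child);

	while (pos) {
		int cleared = (pos->cids & cmask) != 0;

		pos->visited = 1;
		pos->cids &= ~cmask;
		pos = cleared ? next_descendant_pre(pos, child)
			      : skip_subtree_pre(pos, child);
	}
}
```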

/sys/kernel/sched_ext/SCHED/caps shows, one line per cap, the cids on
which the scheduler currently holds that cap.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c                   |  77 +++++-
 kernel/sched/ext/internal.h              |  56 +++-
 kernel/sched/ext/sub.c                   | 334 ++++++++++++++++++++++-
 kernel/sched/ext/sub.h                   |   2 +
 tools/sched_ext/include/scx/common.bpf.h |   6 +
 5 files changed, 463 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 1e38aaad4332..26b869c373c7 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -4710,9 +4710,52 @@ static ssize_t scx_attr_events_show(struct kobject *kobj,
 }
 SCX_ATTR(events);
 
+#ifdef CONFIG_EXT_SUB_SCHED
+static const char *scx_cap_names[__SCX_NR_CAPS] = {
+	[__SCX_CAP_DUMMY]	= "dummy",
+};
+
+static ssize_t scx_attr_caps_show(struct kobject *kobj,
+				  struct kobj_attribute *ka, char *buf)
+{
+	struct scx_sched *sch = container_of(kobj, struct scx_sched, kobj);
+	u32 npossible = num_possible_cpus();
+	struct scx_cmask *agg __free(kfree) =
+		kzalloc(struct_size(agg, bits, SCX_CMASK_NR_WORDS(npossible)), GFP_KERNEL);
+	unsigned long *agg_bm __free(bitmap) = bitmap_zalloc(npossible, GFP_KERNEL);
+	ssize_t count = 0;
+	s32 cap, si;
+
+	if (!agg || !agg_bm)
+		return -ENOMEM;
+
+	for (cap = 0; cap < __SCX_NR_CAPS; cap++) {
+		SCX_CMASK_DEFINE(snap, 0, SCX_CID_SHARD_MAX_CPUS);
+
+		scx_cmask_init(agg, 0, npossible);
+		for (si = 0; si < sch->nr_pshards; si++) {
+			struct scx_cmask *cm = &sch->pshard[si]->caps[cap].cmask;
+
+			scx_cmask_reframe(snap, cm->base, cm->nr_cids);
+			scx_cmask_copy(snap, cm);
+			scx_cmask_or(agg, snap);
+		}
+		/* %*pbl takes unsigned long bitmap layout, convert from u64 */
+		bitmap_from_arr64(agg_bm, agg->bits, npossible);
+		count += sysfs_emit_at(buf, count, "%s: %*pbl\n",
+				       scx_cap_names[cap], npossible, agg_bm);
+	}
+	return count;
+}
+SCX_ATTR(caps);
+#endif	/* CONFIG_EXT_SUB_SCHED */
+
 static struct attribute *scx_sched_attrs[] = {
 	&scx_attr_ops.attr,
 	&scx_attr_events.attr,
+#ifdef CONFIG_EXT_SUB_SCHED
+	&scx_attr_caps.attr,
+#endif
 	NULL,
 };
 ATTRIBUTE_GROUPS(scx_sched);
@@ -6711,8 +6754,8 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 	/*
 	 * A cid-form scheduler finalizes its cid layout in ops.init_cids(),
-	 * which may call scx_bpf_cid_override(). Run it before ops.init() so
-	 * the final layout is in effect.
+	 * which may call scx_bpf_cid_override(). Run it before the caps and
+	 * shard state are built so the final layout is in effect.
 	 */
 	if (sch->is_cid_type && sch->ops_cid.init_cids) {
 		ret = SCX_CALL_OP_RET(sch, init_cids, NULL);
@@ -6742,6 +6785,9 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		goto err_disable;
 	}
 
+	scx_init_root_caps(sch);
+
+	/* the cid caps and shards are live now, so ops.init() can query them */
 	if (sch->ops.init) {
 		ret = SCX_CALL_OP_RET(sch, init, NULL);
 		if (ret) {
@@ -7423,7 +7469,7 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
 
 /*
  * cid-form cfi stubs. Stubs whose signatures match the cpu-form (param types
- * identical, only param names differ across structs) are reused; only
+ * identical, only param names differ across structs) are reused. Only
  * set_cmask needs a fresh stub since the second argument type differs.
  */
 static void sched_ext_ops_cid__set_cmask(struct task_struct *p,
@@ -9611,6 +9657,28 @@ __bpf_kfunc struct cgroup *scx_bpf_task_cgroup(struct task_struct *p,
 }
 #endif	/* CONFIG_CGROUP_SCHED */
 
+#ifndef CONFIG_EXT_SUB_SCHED
+__bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
+				  const struct scx_cmask *cmask__ign,
+				  struct scx_cmask *denied_out__ign,
+				  const struct bpf_prog_aux *aux)
+{
+	return -EOPNOTSUPP;
+}
+
+__bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
+				    const struct scx_cmask *cmask__ign,
+				    const struct bpf_prog_aux *aux)
+{
+}
+
+__bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out__ign,
+				 const struct bpf_prog_aux *aux)
+{
+	return -EOPNOTSUPP;
+}
+#endif	/* !CONFIG_EXT_SUB_SCHED */
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_any)
@@ -9655,6 +9723,9 @@ BTF_ID_FLAGS(func, scx_bpf_events)
 #ifdef CONFIG_CGROUP_SCHED
 BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE)
 #endif
+BTF_ID_FLAGS(func, scx_bpf_sub_grant, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_sub_revoke, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_sub_caps, KF_IMPLICIT_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index e79175fab862..0fa1e298220d 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -786,9 +786,9 @@ struct sched_ext_ops {
 	/**
 	 * @init_cids: Finalize the cid layout (cid-form only)
 	 *
-	 * Runs after the default cid layout is built, before ops.init(). A
-	 * cid-form scheduler may call scx_bpf_cid_override() here for a custom
-	 * layout. Ignored for cpu-form schedulers.
+	 * Runs after the default cid layout is built, before caps and shards
+	 * are finalized. A cid-form scheduler may call scx_bpf_cid_override()
+	 * here for a custom layout. Ignored for cpu-form schedulers.
 	 */
 	s32 (*init_cids)(void);
 
@@ -1183,9 +1183,57 @@ struct scx_sched_pnode {
 	struct scx_dispatch_q	global_dsq;
 };
 
+/*
+ * Sub-sched capability delegation.
+ *
+ * Caps are per-cid permissions parents delegate to direct children via
+ * scx_bpf_sub_grant() / scx_bpf_sub_revoke(). A child's cap set is always a
+ * subset of its parent's. A sub-sched checks its caps locally, and cross-sched
+ * communication is needed only when the delegation set itself changes.
+ *
+ * Caps are used to implement sub-sched scheduling on the enqueue path. Picking
+ * a cid for a task at a leaf depends on which cids the leaf is allowed to use.
+ * Resolving that programmatically on every enqueue would mean a cross-sched
+ * round-trip call chain, possibly retrying if the request can't be granted
+ * as-is.
+ *
+ * The dispatch path is different - it runs as top-down recursion via
+ * scx_bpf_sub_dispatch(): a sched's dispatch op invokes a child's dispatch op
+ * on the local rq, and the subtree dispatches in a single pass.
+ *
+ * Locking is per shard. cid space is split into shards, and each sub-sched has
+ * its own pshard->lock for each shard. Operations are broken up on shard
+ * boundaries. Different shards never contend. Shards are expected to be
+ * topology-aligned and likely to serve as the locality unit when cids are
+ * allocated to schedulers, so per-shard lock granularity scales naturally with
+ * the allocation pattern.
+ */
+enum scx_cap_flags {
+	__SCX_CAP_DUMMY			= 0,
+
+	__SCX_NR_CAPS,
+	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
+
+	SCX_CAP_DUMMY			= BIT_U64(__SCX_CAP_DUMMY),
+};
+
 #ifdef CONFIG_EXT_SUB_SCHED
+/* iterate set bits in a u64 cap mask */
+#define scx_for_each_cap_bit(cap_bit, caps)				\
+	for (u64 __caps = (caps);					\
+	     __caps && ((cap_bit) = __ffs64(__caps), true);		\
+	     __caps &= __caps - 1)
+
 struct scx_pshard {
-	int			_dummy;		/* until the first real field lands */
+	raw_spinlock_t		lock;		/* serializes caps */
+	struct scx_sched	*sch;		/* backpointer */
+	/*
+	 * Per-cap cmask, inline via TRAILING_OVERLAP so cmask.bits[] overlaps
+	 * the trailing _bits[] storage. Access as &caps[i].cmask.
+	 */
+	TRAILING_OVERLAP(struct scx_cmask, cmask, bits,
+			 u64 _bits[SCX_CMASK_NR_WORDS(SCX_CID_SHARD_MAX_CPUS)];
+	) caps[__SCX_NR_CAPS];
 };
 #endif
 
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 1e84f4620176..e7259623fa3c 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -122,7 +122,21 @@ void scx_free_pshards(struct scx_sched *sch)
 
 static struct scx_pshard *alloc_pshard(struct scx_sched *sch, s32 shard_idx, s32 node)
 {
-	return kzalloc_node(sizeof(struct scx_pshard), GFP_KERNEL, node);
+	const struct scx_cid_shard *shard = &scx_cid_shard_ranges[shard_idx];
+	struct scx_pshard *pshard;
+	s32 i;
+
+	pshard = kzalloc_node(sizeof(*pshard), GFP_KERNEL, node);
+	if (!pshard)
+		return NULL;
+
+	raw_spin_lock_init(&pshard->lock);
+	pshard->sch = sch;
+
+	for (i = 0; i < __SCX_NR_CAPS; i++)
+		scx_cmask_init(&pshard->caps[i].cmask, shard->base_cid, shard->nr_cids);
+
+	return pshard;
 }
 
 s32 scx_alloc_pshards(struct scx_sched *sch)
@@ -146,6 +160,22 @@ s32 scx_alloc_pshards(struct scx_sched *sch)
 	return 0;
 }
 
+/*
+ * Seed the root's caps fully. Root owns all cids on all caps at enable time.
+ * Children acquire caps via scx_bpf_sub_grant().
+ */
+void scx_init_root_caps(struct scx_sched *sch)
+{
+	s32 si, i;
+
+	for (si = 0; si < sch->nr_pshards; si++) {
+		struct scx_pshard *ps = sch->pshard[si];
+
+		for (i = 0; i < __SCX_NR_CAPS; i++)
+			scx_cmask_fill(&ps->caps[i].cmask);
+	}
+}
+
 static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
 
 void drain_descendants(struct scx_sched *sch)
@@ -425,6 +455,23 @@ void scx_sub_enable_workfn(struct kthread_work *work)
 		goto out_unlock;
 	}
 
+	/*
+	 * Allocate pshard[] before scx_link_sched() publishes @sch into the
+	 * parent's RCU children list. A concurrent revoke walking the tree
+	 * would otherwise dereference sch->pshard[si] while it's still NULL.
+	 * Unlike the root path, the cid shard layout is stable at this point.
+	 *
+	 * scx_alloc_pshards() skips allocation when @sch's arena pool isn't
+	 * initialized, so scx_arena_pool_init() must run first.
+	 */
+	ret = scx_arena_pool_init(sch);
+	if (ret)
+		goto err_disable;
+
+	ret = scx_alloc_pshards(sch);
+	if (ret)
+		goto err_disable;
+
 	ret = scx_link_sched(sch);
 	if (ret)
 		goto err_disable;
@@ -449,10 +496,6 @@ void scx_sub_enable_workfn(struct kthread_work *work)
 		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
 	}
 
-	ret = scx_arena_pool_init(sch);
-	if (ret)
-		goto err_disable;
-
 	ret = scx_set_cmask_scratch_alloc(sch);
 	if (ret)
 		goto err_disable;
@@ -640,6 +683,12 @@ void scx_sub_enable_workfn(struct kthread_work *work)
 	percpu_up_write(&scx_fork_rwsem);
 err_disable:
 	mutex_unlock(&scx_enable_mutex);
+	/*
+	 * Some enable failures only return an errno (e.g. -ENOMEM from an
+	 * allocation) without calling scx_error(). Record it so
+	 * scx_flush_disable_work() runs the disable and ops.exit() fires.
+	 */
+	scx_error(sch, "scx_sub_enable() failed (%d)", ret);
 	scx_flush_disable_work(sch);
 	cmd->ret = 0;
 }
@@ -733,6 +782,281 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *
 				  true);
 }
 
+/* Validate common inputs. On success, *parent_out and *child_out are set. */
+static s32 sub_cap_preamble(u64 cgroup_id, u64 caps, const struct bpf_prog_aux *aux,
+			    struct scx_sched **parent_out, struct scx_sched **child_out)
+{
+	struct scx_sched *parent, *child;
+
+	parent = scx_prog_sched(aux);
+	if (unlikely(!parent))
+		return -ENODEV;
+
+	if (!scx_is_cid_type()) {
+		scx_error(parent, "sub-cap kfuncs require a cid-form scheduler");
+		return -EOPNOTSUPP;
+	}
+
+	child = scx_find_sub_sched(cgroup_id);
+	if (unlikely(!child))
+		return -ENODEV;
+
+	if (unlikely(scx_parent(child) != parent)) {
+		scx_error(parent, "%s: sub-%llu is not a direct child",
+			  parent->cgrp_path, cgroup_id);
+		return -EINVAL;
+	}
+
+	if (unlikely(caps & ~__SCX_CAP_ALL)) {
+		scx_error(parent, "invalid caps 0x%llx", caps);
+		return -EINVAL;
+	}
+
+	*parent_out = parent;
+	*child_out = child;
+	return 0;
+}
+
+/**
+ * scx_bpf_sub_grant - Grant @caps on @cmask__ign's cids to a direct child
+ * @cgroup_id: cgroup id of the direct child sub-sched
+ * @caps: bitmask of SCX_CAP_* to grant
+ * @cmask__ign: cid cmask to grant @caps on (arena pointer)
+ * @denied_out__ign: optional arena cmask accumulating refused cids
+ * @aux: implicit BPF argument
+ *
+ * A cid in @cmask__ign is granted to the child only if the parent holds every
+ * requested cap on it. Refused cids are OR'd into @denied_out__ign when
+ * provided. Refusals outside @denied_out__ign's range are not recorded.
+ *
+ * All-or-nothing keeps the caller-visible result binary per cid, so
+ * @denied_out__ign is one mask to interpret rather than a per-cap matrix.
+ *
+ * Return 0 on full success, -EPERM if any cid was refused, or a negative
+ * errno on other failures.
+ */
+__bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
+				  const struct scx_cmask *cmask__ign,
+				  struct scx_cmask *denied_out__ign,
+				  const struct bpf_prog_aux *aux)
+{
+	struct scx_cmask_ref ref, denied_ref;
+	struct scx_sched *parent, *child;
+	bool any_denied = false;
+	s32 si, ret;
+
+	guard(irqsave)();
+
+	ret = sub_cap_preamble(cgroup_id, caps, aux, &parent, &child);
+	if (ret)
+		return ret;
+
+	ret = scx_cmask_ref_init(parent, cmask__ign, &ref);
+	if (ret) {
+		scx_error(parent, "invalid cmask (%d)", ret);
+		return ret;
+	}
+
+	if (denied_out__ign) {
+		ret = scx_cmask_ref_init(parent, denied_out__ign, &denied_ref);
+		if (ret) {
+			scx_error(parent, "invalid denied_out (%d)", ret);
+			return ret;
+		}
+	}
+
+	/* apply the grant one shard at a time */
+	for (si = ref.shard_first; si < ref.shard_end; si++) {
+		SCX_CMASK_DEFINE_SHARD(slice, 0, SCX_CID_SHARD_MAX_CPUS);
+		struct scx_pshard *pps = parent->pshard[si];
+		struct scx_pshard *cps = child->pshard[si];
+		u32 cap_bit;
+
+		scx_cmask_ref_shard(&ref, si, slice);
+		if (scx_cmask_empty(slice))
+			continue;
+
+		SCX_CMASK_DEFINE_SHARD(granted_cids, slice->base, slice->nr_cids);
+		scx_cmask_copy(granted_cids, slice);
+
+		scoped_guard (raw_spinlock, &pps->lock) {
+			guard(raw_spinlock_nested)(&cps->lock);
+
+			/*
+			 * Narrow granted_cids to cids the parent holds every
+			 * requested cap on. All-or-nothing per cid.
+			 */
+			scx_for_each_cap_bit(cap_bit, caps)
+				scx_cmask_and(granted_cids, &pps->caps[cap_bit].cmask);
+
+			/* fold granted_cids into the child per requested cap */
+			scx_for_each_cap_bit(cap_bit, caps)
+				scx_cmask_or(&cps->caps[cap_bit].cmask, granted_cids);
+		}
+
+		/* record cids that didn't make it through into @denied_out */
+		if (!scx_cmask_subset(slice, granted_cids)) {
+			any_denied = true;
+			if (denied_out__ign) {
+				SCX_CMASK_DEFINE_SHARD(denied, slice->base, slice->nr_cids);
+
+				scx_cmask_copy(denied, slice);
+				scx_cmask_andnot(denied, granted_cids);
+				scx_cmask_ref_or(&denied_ref, denied);
+			}
+		}
+	}
+	return any_denied ? -EPERM : 0;
+}
+
+/**
+ * scx_bpf_sub_revoke - Revoke @caps on @cmask__ign's cids from @child
+ * @cgroup_id: cgroup id of the direct child sub-sched
+ * @caps: bitmask of SCX_CAP_* to revoke
+ * @cmask__ign: cid cmask to revoke @caps on (arena pointer)
+ * @aux: implicit BPF argument
+ *
+ * Clear @caps bits on @cmask__ign from the child named by @cgroup_id and all
+ * its descendants. The origin parent's pshard lock is held across the subtree
+ * walk so a concurrent grant from the origin parent observes the revoked
+ * state.
+ */
+__bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
+				    const struct scx_cmask *cmask__ign,
+				    const struct bpf_prog_aux *aux)
+{
+	struct scx_cmask_ref ref;
+	struct scx_sched *parent, *child, *pos;
+	s32 si, ret;
+
+	guard(irqsave)();
+
+	if (sub_cap_preamble(cgroup_id, caps, aux, &parent, &child))
+		return;
+
+	ret = scx_cmask_ref_init(parent, cmask__ign, &ref);
+	if (ret) {
+		scx_error(parent, "invalid cmask (%d)", ret);
+		return;
+	}
+
+	/* per-shard, walk child's subtree and clear @caps */
+	for (si = ref.shard_first; si < ref.shard_end; si++) {
+		SCX_CMASK_DEFINE_SHARD(slice, 0, SCX_CID_SHARD_MAX_CPUS);
+
+		scx_cmask_ref_shard(&ref, si, slice);
+		if (scx_cmask_empty(slice))
+			continue;
+
+		/*
+		 * Pre-order with subtree skip: a descendant that cleared
+		 * nothing means no descendant of it can hold @caps on these
+		 * cids either.
+		 */
+		guard(raw_spinlock)(&parent->pshard[si]->lock);
+		pos = scx_next_descendant_pre(NULL, child);
+		while (pos) {
+			struct scx_pshard *ps = pos->pshard[si];
+			u64 revoked_caps = 0;
+			u32 cap_bit;
+
+			scoped_guard (raw_spinlock_nested, &ps->lock) {
+				scx_for_each_cap_bit(cap_bit, caps) {
+					struct scx_cmask *cm = &ps->caps[cap_bit].cmask;
+
+					if (!scx_cmask_intersects(cm, slice))
+						continue;
+					scx_cmask_andnot(cm, slice);
+					revoked_caps |= BIT_U64(cap_bit);
+				}
+			}
+
+			if (revoked_caps)
+				pos = scx_next_descendant_pre(pos, child);
+			else
+				pos = scx_skip_subtree_pre(pos, child);
+		}
+	}
+}
+
+/**
+ * scx_bpf_sub_caps - Read self's or a direct child's cap cmasks
+ * @cgroup_id: 0 for self, or a direct child's cgroup id
+ * @caps: one or more SCX_CAP_* bits
+ * @out__ign: arena cmask to receive the union of @caps within its range
+ * @aux: implicit BPF argument
+ *
+ * Read the cap cmasks granted on each cid for self (@cgroup_id 0) or a direct
+ * child - the literal granted set. A sched can read only itself or a direct
+ * child.
+ *
+ * Return 0, -ENODEV if @cgroup_id names no direct child, or -EINVAL on bad
+ * inputs.
+ */
+__bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out__ign,
+				 const struct bpf_prog_aux *aux)
+{
+	struct scx_cmask_ref ref;
+	struct scx_sched *sch, *target;
+	s32 si, ret;
+
+	guard(irqsave)();
+
+	sch = scx_prog_sched(aux);
+	if (unlikely(!sch))
+		return -ENODEV;
+
+	if (!scx_is_cid_type()) {
+		scx_error(sch, "sub-cap kfuncs require a cid-form scheduler");
+		return -EOPNOTSUPP;
+	}
+
+	if (unlikely(caps & ~__SCX_CAP_ALL)) {
+		scx_error(sch, "invalid caps 0x%llx", caps);
+		return -EINVAL;
+	}
+
+	/* @cgroup_id 0 reads self, otherwise a direct child */
+	if (cgroup_id) {
+		target = scx_find_sub_sched(cgroup_id);
+		if (unlikely(!target))
+			return -ENODEV;
+		if (unlikely(scx_parent(target) != sch)) {
+			scx_error(sch, "%s: sub-%llu is not a direct child",
+				  sch->cgrp_path, cgroup_id);
+			return -EINVAL;
+		}
+	} else {
+		target = sch;
+	}
+
+	/*
+	 * The target's caps storage may not be set up yet (e.g. a self-read
+	 * during ops.init_cids()).
+	 */
+	if (unlikely(!target->pshard)) {
+		scx_error(sch, "scx_bpf_sub_caps() called before caps storage is initialized");
+		return -ENODEV;
+	}
+
+	ret = scx_cmask_ref_init(sch, out__ign, &ref);
+	if (ret) {
+		scx_error(sch, "invalid out (%d)", ret);
+		return ret;
+	}
+
+	for (si = ref.shard_first; si < ref.shard_end; si++) {
+		const struct scx_cid_shard *shard = &scx_cid_shard_ranges[si];
+		SCX_CMASK_DEFINE_SHARD(local_out, shard->base_cid, shard->nr_cids);
+		u32 cap_bit;
+
+		scx_for_each_cap_bit(cap_bit, caps)
+			scx_cmask_or(local_out, &target->pshard[si]->caps[cap_bit].cmask);
+		scx_cmask_ref_copy(&ref, local_out);
+	}
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 3d5ad9c36d64..3a913cc56422 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -27,6 +27,7 @@ void scx_sub_enable_workfn(struct kthread_work *work);
 bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
 void scx_free_pshards(struct scx_sched *sch);
 s32 scx_alloc_pshards(struct scx_sched *sch);
+void scx_init_root_caps(struct scx_sched *sch);
 
 #else	/* CONFIG_EXT_SUB_SCHED */
 
@@ -39,6 +40,7 @@ static inline void drain_descendants(struct scx_sched *sch) { }
 static inline void scx_sub_disable(struct scx_sched *sch) { }
 static inline void scx_free_pshards(struct scx_sched *sch) {}
 static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
+static inline void scx_init_root_caps(struct scx_sched *sch) {}
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index e7b3ba491c5e..09c21602b2ed 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -114,6 +114,12 @@ u32 scx_bpf_cidperf_cap(s32 cid) __ksym __weak;
 u32 scx_bpf_cidperf_cur(s32 cid) __ksym __weak;
 void scx_bpf_cidperf_set(s32 cid, u32 perf) __ksym __weak;
 
+/* sub-scheduler cap control, scx_bpf_sub_caps() cgroup_id 0 == self */
+s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps, const struct scx_cmask *cmask,
+		      struct scx_cmask *denied) __ksym __weak;
+void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps, const struct scx_cmask *cmask) __ksym __weak;
+s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out) __ksym __weak;
+
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
  * within bpf_for_each() loops.
-- 
2.54.0



* [PATCH sched_ext/for-7.3 17/32] sched_ext: Add coalescing sub_caps_updated() notifier for sub-schedulers
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (15 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks Tejun Heo
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Wire up ops_cid.sub_caps_updated() to notify sub-scheds of cap changes.

Three constraints shape the design:

  1. Static memory. Deliveries use a fixed-size buffer, both for runtime
     efficiency and so notifications can't be lost under memory pressure.

  2. High-frequency updates. Grant/revoke can mutate caps in bursts, and the
     notifier path must absorb that without amplifying it.

  3. Recursive grant/revoke from the callback. A child receiving a
     notification can call grant/revoke on its own children, which can
     cascade recursively down its subtree.

(1) and (2) lead to coalescing into a fixed payload. Each delivery carries a
single (cmask, caps) pair covering every change since the previous one.
Direction (set vs cleared) isn't encoded as it doesn't fit in the fixed-size
summary. The callback queries scx_bpf_sub_caps() for current state. Only one
delivery is in flight per shard. Further changes fold into the same buffer
and ship as the next callback, so a shard's callbacks fire in order.

(3) leads to deferred delivery. Events accumulate during grant/revoke and
are delivered after the shard lock is released.
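
The coalescing and deferral can be modelled in user space roughly as
follows. This is a single-threaded sketch, not the kernel
implementation: it assumes one shard whose cids fit in a 64-bit word
and uses an in_flight flag where the kernel uses locks and an
in-flight list. record_change(), deliver(), struct cb_log and log_cb()
are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct caps_updated {
	uint64_t cmask;		/* cids changed since the last delivery */
	uint64_t caps;		/* caps changed since the last delivery */
	int in_flight;		/* a delivery is currently running */
};

typedef void (*caps_cb_t)(uint64_t cmask, uint64_t caps, void *arg);

/* called from grant/revoke: fold the change into the fixed-size summary */
static void record_change(struct caps_updated *cu, uint64_t cmask, uint64_t caps)
{
	cu->cmask |= cmask;
	cu->caps |= caps;
}

/*
 * Called after the shard lock is dropped. Ships everything recorded so far
 * as one callback. Changes recorded while the callback runs fold into the
 * same buffer and ship as the next callback. Returns the number of
 * callbacks made.
 */
static int deliver(struct caps_updated *cu, caps_cb_t cb, void *arg)
{
	int nr = 0;

	if (cu->in_flight)	/* the running delivery will pick it up */
		return 0;

	cu->in_flight = 1;
	while (cu->cmask) {
		uint64_t cmask = cu->cmask, caps = cu->caps;

		cu->cmask = 0;
		cu->caps = 0;
		cb(cmask, caps, arg);
		nr++;
	}
	cu->in_flight = 0;
	return nr;
}

/* demo callback: remember how often it ran and what it last saw */
struct cb_log {
	int calls;
	uint64_t last_cmask, last_caps;
};

static void log_cb(uint64_t cmask, uint64_t caps, void *arg)
{
	struct cb_log *log = arg;

	log->calls++;
	log->last_cmask = cmask;
	log->last_caps = caps;
}
```

Two changes recorded back to back reach the callback as a single
(cmask, caps) pair, which is why the callback has to re-read current
state rather than infer it from the notification.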

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      |   8 +-
 kernel/sched/ext/internal.h |  71 ++++++++++++++++
 kernel/sched/ext/sub.c      | 162 ++++++++++++++++++++++++++++++++++--
 3 files changed, 234 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 26b869c373c7..4701346765cd 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -7469,11 +7469,13 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
 
 /*
  * cid-form cfi stubs. Stubs whose signatures match the cpu-form (param types
- * identical, only param names differ across structs) are reused. Only
- * set_cmask needs a fresh stub since the second argument type differs.
+ * identical, only param names differ across structs) are reused. Some need
+ * fresh stubs, set_cmask due to an argument type difference and the sub-sched
+ * notifiers because no cpu-form stub exists to reuse.
  */
 static void sched_ext_ops_cid__set_cmask(struct task_struct *p,
 					 const struct scx_cmask *cmask) {}
+static void sched_ext_ops__sub_caps_updated(const struct scx_cmask *cmask, u64 caps) {}
 
 static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
 	.select_cid		= sched_ext_ops__select_cpu,
@@ -7506,6 +7508,7 @@ static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
 #endif
 	.sub_attach		= sched_ext_ops__sub_attach,
 	.sub_detach		= sched_ext_ops__sub_detach,
+	.sub_caps_updated	= sched_ext_ops__sub_caps_updated,
 	.cid_online		= sched_ext_ops__cpu_online,
 	.cid_offline		= sched_ext_ops__cpu_offline,
 	.init_cids		= sched_ext_ops__init_cids,
@@ -9951,6 +9954,7 @@ static int __init scx_init(void)
 	CID_OFFSET_MATCH(dump_task, dump_task);
 	CID_OFFSET_MATCH(sub_attach, sub_attach);
 	CID_OFFSET_MATCH(sub_detach, sub_detach);
+	CID_OFFSET_MATCH(sub_caps_updated, sub_caps_updated);
 	CID_OFFSET_MATCH(init_cids, init_cids);
 	CID_OFFSET_MATCH(init, init);
 	CID_OFFSET_MATCH(exit, exit);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 0fa1e298220d..fd75005fcc10 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -757,6 +757,25 @@ struct sched_ext_ops {
 	 */
 	void (*sub_detach)(struct scx_sub_detach_args *args);
 
+	/**
+	 * @sub_caps_updated: Caps on this sub-sched's shard changed
+	 * @cmask: cids whose caps changed (cmask->base identifies the shard)
+	 * @caps: SCX_CAP_* that changed
+	 *
+	 * Invoked after grant or revoke modifies caps on a shard. There can be
+	 * only one in-flight invocation per shard. @cmask and @caps coalesce
+	 * all changes since the last delivery. Direction (set vs cleared) isn't
+	 * encoded. Query current state with scx_bpf_sub_caps().
+	 *
+	 * Delivered asynchronously after the change is recorded, and may run
+	 * before it takes effect on any given cpu. Use it to track which caps
+	 * the sub-sched holds and propagate to its own children, not to decide
+	 * if a task can run on a cpu now.
+	 *
+	 * May call scx_bpf_sub_grant() / scx_bpf_sub_revoke() on children.
+	 */
+	void (*sub_caps_updated)(const struct scx_cmask *cmask, u64 caps);
+
 	/*
 	 * All online ops must come before ops.cpu_online().
 	 */
@@ -977,6 +996,7 @@ struct sched_ext_ops_cid {
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 	s32 (*sub_attach)(struct scx_sub_attach_args *args);
 	void (*sub_detach)(struct scx_sub_detach_args *args);
+	void (*sub_caps_updated)(const struct scx_cmask *cmask, u64 caps);
 	void (*cid_online)(s32 cid);
 	void (*cid_offline)(s32 cid);
 	s32 (*init_cids)(void);
@@ -1224,9 +1244,51 @@ enum scx_cap_flags {
 	     __caps && ((cap_bit) = __ffs64(__caps), true);		\
 	     __caps &= __caps - 1)
 
+/*
+ * Sub-cap update notifier.
+ *
+ * ops_cid.sub_caps_updated() notifies sub-scheds when their cap state changes
+ * so they can refresh internal state without polling scx_bpf_sub_caps() per
+ * enqueue.
+ *
+ * Three constraints shape the design:
+ *
+ *   1. Static memory. Deliveries use a fixed-size buffer, both for runtime
+ *      efficiency and so notifications can't be lost under memory pressure.
+ *
+ *   2. High-frequency updates. Grant/revoke can mutate caps in bursts, and the
+ *      notifier path must absorb that without amplifying it.
+ *
+ *   3. Recursive grant/revoke from the callback. A child receiving a
+ *      notification can call grant/revoke on its own children, which can
+ *      cascade recursively down its subtree.
+ *
+ * (1) and (2) lead to coalescing into a fixed payload. Each delivery carries a
+ * single (cmask, caps) pair covering every change since the previous one.
+ * Direction (set vs cleared) isn't encoded as it doesn't fit in the fixed-size
+ * summary. The callback queries scx_bpf_sub_caps() for current state. Only one
+ * delivery is in flight per shard. Further changes fold into the same buffer
+ * and ship as the next callback, so a shard's callbacks fire in order.
+ *
+ * (3) leads to deferred delivery. Events accumulate during grant/revoke and are
+ * delivered after the shard lock is released.
+ */
+struct scx_caps_updated {
+	raw_spinlock_t		lock;
+	u64			caps;
+	struct scx_cmask	*cmask_arena_out;
+	struct list_head	node_in_flight;
+	/* Kernel-side accumulator. Access as &cu->cmask. */
+	TRAILING_OVERLAP(struct scx_cmask, cmask, bits,
+			 u64 _bits[SCX_CMASK_NR_WORDS(SCX_CID_SHARD_MAX_CPUS)];
+	);
+};
+
 struct scx_pshard {
 	raw_spinlock_t		lock;		/* serializes caps */
 	struct scx_sched	*sch;		/* backpointer */
+	struct scx_caps_updated	caps_updated;
+
 	/*
 	 * Per-cap cmask, inline via TRAILING_OVERLAP so cmask.bits[] overlaps
 	 * the trailing _bits[] storage. Access as &caps[i].cmask.
@@ -1234,6 +1296,15 @@ struct scx_pshard {
 	TRAILING_OVERLAP(struct scx_cmask, cmask, bits,
 			 u64 _bits[SCX_CMASK_NR_WORDS(SCX_CID_SHARD_MAX_CPUS)];
 	) caps[__SCX_NR_CAPS];
+
+	/*
+	 * Shard geometry captured at alloc. cmask_arena_out's own header is
+	 * bpf-writable and the live shard range can change before the
+	 * rcu-deferred free, so re-init and size cmask_arena_out from these
+	 * trusted copies instead.
+	 */
+	u32			base;
+	u32			nr_cids;
 };
 #endif
 
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index e7259623fa3c..c821d604ac9d 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -106,6 +106,15 @@ void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
 
 static void free_pshard(struct scx_pshard *pshard)
 {
+	struct scx_caps_updated *cu;
+
+	if (!pshard)
+		return;
+	cu = &pshard->caps_updated;
+	if (cu->cmask_arena_out)
+		scx_arena_free(pshard->sch, cu->cmask_arena_out,
+			       struct_size_t(struct scx_cmask, bits,
+					     SCX_CMASK_NR_WORDS(pshard->nr_cids)));
 	kfree(pshard);
 }
 
@@ -123,7 +132,10 @@ void scx_free_pshards(struct scx_sched *sch)
 static struct scx_pshard *alloc_pshard(struct scx_sched *sch, s32 shard_idx, s32 node)
 {
 	const struct scx_cid_shard *shard = &scx_cid_shard_ranges[shard_idx];
+	size_t cmask_size = struct_size_t(struct scx_cmask, bits,
+					  SCX_CMASK_NR_WORDS(shard->nr_cids));
 	struct scx_pshard *pshard;
+	struct scx_caps_updated *cu;
 	s32 i;
 
 	pshard = kzalloc_node(sizeof(*pshard), GFP_KERNEL, node);
@@ -132,10 +144,25 @@ static struct scx_pshard *alloc_pshard(struct scx_sched *sch, s32 shard_idx, s32
 
 	raw_spin_lock_init(&pshard->lock);
 	pshard->sch = sch;
+	pshard->base = shard->base_cid;
+	pshard->nr_cids = shard->nr_cids;
 
 	for (i = 0; i < __SCX_NR_CAPS; i++)
 		scx_cmask_init(&pshard->caps[i].cmask, shard->base_cid, shard->nr_cids);
 
+	cu = &pshard->caps_updated;
+	raw_spin_lock_init(&cu->lock);
+	INIT_LIST_HEAD(&cu->node_in_flight);
+	__scx_cmask_init(&cu->cmask, shard->base_cid, shard->nr_cids, SCX_CID_SHARD_MAX_CPUS);
+
+	cu->cmask_arena_out = scx_arena_alloc(sch, cmask_size);
+	if (!cu->cmask_arena_out) {
+		free_pshard(pshard);
+		return NULL;
+	}
+
+	scx_cmask_init(cu->cmask_arena_out, shard->base_cid, shard->nr_cids);
+
 	return pshard;
 }
 
@@ -176,6 +203,86 @@ void scx_init_root_caps(struct scx_sched *sch)
 	}
 }
 
+/* record a caps change, see struct scx_caps_updated */
+static void caps_updated_record(struct scx_pshard *ps, const struct scx_cmask *cids, u64 caps,
+				struct list_head *to_deliver)
+{
+	struct scx_caps_updated *cu = &ps->caps_updated;
+
+	guard(raw_spinlock)(&cu->lock);
+	scx_cmask_or(&cu->cmask, cids);
+	cu->caps |= caps;
+	if (list_empty(&cu->node_in_flight))
+		list_add_tail(&cu->node_in_flight, to_deliver);
+}
+
+/* deliver queued caps_updated callbacks, see struct scx_caps_updated */
+static void caps_updated_deliver(struct list_head *to_deliver)
+{
+	struct scx_caps_updated *cu, *tmp;
+
+	list_for_each_entry_safe(cu, tmp, to_deliver, node_in_flight) {
+		struct scx_pshard *ps = container_of(cu, struct scx_pshard, caps_updated);
+		struct scx_sched *sch = ps->sch;
+
+		while (true) {
+			u64 caps = 0;
+
+			/*
+			 * During enable, has_op is set after ops.sub_attach(),
+			 * so !has_op means the op is absent or the sched isn't
+			 * live yet - e.g. caps grant from ops.sub_attach().
+			 * Either way don't consume - leave for
+			 * scx_sub_seed_caps() to deliver once live.
+			 */
+			scoped_guard (raw_spinlock, &cu->lock) {
+				if (cu->caps && SCX_HAS_OP(sch, sub_caps_updated) &&
+				    likely(!READ_ONCE(sch->aborting))) {
+					caps = cu->caps;
+					scx_cmask_init(cu->cmask_arena_out,
+						       ps->base, ps->nr_cids);
+					scx_cmask_copy(cu->cmask_arena_out, &cu->cmask);
+					scx_cmask_clear(&cu->cmask);
+					cu->caps = 0;
+				} else {
+					list_del_init(&cu->node_in_flight);
+				}
+			}
+			if (!caps)
+				break;
+
+			/* caps != 0 only when deliverable (has_op, above) */
+			SCX_CALL_OP(sch, sub_caps_updated, NULL,
+				    scx_kaddr_to_arena(sch, cu->cmask_arena_out),
+				    caps);
+		}
+	}
+}
+
+/*
+ * Deliver caps owed to @sch that couldn't be delivered earlier (e.g. a grant
+ * taken during its sub_attach(), before has_op was set). Called once @sch is
+ * enabled.
+ */
+static void scx_sub_seed_caps(struct scx_sched *sch)
+{
+	LIST_HEAD(to_deliver);
+	s32 si;
+
+	guard(irqsave)();
+
+	for (si = 0; si < sch->nr_pshards; si++) {
+		struct scx_pshard *ps = sch->pshard[si];
+		struct scx_caps_updated *cu = &ps->caps_updated;
+
+		scoped_guard (raw_spinlock, &cu->lock) {
+			if (cu->caps && list_empty(&cu->node_in_flight))
+				list_add_tail(&cu->node_in_flight, &to_deliver);
+		}
+	}
+	caps_updated_deliver(&to_deliver);
+}
+
 static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
 
 void drain_descendants(struct scx_sched *sch)
@@ -645,6 +752,9 @@ void scx_sub_enable_workfn(struct kthread_work *work)
 
 	scx_bypass(sch, false);
 
+	/* @sch is enabled; deliver any caps owed since its sub_attach() */
+	scx_sub_seed_caps(sch);
+
 	pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
 	kobject_uevent(&sch->kobj, KOBJ_ADD);
 	ret = 0;
@@ -843,6 +953,7 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 	struct scx_cmask_ref ref, denied_ref;
 	struct scx_sched *parent, *child;
 	bool any_denied = false;
+	LIST_HEAD(to_deliver);
 	s32 si, ret;
 
 	guard(irqsave)();
@@ -870,6 +981,7 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 		SCX_CMASK_DEFINE_SHARD(slice, 0, SCX_CID_SHARD_MAX_CPUS);
 		struct scx_pshard *pps = parent->pshard[si];
 		struct scx_pshard *cps = child->pshard[si];
+		u64 granted_caps = 0;
 		u32 cap_bit;
 
 		scx_cmask_ref_shard(&ref, si, slice);
@@ -877,6 +989,9 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 			continue;
 
 		SCX_CMASK_DEFINE_SHARD(granted_cids, slice->base, slice->nr_cids);
+		SCX_CMASK_DEFINE_SHARD(changed_cids, slice->base, slice->nr_cids);
+		SCX_CMASK_DEFINE_SHARD(delta, slice->base, slice->nr_cids);
+
 		scx_cmask_copy(granted_cids, slice);
 
 		scoped_guard (raw_spinlock, &pps->lock) {
@@ -889,9 +1004,26 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 			scx_for_each_cap_bit(cap_bit, caps)
 				scx_cmask_and(granted_cids, &pps->caps[cap_bit].cmask);
 
-			/* fold granted_cids into the child per requested cap */
-			scx_for_each_cap_bit(cap_bit, caps)
-				scx_cmask_or(&cps->caps[cap_bit].cmask, granted_cids);
+			/*
+			 * For each requested cap, fold the newly-set cids into
+			 * the child and accumulate the delta.
+			 */
+			scx_for_each_cap_bit(cap_bit, caps) {
+				struct scx_cmask *ccm = &cps->caps[cap_bit].cmask;
+
+				scx_cmask_copy(delta, granted_cids);
+				scx_cmask_andnot(delta, ccm);
+				if (scx_cmask_empty(delta))
+					continue;
+
+				scx_cmask_or(ccm, delta);
+				scx_cmask_or(changed_cids, delta);
+				granted_caps |= BIT_U64(cap_bit);
+			}
+
+			if (granted_caps)
+				caps_updated_record(cps, changed_cids, granted_caps,
+						    &to_deliver);
 		}
 
 		/* record cids that didn't make it through into @denied_out */
@@ -906,6 +1038,9 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 			}
 		}
 	}
+
+	caps_updated_deliver(&to_deliver);
+
 	return any_denied ? -EPERM : 0;
 }
 
@@ -927,6 +1062,7 @@ __bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
 {
 	struct scx_cmask_ref ref;
 	struct scx_sched *parent, *child, *pos;
+	LIST_HEAD(to_deliver);
 	s32 si, ret;
 
 	guard(irqsave)();
@@ -957,18 +1093,32 @@ __bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
 		pos = scx_next_descendant_pre(NULL, child);
 		while (pos) {
 			struct scx_pshard *ps = pos->pshard[si];
+			SCX_CMASK_DEFINE_SHARD(changed_cids, slice->base, slice->nr_cids);
+			SCX_CMASK_DEFINE_SHARD(delta, slice->base, slice->nr_cids);
 			u64 revoked_caps = 0;
 			u32 cap_bit;
 
 			scoped_guard (raw_spinlock_nested, &ps->lock) {
+				/*
+				 * For each cap, clear lost cids and accumulate
+				 * the per-cap diff for notification.
+				 */
 				scx_for_each_cap_bit(cap_bit, caps) {
 					struct scx_cmask *cm = &ps->caps[cap_bit].cmask;
 
-					if (!scx_cmask_intersects(cm, slice))
+					scx_cmask_copy(delta, cm);
+					scx_cmask_and(delta, slice);
+					if (scx_cmask_empty(delta))
 						continue;
-					scx_cmask_andnot(cm, slice);
+
+					scx_cmask_andnot(cm, delta);
+					scx_cmask_or(changed_cids, delta);
 					revoked_caps |= BIT_U64(cap_bit);
 				}
+
+				if (revoked_caps)
+					caps_updated_record(ps, changed_cids, revoked_caps,
+							    &to_deliver);
 			}
 
 			if (revoked_caps)
@@ -977,6 +1127,8 @@ __bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
 				pos = scx_skip_subtree_pre(pos, child);
 		}
 	}
+
+	caps_updated_deliver(&to_deliver);
 }
 
 /**
-- 
2.54.0

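The coalescing scheme described in the struct scx_caps_updated comment of the
patch above can be pictured with a small user-space model. This is only a
sketch under stated assumptions: caps_updated_model, model_record() and
model_consume() are hypothetical names, a single u64 stands in for the cmask,
and locking and the in-flight list are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct scx_caps_updated. */
struct caps_updated_model {
	uint64_t cmask;		/* cids touched since the last delivery */
	uint64_t caps;		/* cap bits touched since the last delivery */
};

/* Fold one change into the pending summary; direction is not recorded. */
static void model_record(struct caps_updated_model *cu, uint64_t cids,
			 uint64_t caps)
{
	cu->cmask |= cids;
	cu->caps |= caps;
}

/*
 * Take the pending summary for one delivery. Returns false when nothing is
 * pending, in which case the outputs are left untouched.
 */
static bool model_consume(struct caps_updated_model *cu, uint64_t *cids_out,
			  uint64_t *caps_out)
{
	if (!cu->caps)
		return false;
	*cids_out = cu->cmask;
	*caps_out = cu->caps;
	cu->cmask = 0;
	cu->caps = 0;
	return true;
}
```

Two model_record() calls followed by one model_consume() yield a single
summary covering both changes, which is why the callback has to query current
state rather than infer set vs cleared from the payload.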


* [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (16 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 17/32] sched_ext: Add coalescing sub_caps_updated() notifier " Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 12:05   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier Tejun Heo
                   ` (13 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Checking a sched's caps on a cid would need to test several cap bits against
caps[] to account for implied caps. Also, caps[] modifications aren't
synchronized against scheduling operations on each cpu, which can lead to
awkward race conditions.

Collect them per cpu instead. caps[] under pshard->lock stays the target
configuration. Add scx_sched_pcpu->ecaps, the transposed effective copy: the
set of cap bits the sched holds on that cpu, readable with a single load. It
is stable under the rq lock and can also be read locklessly with READ_ONCE().
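
As an illustration of the transpose, here is a user-space sketch that mirrors
the shape of calc_effective_caps(). The names MODEL_NR_CAPS,
model_caps_implied(), model_calc_ecaps() and model_has_caps() are
hypothetical, each per-cap cmask is reduced to one u64, and @cid is assumed
to be below 64.

```c
#include <stdint.h>

#define MODEL_NR_CAPS	4	/* hypothetical; stands in for __SCX_NR_CAPS */

/* caps implied by holding @cap; none in this sketch */
static uint64_t model_caps_implied(uint64_t cap)
{
	(void)cap;
	return 0;
}

/*
 * caps[cap_bit] is a bitmask of the cids on which that cap is held. Return
 * the transpose for one cid: the bitmask of caps held on @cid.
 */
static uint64_t model_calc_ecaps(const uint64_t caps[MODEL_NR_CAPS],
				 unsigned int cid)
{
	uint64_t ecaps = 0;
	unsigned int cap_bit;

	for (cap_bit = 0; cap_bit < MODEL_NR_CAPS; cap_bit++)
		if (caps[cap_bit] & (UINT64_C(1) << cid))
			ecaps |= (UINT64_C(1) << cap_bit) |
				 model_caps_implied(UINT64_C(1) << cap_bit);
	return ecaps;
}

/* hot-path style check: one value, one mask, one compare */
static int model_has_caps(uint64_t ecaps, uint64_t wanted)
{
	return (ecaps & wanted) == wanted;
}
```

The per-cid loop runs only when a sync is processed. The check itself then
works on the precomputed word.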

Grant and revoke only mutate caps[]. They queue a sync request on the target
cpu's rq->scx.ecaps_to_sync and kick it, and the cpu recomputes the queued
scheds' ecaps from caps[] in balance_one() under its own rq lock. A dying
sched runs the sync directly to retire its queued request before freeing. As
held references can defer the freeing past the enclosing root scheduler's
lifetime, root enable discards leftover sync requests before going live.
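
The queue-once and kick-on-first-add behavior can be sketched in user space
with C11 atomics. This is a model, not the kernel llist API: mnode, mlist and
the functions below are hypothetical names, and callers of model_queue_sync()
are assumed to be serialized against each other, as pshard->lock does for
queue_sync_ecaps().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mnode {
	struct mnode *next;
};

struct mlist {
	_Atomic(struct mnode *) first;
};

/* An off-list node points at itself, so "on list" is a pointer compare. */
static void mnode_init(struct mnode *n)
{
	n->next = n;
}

static bool mnode_on_list(const struct mnode *n)
{
	return n->next != n;
}

/* Push @n; returns true if the list was empty before the push. */
static bool mlist_add(struct mlist *l, struct mnode *n)
{
	struct mnode *first = atomic_load(&l->first);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&l->first, &first, n));
	return first == NULL;
}

/* Detach the whole list in one exchange; the consumer walks it privately. */
static struct mnode *mlist_del_all(struct mlist *l)
{
	return atomic_exchange(&l->first, NULL);
}

/* Queue @n at most once; returns true if the caller should kick the cpu. */
static bool model_queue_sync(struct mlist *l, struct mnode *n)
{
	if (mnode_on_list(n))
		return false;
	return mlist_add(l, n);
}
```

Only the push that finds the list empty asks for a kick, so a burst of grants
and revokes targeting one cpu costs one kick until the cpu drains the list.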

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      |  10 +++
 kernel/sched/ext/internal.h |  21 +++++-
 kernel/sched/ext/sub.c      | 133 +++++++++++++++++++++++++++++++++++-
 kernel/sched/ext/sub.h      |  16 +++++
 kernel/sched/sched.h        |   3 +
 5 files changed, 180 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4701346765cd..a1b994da9514 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -2600,6 +2600,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	rq->scx.flags |= SCX_RQ_IN_BALANCE;
 	rq->scx.flags &= ~SCX_RQ_BAL_KEEP;
 
+	scx_process_sync_ecaps(rq);
+
 	if ((sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT) &&
 	    unlikely(rq->scx.cpu_released)) {
 		/*
@@ -4632,6 +4634,9 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		 */
 		WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
 
+		/* flush the queued ecaps syncs */
+		scx_discard_ecaps_to_sync(cpu, pcpu);
+
 		/*
 		 * Bypass blocks new kicks. Flush the kick irq_work so this
 		 * pcpu's to_kick_node is off the list before it is freed.
@@ -6376,6 +6381,9 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 		node = cpu_to_node(cpu);
 		pcpu->sch = sch;
 		INIT_LIST_HEAD(&pcpu->deferred_reenq_local.node);
+#ifdef CONFIG_EXT_SUB_SCHED
+		init_llist_node(&pcpu->ecaps_to_sync_node);
+#endif
 		INIT_LIST_HEAD(&pcpu->to_kick_node);
 		if (!zalloc_cpumask_var_node(&pcpu->cpus_to_kick, GFP_KERNEL, node) ||
 		    !zalloc_cpumask_var_node(&pcpu->cpus_to_kick_if_idle, GFP_KERNEL, node) ||
@@ -6720,6 +6728,8 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 		rq->scx.cpuperf_target = SCX_CPUPERF_ONE;
 	}
 
+	scx_discard_stale_ecaps_syncs();
+
 	/*
 	 * Keep CPUs stable during enable so that the BPF scheduler can track
 	 * online CPUs by watching ->on/offline_cpu() after ->init().
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index fd75005fcc10..ed56ac5e458d 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1182,6 +1182,24 @@ struct scx_sched_pcpu {
 	cpumask_var_t		cpus_to_wait;
 	struct list_head	to_kick_node;
 
+#ifdef CONFIG_EXT_SUB_SCHED
+	/*
+	 * pshard->caps[cap_bit] is the set of cids the sched holds that one
+	 * cap on. ecaps is its transpose: the set of SCX_CAP_* bits the sched
+	 * holds on this cpu, collected so that the hot-path check is a single
+	 * read.
+	 *
+	 * While pshard->caps[] under pshard->lock is the target configuration,
+	 * ecaps is the effective copy owned by the cpu. It is written under the
+	 * rq lock while processing rq->scx.ecaps_to_sync. Can also be read
+	 * with READ_ONCE() outside rq lock.
+	 *
+	 * See queue_sync_ecaps() and scx_process_sync_ecaps().
+	 */
+	u64			ecaps;
+	struct llist_node	ecaps_to_sync_node;
+#endif
+
 	/*
 	 * The event counters are in a per-CPU variable to minimize the
 	 * accounting overhead. A system-wide view on the event counter is
@@ -1291,7 +1309,8 @@ struct scx_pshard {
 
 	/*
 	 * Per-cap cmask, inline via TRAILING_OVERLAP so cmask.bits[] overlaps
-	 * the trailing _bits[] storage. Access as &caps[i].cmask.
+	 * the trailing _bits[] storage. Access as &caps[i].cmask. See
+	 * scx_sched_pcpu->ecaps.
 	 */
 	TRAILING_OVERLAP(struct scx_cmask, cmask, bits,
 			 u64 _bits[SCX_CMASK_NR_WORDS(SCX_CID_SHARD_MAX_CPUS)];
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index c821d604ac9d..08d9367cf218 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -283,6 +283,125 @@ static void scx_sub_seed_caps(struct scx_sched *sch)
 	caps_updated_deliver(&to_deliver);
 }
 
+static u64 calc_effective_caps(struct scx_pshard *ps, s32 cid)
+{
+	u64 ecaps = 0;
+	u32 cap_bit;
+
+	for (cap_bit = 0; cap_bit < __SCX_NR_CAPS; cap_bit++)
+		if (scx_cmask_test(cid, &ps->caps[cap_bit].cmask))
+			ecaps |= BIT_U64(cap_bit) | scx_caps_implied(BIT_U64(cap_bit));
+	return ecaps;
+}
+
+/**
+ * queue_sync_ecaps - Queue ecaps update for a (sch, cid) pair
+ * @sch: sched to update
+ * @cid: cid to update
+ *
+ * Queue an ecaps update for @sch's @cid and kick the cpu so that it syncs in
+ * balance_one().
+ */
+static void queue_sync_ecaps(struct scx_sched *sch, s32 cid)
+{
+	s32 cpu = __scx_cid_to_cpu(cid);
+	struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+	/*
+	 * Pairs with smp_mb() in scx_process_sync_ecaps(). Either the check
+	 * below sees the node off the list and queues it, or the in-flight sync
+	 * sees the caps[] update made before this call.
+	 */
+	smp_mb();
+
+	/* @cid's pshard->lock excludes concurrent queueing attempts */
+	if (llist_on_list(&pcpu->ecaps_to_sync_node))
+		return;
+	if (llist_add(&pcpu->ecaps_to_sync_node, &cpu_rq(cpu)->scx.ecaps_to_sync))
+		scx_kick_cpu(scx_root, cpu, 0);
+}
+
+/* discard @rq's queued ecaps syncs */
+static void discard_queued_syncs(struct rq *rq)
+{
+	struct llist_node *pos, *tmp;
+
+	lockdep_assert_rq_held(rq);
+
+	llist_for_each_safe(pos, tmp, llist_del_all(&rq->scx.ecaps_to_sync))
+		init_llist_node(pos);
+}
+
+/**
+ * scx_process_sync_ecaps - Sync this cpu's ecaps to pshard->caps[]
+ * @rq: the cid's cpu rq
+ *
+ * pshard->caps[] is the target configuration. pcpu->ecaps is the effective
+ * transposed copy owned by the cid's cpu and written only here under @rq's
+ * lock.
+ */
+void scx_process_sync_ecaps(struct rq *rq)
+{
+	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
+	s32 shard = scx_cid_to_shard[cid];
+	struct llist_node *batch, *pos, *tmp;
+
+	lockdep_assert_rq_held(rq);
+
+	if (likely(llist_empty(&rq->scx.ecaps_to_sync)))
+		return;
+
+	batch = llist_del_all(&rq->scx.ecaps_to_sync);
+	llist_for_each_safe(pos, tmp, batch) {
+		struct scx_sched_pcpu *pcpu =
+			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
+		struct scx_pshard *ps = pcpu->sch->pshard[shard];
+
+		init_llist_node(pos);
+
+		/* pairs with smp_mb() in queue_sync_ecaps(), see there */
+		smp_mb();
+
+		WRITE_ONCE(pcpu->ecaps, calc_effective_caps(ps, cid));
+	}
+}
+
+/*
+ * @pcpu's sched was unhashed before the grace period, so nothing new queues.
+ * Flush its pending sync so the pcpu can be freed. scx_process_sync_ecaps()
+ * takes nodes off the list before syncing and acquiring the rq lock waits for
+ * any in-flight walk.
+ */
+void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu)
+{
+	scoped_guard (rq_lock_irqsave, cpu_rq(cpu))
+		scx_process_sync_ecaps(cpu_rq(cpu));
+
+	WARN_ON_ONCE(llist_on_list(&pcpu->ecaps_to_sync_node));
+}
+
+/**
+ * scx_discard_stale_ecaps_syncs - Discard ecaps syncs from earlier schedulers
+ *
+ * To be called during root enable before the scheduler goes live. An earlier
+ * root's sub-sched may not have gone through its RCU free path yet (e.g. a
+ * still-open link fd defers it) and can leave queued ecaps syncs behind.
+ * Processing them would decode the dead sched's pshards with the current cid
+ * layout. Discard them instead. The backing scx_sched_pcpu's are still
+ * allocated as the free path drains ecaps_to_sync_node before freeing.
+ */
+void scx_discard_stale_ecaps_syncs(void)
+{
+	s32 cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct rq *rq = cpu_rq(cpu);
+
+		guard(rq_lock_irqsave)(rq);
+		discard_queued_syncs(rq);
+	}
+}
+
 static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
 
 void drain_descendants(struct scx_sched *sch)
@@ -1021,9 +1140,14 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 				granted_caps |= BIT_U64(cap_bit);
 			}
 
-			if (granted_caps)
+			if (granted_caps) {
+				s32 cid;
+
 				caps_updated_record(cps, changed_cids, granted_caps,
 						    &to_deliver);
+				scx_cmask_for_each_cid(cid, changed_cids)
+					queue_sync_ecaps(child, cid);
+			}
 		}
 
 		/* record cids that didn't make it through into @denied_out */
@@ -1116,9 +1240,14 @@ __bpf_kfunc void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps,
 					revoked_caps |= BIT_U64(cap_bit);
 				}
 
-				if (revoked_caps)
+				if (revoked_caps) {
+					s32 cid;
+
 					caps_updated_record(ps, changed_cids, revoked_caps,
 							    &to_deliver);
+					scx_cmask_for_each_cid(cid, changed_cids)
+						queue_sync_ecaps(pos, cid);
+				}
 			}
 
 			if (revoked_caps)
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 3a913cc56422..85cadb62ad93 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -28,6 +28,9 @@ bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
 void scx_free_pshards(struct scx_sched *sch);
 s32 scx_alloc_pshards(struct scx_sched *sch);
 void scx_init_root_caps(struct scx_sched *sch);
+void scx_process_sync_ecaps(struct rq *rq);
+void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu);
+void scx_discard_stale_ecaps_syncs(void);
 
 #else	/* CONFIG_EXT_SUB_SCHED */
 
@@ -41,6 +44,9 @@ static inline void scx_sub_disable(struct scx_sched *sch) { }
 static inline void scx_free_pshards(struct scx_sched *sch) {}
 static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
 static inline void scx_init_root_caps(struct scx_sched *sch) {}
+static inline void scx_process_sync_ecaps(struct rq *rq) {}
+static inline void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu) {}
+static inline void scx_discard_stale_ecaps_syncs(void) {}
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
@@ -57,6 +63,16 @@ static inline void scx_init_root_caps(struct scx_sched *sch) {}
 	for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos);		\
 	     (pos) = scx_next_descendant_pre((pos), (root)))
 
+#ifdef CONFIG_EXT_SUB_SCHED
+
+/* caps implied by holding @cap */
+static inline u64 scx_caps_implied(u64 cap)
+{
+	return 0;
+}
+
+#endif	/* CONFIG_EXT_SUB_SCHED */
+
 /*
  * One user of this function is scx_bpf_dispatch() which can be called
  * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7da25f918382..e05dcdff3ace 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -805,6 +805,9 @@ struct scx_rq {
 	u32			flags;
 	u32			nr_immed;		/* ENQ_IMMED tasks on local_dsq */
 	u64			clock;			/* current per-rq clock -- see scx_bpf_now() */
+#ifdef CONFIG_EXT_SUB_SCHED
+	struct llist_head	ecaps_to_sync;		/* pending ecaps syncs */
+#endif
 	cpumask_var_t		cpus_to_sync;
 	bool			kick_sync_pending;
 	unsigned long		kick_sync;
-- 
2.54.0



* [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (17 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 12:25   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 20/32] sched_ext: Generalize local-DSQ handling to rq-owned DSQs Tejun Heo
                   ` (12 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A sub-scheduler that gains or loses effective caps on a cpu may want to act
on it right away - e.g. place or preempt on a newly usable cpu. The existing
ops.sub_caps_updated() doesn't fit: it is delivered asynchronously with
respect to scheduling operations and can arrive before the per-cpu effective
caps go live.

Add ops.sub_ecaps_updated(cid, before, after), a cid-form callback fired
from scx_process_sync_ecaps() when a sub-sched's effective caps on a cid
change. It runs in dispatch context so the sched can insert, kick or preempt
on the cid directly. @before is the caps as of the last delivery.
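
Since the callback receives both @before and @after, a scheduler can derive
what it gained and what it lost on the cid with two mask operations. The
helpers below are a hypothetical sketch in plain C, not part of this patch
and not BPF code.

```c
#include <stdint.h>

/* caps present in @after but not in @before */
static uint64_t ecaps_gained(uint64_t before, uint64_t after)
{
	return after & ~before;
}

/* caps present in @before but not in @after */
static uint64_t ecaps_lost(uint64_t before, uint64_t after)
{
	return before & ~after;
}
```

A callback could, for instance, start placing tasks on the cid when
ecaps_gained() is non-zero and move its tasks away when ecaps_lost() is
non-zero.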

Cpu hotplug rides the same machinery. Going down zeroes each sched's ecaps
on the cpu's cid, with queued syncs discarded at consumption while the cpu
is inactive. Coming back up queues a sync for every sched. reported_ecaps is
kept across the down/up cycle, so the resync fires the callback only if
ownership actually changed while the cpu was down.
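
The effect of keeping reported_ecaps across a down/up cycle can be modeled in
user space as follows. pcpu_model, model_offline() and model_sync() are
hypothetical names; @target stands in for what caps[] yields for the cpu, and
a counter stands in for the callback.

```c
#include <stdbool.h>
#include <stdint.h>

struct pcpu_model {
	uint64_t target;	/* what the caps[] configuration says */
	uint64_t ecaps;		/* in-effect copy, zero while the cpu is down */
	uint64_t reported;	/* value as of the last callback */
	unsigned int nr_callbacks;
};

/* going down: zero the in-effect caps, keep @reported, no callback */
static void model_offline(struct pcpu_model *p)
{
	p->ecaps = 0;
}

/* sync, also run when the cpu comes back: report only a real change */
static bool model_sync(struct pcpu_model *p)
{
	p->ecaps = p->target;
	if (p->ecaps == p->reported)
		return false;
	p->nr_callbacks++;	/* stands in for ops.sub_ecaps_updated() */
	p->reported = p->ecaps;
	return true;
}
```

An offline followed by a sync with an unchanged @target leaves nr_callbacks
as it was. Changing @target while the cpu is down makes the next sync report
the delta.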

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      |  13 +++-
 kernel/sched/ext/internal.h |  19 +++++-
 kernel/sched/ext/sub.c      | 118 +++++++++++++++++++++++++++++++++---
 kernel/sched/ext/sub.h      |   8 ++-
 4 files changed, 144 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index a1b994da9514..4f0d72658fd8 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -2600,7 +2600,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	rq->scx.flags |= SCX_RQ_IN_BALANCE;
 	rq->scx.flags &= ~SCX_RQ_BAL_KEEP;
 
-	scx_process_sync_ecaps(rq);
+	scx_process_sync_ecaps(rq, prev);
 
 	if ((sch->ops.flags & SCX_OPS_HAS_CPU_PREEMPT) &&
 	    unlikely(rq->scx.cpu_released)) {
@@ -3125,6 +3125,11 @@ static void handle_hotplug(struct rq *rq, bool online)
 	if (scx_enabled())
 		scx_idle_update_selcpu_topology(&sch->ops);
 
+	if (online)
+		scx_online_ecaps(rq);
+	else
+		scx_offline_ecaps(rq);
+
 	if (online && SCX_HAS_OP(sch, cpu_online))
 		SCX_CALL_OP(sch, cpu_online, NULL, scx_cpu_arg(cpu));
 	else if (!online && SCX_HAS_OP(sch, cpu_offline))
@@ -4634,7 +4639,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		 */
 		WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
 
-		/* flush the queued ecaps syncs */
+		/* retire the queued ecaps syncs so the pcpu can be freed */
 		scx_discard_ecaps_to_sync(cpu, pcpu);
 
 		/*
@@ -7486,6 +7491,7 @@ static struct bpf_struct_ops bpf_sched_ext_ops = {
 static void sched_ext_ops_cid__set_cmask(struct task_struct *p,
 					 const struct scx_cmask *cmask) {}
 static void sched_ext_ops__sub_caps_updated(const struct scx_cmask *cmask, u64 caps) {}
+static void sched_ext_ops__sub_ecaps_updated(s32 cid, u64 before, u64 after) {}
 
 static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
 	.select_cid		= sched_ext_ops__select_cpu,
@@ -7519,6 +7525,7 @@ static struct sched_ext_ops_cid __bpf_ops_sched_ext_ops_cid = {
 	.sub_attach		= sched_ext_ops__sub_attach,
 	.sub_detach		= sched_ext_ops__sub_detach,
 	.sub_caps_updated	= sched_ext_ops__sub_caps_updated,
+	.sub_ecaps_updated	= sched_ext_ops__sub_ecaps_updated,
 	.cid_online		= sched_ext_ops__cpu_online,
 	.cid_offline		= sched_ext_ops__cpu_offline,
 	.init_cids		= sched_ext_ops__init_cids,
@@ -9826,6 +9833,7 @@ static const u32 scx_kf_allow_flags[] = {
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 	[SCX_OP_IDX(sub_attach)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(sub_detach)]	= SCX_KF_ALLOW_UNLOCKED,
+	[SCX_OP_IDX(sub_ecaps_updated)]	= SCX_KF_ALLOW_ENQUEUE | SCX_KF_ALLOW_DISPATCH,
 	[SCX_OP_IDX(cpu_online)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(cpu_offline)]	= SCX_KF_ALLOW_UNLOCKED,
 	[SCX_OP_IDX(init_cids)]		= SCX_KF_ALLOW_UNLOCKED | SCX_KF_ALLOW_INIT_CIDS,
@@ -9965,6 +9973,7 @@ static int __init scx_init(void)
 	CID_OFFSET_MATCH(sub_attach, sub_attach);
 	CID_OFFSET_MATCH(sub_detach, sub_detach);
 	CID_OFFSET_MATCH(sub_caps_updated, sub_caps_updated);
+	CID_OFFSET_MATCH(sub_ecaps_updated, sub_ecaps_updated);
 	CID_OFFSET_MATCH(init_cids, init_cids);
 	CID_OFFSET_MATCH(init, init);
 	CID_OFFSET_MATCH(exit, exit);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index ed56ac5e458d..3b4ba9300a22 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -770,12 +770,26 @@ struct sched_ext_ops {
 	 * Delivered asynchronously after the change is recorded, and may run
 	 * before it takes effect on any given cpu. Use it to track which caps
 	 * the sub-sched holds and propagate to its own children, not to decide
-	 * if a task can run on a cpu now.
+	 * if a task can run on a cpu now. sub_ecaps_updated() reports that per
+	 * cpu, once it is in effect.
 	 *
 	 * May call scx_bpf_sub_grant() / scx_bpf_sub_revoke() on children.
 	 */
 	void (*sub_caps_updated)(const struct scx_cmask *cmask, u64 caps);
 
+	/**
+	 * @sub_ecaps_updated: This sub-sched's effective caps on a cid changed
+	 * @cid: the cid whose effective caps changed
+	 * @before: effective caps as of the last delivery
+	 * @after: effective caps now
+	 *
+	 * Invoked when this sub-sched's effective caps on @cid change, once the
+	 * change is in effect on the cpu. Runs in dispatch context with rq lock
+	 * held, and can perform all operations allowed in ops.dispatch()
+	 * including inserting/moving tasks.
+	 */
+	void (*sub_ecaps_updated)(s32 cid, u64 before, u64 after);
+
 	/*
 	 * All online ops must come before ops.cpu_online().
 	 */
@@ -997,6 +1011,7 @@ struct sched_ext_ops_cid {
 	s32 (*sub_attach)(struct scx_sub_attach_args *args);
 	void (*sub_detach)(struct scx_sub_detach_args *args);
 	void (*sub_caps_updated)(const struct scx_cmask *cmask, u64 caps);
+	void (*sub_ecaps_updated)(s32 cid, u64 before, u64 after);
 	void (*cid_online)(s32 cid);
 	void (*cid_offline)(s32 cid);
 	s32 (*init_cids)(void);
@@ -1198,6 +1213,8 @@ struct scx_sched_pcpu {
 	 */
 	u64			ecaps;
 	struct llist_node	ecaps_to_sync_node;
+	/* effective caps as of the last sub_ecaps_updated() delivery */
+	u64			reported_ecaps;
 #endif
 
 	/*
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 08d9367cf218..55437f1d1965 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -13,6 +13,7 @@
  * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
  */
+#include <linux/delay.h>
 #include <linux/rhashtable.h>
 #include "internal.h"
 #include "cid.h"
@@ -335,14 +336,16 @@ static void discard_queued_syncs(struct rq *rq)
 /**
  * scx_process_sync_ecaps - Sync this cpu's ecaps to pshard->caps[]
  * @rq: the cid's cpu rq
+ * @prev: @rq's previous task from the in-progress balance
  *
  * pshard->caps[] is the target configuration. pcpu->ecaps is the effective
  * transposed copy owned by the cid's cpu and written only here under @rq's
  * lock.
  */
-void scx_process_sync_ecaps(struct rq *rq)
+void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 {
-	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
+	s32 cpu = cpu_of(rq);
+	s32 cid = __scx_cpu_to_cid(cpu);
 	s32 shard = scx_cid_to_shard[cid];
 	struct llist_node *batch, *pos, *tmp;
 
@@ -351,33 +354,130 @@ void scx_process_sync_ecaps(struct rq *rq)
 	if (likely(llist_empty(&rq->scx.ecaps_to_sync)))
 		return;
 
+	/*
+	 * ecaps are zeroed while the cpu is inactive and must stay zero.
+	 * Discard queued syncs instead of processing them - the
+	 * scx_online_ecaps() reseed re-syncs every sched on activation.
+	 * cpu_active() clears before the offline zeroing and sets before the
+	 * reseed is queued, so this test can neither miss a racing sync nor
+	 * eat the reseed.
+	 */
+	if (unlikely(!cpu_active(cpu))) {
+		discard_queued_syncs(rq);
+		return;
+	}
+
 	batch = llist_del_all(&rq->scx.ecaps_to_sync);
 	llist_for_each_safe(pos, tmp, batch) {
 		struct scx_sched_pcpu *pcpu =
 			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
 		struct scx_pshard *ps = pcpu->sch->pshard[shard];
+		u64 ecaps;
 
 		init_llist_node(pos);
 
 		/* pairs with smp_mb() in queue_sync_ecaps(), see there */
 		smp_mb();
 
-		WRITE_ONCE(pcpu->ecaps, calc_effective_caps(ps, cid));
+		ecaps = calc_effective_caps(ps, cid);
+		WRITE_ONCE(pcpu->ecaps, ecaps);
+
+		/* tell the sched its effective caps on this cid changed */
+		if (ecaps != pcpu->reported_ecaps &&
+		    SCX_HAS_OP(pcpu->sch, sub_ecaps_updated) &&
+		    !scx_bypassing(pcpu->sch, cpu)) {
+			struct scx_dsp_ctx *dspc = &pcpu->dsp_ctx;
+
+			dspc->rq = rq;
+			/* stash @prev so nested dispatches can access it */
+			rq->scx.sub_dispatch_prev = prev;
+			SCX_CALL_OP(pcpu->sch, sub_ecaps_updated, rq, scx_cpu_arg(cpu),
+				    pcpu->reported_ecaps, ecaps);
+			rq->scx.sub_dispatch_prev = NULL;
+			scx_flush_dispatch_buf(pcpu->sch, rq);
+			pcpu->reported_ecaps = ecaps;
+		}
+	}
+}
+
+/*
+ * A cpu came back. Re-seed each sub-sched's ecaps on the cpu's cid. The sync
+ * recomputes effective caps from the pshard and fires ops.sub_ecaps_updated()
+ * only on a real change since offline.
+ */
+void scx_online_ecaps(struct rq *rq)
+{
+	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
+	s32 shard = scx_cid_to_shard[cid];
+	struct scx_sched *pos;
+
+	guard(rq_lock_irqsave)(rq);
+
+	scx_for_each_descendant_pre(pos, scx_root) {
+		struct scx_pshard *ps;
+
+		/* root holds every cap and never uses ecaps */
+		if (pos == scx_root)
+			continue;
+
+		ps = pos->pshard[shard];
+		guard(raw_spinlock)(&ps->lock);
+		queue_sync_ecaps(pos, cid);
+	}
+}
+
+/*
+ * A cpu is going down. Zero each sub-sched's in-effect ecaps so cap checks
+ * treat the cpu as capless while offline. Pending and late-queued syncs are
+ * discarded at consumption by scx_process_sync_ecaps() while the cpu is
+ * inactive. Leave reported_ecaps. Ownership is unchanged, so the
+ * scx_online_ecaps() reseed reports only a genuine delta. No callback fires
+ * here.
+ */
+void scx_offline_ecaps(struct rq *rq)
+{
+	s32 cpu = cpu_of(rq);
+	struct scx_sched *pos;
+
+	guard(rq_lock_irqsave)(rq);
+
+	scx_for_each_descendant_pre(pos, scx_root) {
+		/* root holds every cap and never uses ecaps */
+		if (pos == scx_root)
+			continue;
+
+		WRITE_ONCE(per_cpu_ptr(pos->pcpu, cpu)->ecaps, 0);
 	}
 }
 
 /*
  * @pcpu's sched was unhashed before the grace period, so nothing new queues.
- * Flush its pending sync so the pcpu can be freed. scx_process_sync_ecaps()
- * takes nodes off the list before syncing and acquiring the rq lock waits for
- * any in-flight walk.
+ * Flush its pending sync so the pcpu can be freed. If the cpu is online and
+ * scx is enabled, drain via balance_one(). Otherwise, discard under the rq
+ * lock.
  */
 void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu)
 {
-	scoped_guard (rq_lock_irqsave, cpu_rq(cpu))
-		scx_process_sync_ecaps(cpu_rq(cpu));
+	struct rq *rq = cpu_rq(cpu);
 
-	WARN_ON_ONCE(llist_on_list(&pcpu->ecaps_to_sync_node));
+	while (true) {
+		scoped_guard (rq_lock_irqsave, rq) {
+			/*
+			 * scx_process_sync_ecaps() takes the node off the list
+			 * before it is done accessing @pcpu but does all of it
+			 * under the rq lock. Off-list observed under the rq
+			 * lock guarantees that the sync is complete.
+			 */
+			if (!llist_on_list(&pcpu->ecaps_to_sync_node))
+				return;
+			if (!scx_enabled() || !scx_rq_online(rq)) {
+				discard_queued_syncs(rq);
+				return;
+			}
+		}
+		resched_cpu(cpu);
+		msleep(1);
+	}
 }
 
 /**
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 85cadb62ad93..1f0cef59302c 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -28,7 +28,9 @@ bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
 void scx_free_pshards(struct scx_sched *sch);
 s32 scx_alloc_pshards(struct scx_sched *sch);
 void scx_init_root_caps(struct scx_sched *sch);
-void scx_process_sync_ecaps(struct rq *rq);
+void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev);
+void scx_online_ecaps(struct rq *rq);
+void scx_offline_ecaps(struct rq *rq);
 void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu);
 void scx_discard_stale_ecaps_syncs(void);
 
@@ -44,7 +46,9 @@ static inline void scx_sub_disable(struct scx_sched *sch) { }
 static inline void scx_free_pshards(struct scx_sched *sch) {}
 static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
 static inline void scx_init_root_caps(struct scx_sched *sch) {}
-static inline void scx_process_sync_ecaps(struct rq *rq) {}
+static inline void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev) {}
+static inline void scx_online_ecaps(struct rq *rq) {}
+static inline void scx_offline_ecaps(struct rq *rq) {}
 static inline void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu) {}
 static inline void scx_discard_stale_ecaps_syncs(void) {}
 
-- 
2.54.0



* [PATCH sched_ext/for-7.3 20/32] sched_ext: Generalize local-DSQ handling to rq-owned DSQs
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (18 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches Tejun Heo
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The local DSQ is synchronized by the containing rq lock rather than its own
dsq->lock. A later patch adds a second such DSQ. In preparation, factor the
"rq owns the lock" test into dsq_is_rq_owned() and rename
local_dsq_post_enq() to rq_owned_post_enq(), taking @rq explicitly.

No behavior change.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4f0d72658fd8..7d8846cea425 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -96,6 +96,17 @@ static atomic_long_t scx_hotplug_seq = ATOMIC_LONG_INIT(0);
 /* Global cursor for the per-CPU tid allocator. Starts at 1; tid 0 is reserved. */
 static atomic64_t scx_tid_cursor = ATOMIC64_INIT(1);
 
+/* is @dsq synchronized by the containing rq lock instead of dsq->lock? */
+static bool dsq_is_rq_owned(struct scx_dispatch_q *dsq)
+{
+	switch (dsq->id) {
+	case SCX_DSQ_LOCAL:
+		return true;
+	default:
+		return false;
+	}
+}
+
 #ifdef CONFIG_EXT_SUB_SCHED
 /*
  * The sub sched being enabled. Used by scx_disable_and_exit_task() to exit
@@ -1263,11 +1274,10 @@ static void call_task_dequeue(struct scx_sched *sch, struct rq *rq,
 	p->scx.flags &= ~SCX_TASK_IN_CUSTODY;
 }
 
-static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
-			       struct task_struct *p, u64 enq_flags)
+static void rq_owned_post_enq(struct scx_sched *sch, struct rq *rq,
+			      struct scx_dispatch_q *dsq, struct task_struct *p,
+			      u64 enq_flags)
 {
-	struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
-
 	call_task_dequeue(sch, rq, p, 0);
 
 	/*
@@ -1327,13 +1337,13 @@ static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 				 struct scx_dispatch_q *dsq, struct task_struct *p,
 				 u64 enq_flags)
 {
-	bool is_local = dsq->id == SCX_DSQ_LOCAL;
+	bool is_rq_owned = dsq_is_rq_owned(dsq);
 
 	WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
 	WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
 		     !RB_EMPTY_NODE(&p->scx.dsq_priq));
 
-	if (!is_local) {
+	if (!is_rq_owned) {
 		raw_spin_lock_nested(&dsq->lock,
 			(enq_flags & SCX_ENQ_NESTED) ? SINGLE_DEPTH_NESTING : 0);
 
@@ -1428,8 +1438,8 @@ static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 	 * ops_state first, both sides would modify p->scx.flags
 	 * concurrently in a non-atomic way.
 	 */
-	if (is_local) {
-		local_dsq_post_enq(sch, dsq, p, enq_flags);
+	if (is_rq_owned) {
+		rq_owned_post_enq(sch, rq, dsq, p, enq_flags);
 	} else {
 		/*
 		 * Task on global/bypass DSQ: leave custody, task on
@@ -1476,7 +1486,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
 static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
 {
 	struct scx_dispatch_q *dsq = p->scx.dsq;
-	bool is_local = dsq == &rq->scx.local_dsq;
+	bool is_rq_owned = dsq && dsq_is_rq_owned(dsq);
 
 	lockdep_assert_rq_held(rq);
 
@@ -1500,7 +1510,7 @@ static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
 		return;
 	}
 
-	if (!is_local)
+	if (!is_rq_owned)
 		raw_spin_lock(&dsq->lock);
 
 	/*
@@ -1522,7 +1532,7 @@ static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
 	}
 	p->scx.dsq = NULL;
 
-	if (!is_local)
+	if (!is_rq_owned)
 		raw_spin_unlock(&dsq->lock);
 }
 
@@ -2074,7 +2084,7 @@ static void move_local_task_to_local_dsq(struct scx_sched *sch,
 	dsq_inc_nr(dst_dsq, p, enq_flags);
 	p->scx.dsq = dst_dsq;
 
-	local_dsq_post_enq(sch, dst_dsq, p, enq_flags);
+	rq_owned_post_enq(sch, dst_rq, dst_dsq, p, enq_flags);
 }
 
 /**
-- 
2.54.0



* [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (19 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 20/32] sched_ext: Generalize local-DSQ handling to rq-owned DSQs Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 12:57   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 22/32] sched_ext: Add the SCX_CAP_ENQ_IMMED cap Tejun Heo
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

When a sub-scheduler dispatches a task to a CPU on which it lacks the
required capability, the task must be rejected rather than allowed to run.

Add the machinery for that. Each rq gets a reject DSQ, a kernel-internal
holding queue that is never run and that the BPF scheduler cannot reach. An
insert that must be refused is diverted there instead of the local DSQ, and
a deferred requeue then hands the parked tasks back to the BPF scheduler to
re-decide. A cap revoke extends this to already-queued tasks. When the
revoke reaches the cpu's effective caps, the cpu scans its local DSQ and
reenqueues the tasks that no longer qualify.

A migration-disabled task must run on its cpu, so a capless one is admitted
anyway and counted in the new SCX_EV_SUB_FORCED_ADMIT event.

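This is preparation for the actual sub-sched cap enforcement. The divert is
wired up but inert here: scx_caps_for_enq() and scx_caps_for_task() return
0 and SCX_CAPS_REENQ_ON_LOSS is empty, so nothing is rejected yet.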

Signed-off-by: Tejun Heo <tj@kernel.org>
---
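Note for readers: the admit/reject decision described above can be modelled
in userspace roughly as below. This is only a sketch under assumptions. All
model_* names and types are hypothetical stand-ins, not kernel definitions;
the logic follows scx_local_or_reject_dsq() and scx_missing_caps() in this
patch but leaves out locking and the IMMED flag stripping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum model_dsq { MODEL_DSQ_LOCAL, MODEL_DSQ_REJECT };

struct model_task {
	bool migration_disabled;	/* models is_migration_disabled(p) */
	uint64_t reenq_reason_caps;	/* caps that were needed but missing */
	int32_t reenq_reason_cid;	/* cid the insert targeted */
};

struct model_sched {
	unsigned int level;		/* 0 == root, which holds every cap */
	uint64_t ecaps;			/* effective caps on the target cid */
	uint64_t forced_admits;		/* models SCX_EV_SUB_FORCED_ADMIT */
};

/* Caps in @needed that @sch lacks; mirrors "needed & ~ecaps". */
static uint64_t model_missing_caps(const struct model_sched *sch,
				   uint64_t needed)
{
	if (sch->level == 0)
		return 0;
	return needed & ~sch->ecaps;
}

/* Pick the local or the reject queue for an insert of @p on @cid. */
static enum model_dsq model_local_or_reject(struct model_sched *sch,
					    struct model_task *p,
					    int32_t cid, uint64_t needed)
{
	uint64_t missing;

	assert(sch && p);
	missing = model_missing_caps(sch, needed);

	/* requirements met */
	if (!missing)
		return MODEL_DSQ_LOCAL;

	/* a pinned task has nowhere else to run: admit it and count it */
	if (p->migration_disabled) {
		sch->forced_admits++;
		return MODEL_DSQ_LOCAL;
	}

	/* record why, so the later reenqueue can tell the scheduler */
	p->reenq_reason_caps = missing;
	p->reenq_reason_cid = cid;
	return MODEL_DSQ_REJECT;
}
```

The order of the checks matters: the forced-admit path is only reached when
caps are actually missing, so the counter tracks violations rather than every
insert of a migration-disabled task.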
 include/linux/sched/ext.h   |  15 ++++-
 kernel/sched/ext/ext.c      |  42 ++++++++++--
 kernel/sched/ext/internal.h |  19 +++++-
 kernel/sched/ext/sub.c      | 125 +++++++++++++++++++++++++++++++++++-
 kernel/sched/ext/sub.h      |  49 ++++++++++++++
 kernel/sched/sched.h        |   3 +
 6 files changed, 242 insertions(+), 11 deletions(-)
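
Note for readers: the reenqueue-reason field grows from 2 to 3 bits because
the new CAP reason has value 4, which does not fit in two bits. The sketch
below works through that arithmetic in userspace. The MODEL_* and model_*
names are hypothetical; the shift, width and reason values are taken from the
enum scx_ent_flags hunk in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the reason field layout from the patch. */
#define MODEL_REENQ_REASON_SHIFT	12
#define MODEL_REENQ_REASON_BITS		3
#define MODEL_REENQ_REASON_MASK \
	(((UINT32_C(1) << MODEL_REENQ_REASON_BITS) - 1) << MODEL_REENQ_REASON_SHIFT)

enum model_reenq_reason {
	MODEL_REENQ_NONE	= 0,
	MODEL_REENQ_KFUNC	= 1,
	MODEL_REENQ_IMMED	= 2,
	MODEL_REENQ_PREEMPTED	= 3,
	MODEL_REENQ_CAP		= 4,	/* needs the third bit */
};

/* Replace the reason field in @flags with @reason, keeping other bits. */
static uint32_t model_set_reason(uint32_t flags, enum model_reenq_reason reason)
{
	uint32_t field = (uint32_t)reason << MODEL_REENQ_REASON_SHIFT;

	/* the encoded reason must fit inside the field */
	assert((field & ~MODEL_REENQ_REASON_MASK) == 0);

	return (flags & ~MODEL_REENQ_REASON_MASK) | field;
}

/* Extract the reason field from @flags. */
static enum model_reenq_reason model_get_reason(uint32_t flags)
{
	return (enum model_reenq_reason)
		((flags & MODEL_REENQ_REASON_MASK) >> MODEL_REENQ_REASON_SHIFT);
}
```

With the old 2-bit width the mask would be 0x3000, and 4 << 12 == 0x4000
falls entirely outside it, so the CAP reason would be lost on readback.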

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 75cb8b119fb7..7e3f6b33f4a8 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -58,6 +58,7 @@ enum scx_dsq_id_flags {
 	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
 	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
 	SCX_DSQ_BYPASS		= SCX_DSQ_FLAG_BUILTIN | 3,
+	SCX_DSQ_REJECT		= SCX_DSQ_FLAG_BUILTIN | 4,	/* internal - see find_dsq_for_dispatch() */
 	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
 	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
 };
@@ -124,7 +125,7 @@ enum scx_ent_flags {
 	SCX_TASK_DEAD		= 5 << SCX_TASK_STATE_SHIFT,
 
 	/*
-	 * Bits 12 and 13 are used to carry reenqueue reason. In addition to
+	 * Bits 12 to 14 are used to carry reenqueue reason. In addition to
 	 * %SCX_ENQ_REENQ flag, ops.enqueue() can also test for
 	 * %SCX_TASK_REENQ_REASON_NONE to distinguish reenqueues.
 	 *
@@ -132,15 +133,17 @@ enum scx_ent_flags {
 	 * KFUNC	reenqueued by scx_bpf_dsq_reenq() and friends
 	 * IMMED	reenqueued due to failed ENQ_IMMED
 	 * PREEMPTED	preempted while running
+	 * CAP		sub-sched cap miss, see p->scx.reenq_reason_*
 	 */
 	SCX_TASK_REENQ_REASON_SHIFT = 12,
-	SCX_TASK_REENQ_REASON_BITS = 2,
+	SCX_TASK_REENQ_REASON_BITS = 3,
 	SCX_TASK_REENQ_REASON_MASK = ((1 << SCX_TASK_REENQ_REASON_BITS) - 1) << SCX_TASK_REENQ_REASON_SHIFT,
 
 	SCX_TASK_REENQ_NONE	= 0 << SCX_TASK_REENQ_REASON_SHIFT,
 	SCX_TASK_REENQ_KFUNC	= 1 << SCX_TASK_REENQ_REASON_SHIFT,
 	SCX_TASK_REENQ_IMMED	= 2 << SCX_TASK_REENQ_REASON_SHIFT,
 	SCX_TASK_REENQ_PREEMPTED = 3 << SCX_TASK_REENQ_REASON_SHIFT,
+	SCX_TASK_REENQ_CAP	= 4 << SCX_TASK_REENQ_REASON_SHIFT,
 
 	/* iteration cursor, not a task */
 	SCX_TASK_CURSOR		= 1 << 31,
@@ -239,6 +242,14 @@ struct sched_ext_entity {
 	 */
 	u64			dsq_vtime;
 
+	/*
+	 * Sub-sched cap rejected reenq context, valid only while
+	 * %SCX_TASK_REENQ_CAP is set. @reenq_reason_caps is the SCX_CAP_* bits
+	 * that were needed but missing. @reenq_reason_cid is the target cid.
+	 */
+	u64			reenq_reason_caps;
+	s32			reenq_reason_cid;
+
 	/*
 	 * If set, reject future sched_setscheduler(2) calls updating the policy
 	 * to %SCHED_EXT with -%EACCES.
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 7d8846cea425..b6d68a80a04f 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -101,6 +101,7 @@ static bool dsq_is_rq_owned(struct scx_dispatch_q *dsq)
 {
 	switch (dsq->id) {
 	case SCX_DSQ_LOCAL:
+	case SCX_DSQ_REJECT:
 		return true;
 	default:
 		return false;
@@ -1280,6 +1281,12 @@ static void rq_owned_post_enq(struct scx_sched *sch, struct rq *rq,
 {
 	call_task_dequeue(sch, rq, p, 0);
 
+	/* rejected: kick the deferred reenq, skip wakeup/preemption */
+	if (unlikely(dsq->id == SCX_DSQ_REJECT)) {
+		schedule_deferred_locked(rq);
+		return;
+	}
+
 	/*
 	 * Note that @rq's lock may be dropped between this enqueue and @p
 	 * actually getting on CPU. This gives higher-class tasks (e.g. RT)
@@ -1337,7 +1344,12 @@ static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
 				 struct scx_dispatch_q *dsq, struct task_struct *p,
 				 u64 enq_flags)
 {
-	bool is_rq_owned = dsq_is_rq_owned(dsq);
+	bool is_rq_owned = false;
+
+	if (dsq->id == SCX_DSQ_LOCAL) {
+		dsq = scx_local_or_reject_dsq(sch, rq, p, &enq_flags);
+		is_rq_owned = true;
+	}
 
 	WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_list.node));
 	WARN_ON_ONCE((p->scx.dsq_flags & SCX_TASK_DSQ_ON_PRIQ) ||
@@ -1483,7 +1495,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
 	}
 }
 
-static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
+void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
 {
 	struct scx_dispatch_q *dsq = p->scx.dsq;
 	bool is_rq_owned = dsq && dsq_is_rq_owned(dsq);
@@ -1573,6 +1585,10 @@ static struct scx_dispatch_q *find_dsq_for_dispatch(struct scx_sched *sch,
 	else
 		dsq = find_user_dsq(sch, dsq_id);
 
+	/*
+	 * Built-in DSQs are never inserted into dsq_hash, so REJECT hits the
+	 * error below. It cannot be reached with an ID.
+	 */
 	if (unlikely(!dsq)) {
 		scx_error(sch, "non-existent DSQ 0x%llx", dsq_id);
 		return find_global_dsq(sch, tcpu);
@@ -1698,8 +1714,8 @@ bool scx_rq_online(struct rq *rq)
 	return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq)));
 }
 
-static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
-				int sticky_cpu)
+void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+			 int sticky_cpu)
 {
 	struct scx_sched *sch = scx_task_sched(p);
 	struct task_struct **ddsp_taskp;
@@ -2068,7 +2084,7 @@ static void move_local_task_to_local_dsq(struct scx_sched *sch,
 					 struct scx_dispatch_q *src_dsq,
 					 struct rq *dst_rq)
 {
-	struct scx_dispatch_q *dst_dsq = &dst_rq->scx.local_dsq;
+	struct scx_dispatch_q *dst_dsq = scx_local_or_reject_dsq(sch, dst_rq, p, &enq_flags);
 
 	/* @dsq is locked and @p is on @dst_rq */
 	lockdep_assert_held(&src_dsq->lock);
@@ -3786,7 +3802,8 @@ static void process_ddsp_deferred_locals(struct rq *rq)
  * another reenq cycle. Repetitions are bounded by %SCX_REENQ_LOCAL_MAX_REPEAT
  * in process_deferred_reenq_locals().
  */
-static bool local_task_should_reenq(struct task_struct *p, u64 *reenq_flags, u32 *reason)
+static bool local_task_should_reenq(struct rq *rq, struct task_struct *p,
+				    u64 *reenq_flags, u32 *reason)
 {
 	bool first;
 
@@ -3802,6 +3819,12 @@ static bool local_task_should_reenq(struct task_struct *p, u64 *reenq_flags, u32
 		return true;
 	}
 
+	if ((*reenq_flags & SCX_REENQ_CAP_REVOKE) &&
+	    scx_task_reenq_on_cap_revoke(rq, p)) {
+		*reason = SCX_TASK_REENQ_CAP;
+		return true;
+	}
+
 	return *reenq_flags & SCX_REENQ_ANY;
 }
 
@@ -3845,7 +3868,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 		if (!scx_is_descendant(task_sch, sch))
 			continue;
 
-		if (!local_task_should_reenq(p, &reenq_flags, &reason))
+		if (!local_task_should_reenq(rq, p, &reenq_flags, &reason))
 			continue;
 
 		scx_dispatch_dequeue(rq, p);
@@ -4041,6 +4064,8 @@ static void run_deferred(struct rq *rq)
 
 	if (!list_empty(&rq->scx.deferred_reenq_users))
 		process_deferred_reenq_users(rq);
+
+	scx_reenq_reject(rq);
 }
 
 #ifdef CONFIG_NO_HZ_FULL
@@ -7839,6 +7864,9 @@ void __init init_sched_ext_class(void)
 
 		/* local_dsq's sch will be set during scx_root_enable() */
 		BUG_ON(init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL, NULL));
+#ifdef CONFIG_EXT_SUB_SCHED
+		BUG_ON(init_dsq(&rq->scx.reject_dsq, SCX_DSQ_REJECT, NULL));
+#endif
 
 		INIT_LIST_HEAD(&rq->scx.runnable_list);
 		INIT_LIST_HEAD(&rq->scx.ddsp_deferred_locals);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 3b4ba9300a22..ef6b4d0f7dee 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1135,6 +1135,13 @@ struct scx_event_stats {
 	 * from sub_bypass_dsq's.
 	 */
 	s64		SCX_EV_SUB_BYPASS_DISPATCH;
+
+	/*
+	 * The number of times a migration-disabled task lacking the cap for its
+	 * cid was allowed onto the local DSQ. It must run on its pinned CPU, so
+	 * it can't be rejected. The violation is counted here.
+	 */
+	s64		SCX_EV_SUB_FORCED_ADMIT;
 };
 
 #define SCX_EVENTS_LIST(SCX_EVENT)					\
@@ -1150,7 +1157,8 @@ struct scx_event_stats {
 	SCX_EVENT(SCX_EV_BYPASS_DISPATCH);				\
 	SCX_EVENT(SCX_EV_BYPASS_ACTIVATE);				\
 	SCX_EVENT(SCX_EV_INSERT_NOT_OWNED);				\
-	SCX_EVENT(SCX_EV_SUB_BYPASS_DISPATCH)
+	SCX_EVENT(SCX_EV_SUB_BYPASS_DISPATCH);				\
+	SCX_EVENT(SCX_EV_SUB_FORCED_ADMIT)
 
 struct scx_sched;
 
@@ -1270,6 +1278,9 @@ enum scx_cap_flags {
 	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
 
 	SCX_CAP_DUMMY			= BIT_U64(__SCX_CAP_DUMMY),
+
+	/* caps whose loss strands queued tasks, see scx_process_sync_ecaps() */
+	SCX_CAPS_REENQ_ON_LOSS		= 0,
 };
 
 #ifdef CONFIG_EXT_SUB_SCHED
@@ -1581,6 +1592,9 @@ enum scx_reenq_flags {
 	/* low 16bits determine which tasks should be reenqueued */
 	SCX_REENQ_ANY		= 1LLU << 0,	/* all tasks */
 
+	/* internal: kernel-issued on cap revoke, not accepted from BPF */
+	SCX_REENQ_CAP_REVOKE	= 1LLU << 1,
+
 	__SCX_REENQ_FILTER_MASK	= 0xffffLLU,
 
 	__SCX_REENQ_USER_MASK	= SCX_REENQ_ANY,
@@ -1833,6 +1847,9 @@ void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp);
 void scx_task_iter_unlock(struct scx_task_iter *iter);
 void scx_task_iter_stop(struct scx_task_iter *iter);
 struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter);
+void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p);
+void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+			 int sticky_cpu);
 bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 			    struct scx_dispatch_q *dsq, u64 enq_flags);
 bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq);
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 55437f1d1965..aea63484edc5 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -204,6 +204,116 @@ void scx_init_root_caps(struct scx_sched *sch)
 	}
 }
 
+/**
+ * scx_local_or_reject_dsq - Pick the local or reject DSQ for an insert
+ * @sch: enqueuing sub-sched
+ * @rq: rq whose local DSQ @p targets
+ * @p: task being inserted
+ * @enq_flags: in/out; %SCX_ENQ_IMMED is cleared when diverting to reject
+ *
+ * Return @rq's local DSQ if @sch holds the required caps on @rq's cid,
+ * otherwise @rq's reject DSQ after recording the reenq reason on @p.
+ *
+ * Bypass doesn't need special-casing as a bypassing sched's tasks are enqueued
+ * to and run by its nearest non-bypassing ancestor. If root is bypassing, it
+ * always holds all caps.
+ */
+struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq *rq,
+					       struct task_struct *p, u64 *enq_flags)
+{
+	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
+	u64 missing = scx_missing_caps(sch, cpu_of(rq), scx_caps_for_enq(*enq_flags));
+
+	/* requirements met */
+	if (likely(!missing))
+		return &rq->scx.local_dsq;
+
+	/*
+	 * A migration-disabled task must run on this CPU. Let it run and count
+	 * the violation.
+	 */
+	if (unlikely(is_migration_disabled(p))) {
+		__scx_add_event(sch, SCX_EV_SUB_FORCED_ADMIT, 1);
+		return &rq->scx.local_dsq;
+	}
+
+	p->scx.reenq_reason_caps = missing;
+	p->scx.reenq_reason_cid = cid;
+
+	/*
+	 * Only local DSQ can honor IMMED and dsq_inc_nr() WARNs on IMMED into
+	 * others. Strip both the enq flag and the sticky task flag - the
+	 * latter can carry in from an earlier admitted IMMED insert.
+	 */
+	*enq_flags &= ~SCX_ENQ_IMMED;
+	p->scx.flags &= ~SCX_TASK_IMMED;
+
+	return &rq->scx.reject_dsq;
+}
+
+/* @p lost the caps needed to stay on @rq's local DSQ? Record reason if so. */
+bool scx_task_reenq_on_cap_revoke(struct rq *rq, struct task_struct *p)
+{
+	u64 missing;
+
+	/* migration-disabled tasks are admitted regardless of caps */
+	if (is_migration_disabled(p))
+		return false;
+
+	missing = scx_missing_caps(scx_task_sched(p), cpu_of(rq), scx_caps_for_task(p));
+	if (likely(!missing))
+		return false;
+
+	p->scx.reenq_reason_caps = missing;
+	p->scx.reenq_reason_cid = __scx_cpu_to_cid(cpu_of(rq));
+	return true;
+}
+
+/*
+ * Drain @rq->scx.reject_dsq, reenqueueing each task so the BPF scheduler
+ * re-decides from p->scx.reenq_reason_*.
+ *
+ * A task can be re-rejected repeatedly, and there's no repeat limit here.
+ * Rejection can't happen for root, and sub-scheds can be safely ejected after
+ * triggering the stall watchdog.
+ */
+void scx_reenq_reject(struct rq *rq)
+{
+	LIST_HEAD(tasks);
+	struct task_struct *p, *n;
+
+	lockdep_assert_rq_held(rq);
+
+	if (list_empty(&rq->scx.reject_dsq.list))
+		return;
+
+	/*
+	 * Move to a private list so a task re-rejected by the
+	 * scx_do_enqueue_task() below isn't revisited this round.
+	 */
+	list_for_each_entry_safe(p, n, &rq->scx.reject_dsq.list, scx.dsq_list.node) {
+		/* migration_pending tasks should have bypassed to local DSQ */
+		if (WARN_ON_ONCE(p->migration_pending))
+			continue;
+
+		scx_dispatch_dequeue(rq, p);
+
+		if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
+			p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
+		p->scx.flags |= SCX_TASK_REENQ_CAP;
+
+		list_add_tail(&p->scx.dsq_list.node, &tasks);
+	}
+
+	list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
+		list_del_init(&p->scx.dsq_list.node);
+
+		scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+
+		p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
+	}
+}
+
 /* record a caps change, see struct scx_caps_updated */
 static void caps_updated_record(struct scx_pshard *ps, const struct scx_cmask *cids, u64 caps,
 				struct list_head *to_deliver)
@@ -348,6 +458,7 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 	s32 cid = __scx_cpu_to_cid(cpu);
 	s32 shard = scx_cid_to_shard[cid];
 	struct llist_node *batch, *pos, *tmp;
+	u64 lost_all = 0;
 
 	lockdep_assert_rq_held(rq);
 
@@ -372,16 +483,20 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 		struct scx_sched_pcpu *pcpu =
 			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
 		struct scx_pshard *ps = pcpu->sch->pshard[shard];
-		u64 ecaps;
+		u64 old, ecaps, lost;
 
 		init_llist_node(pos);
 
 		/* pairs with smp_mb() in queue_sync_ecaps(), see there */
 		smp_mb();
 
+		old = READ_ONCE(pcpu->ecaps);
 		ecaps = calc_effective_caps(ps, cid);
 		WRITE_ONCE(pcpu->ecaps, ecaps);
 
+		lost = old & ~ecaps;
+		lost_all |= lost;
+
 		/* tell the sched its effective caps on this cid changed */
 		if (ecaps != pcpu->reported_ecaps &&
 		    SCX_HAS_OP(pcpu->sch, sub_ecaps_updated) &&
@@ -398,6 +513,14 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 			pcpu->reported_ecaps = ecaps;
 		}
 	}
+
+	/*
+	 * Losing a cap can strand already-queued tasks. Schedule a reenq scan
+	 * to move the now-capless ones off the local DSQ. The scan tests
+	 * against the effective caps and thus must come after the ecaps sync.
+	 */
+	if (lost_all & SCX_CAPS_REENQ_ON_LOSS)
+		scx_schedule_reenq_local(rq, SCX_REENQ_CAP_REVOKE);
 }
 
 /*
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 1f0cef59302c..89d1458ff450 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -33,6 +33,10 @@ void scx_online_ecaps(struct rq *rq);
 void scx_offline_ecaps(struct rq *rq);
 void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu);
 void scx_discard_stale_ecaps_syncs(void);
+struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq *rq,
+					       struct task_struct *p, u64 *enq_flags);
+bool scx_task_reenq_on_cap_revoke(struct rq *rq, struct task_struct *p);
+void scx_reenq_reject(struct rq *rq);
 
 #else	/* CONFIG_EXT_SUB_SCHED */
 
@@ -51,6 +55,9 @@ static inline void scx_online_ecaps(struct rq *rq) {}
 static inline void scx_offline_ecaps(struct rq *rq) {}
 static inline void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu) {}
 static inline void scx_discard_stale_ecaps_syncs(void) {}
+static inline struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq *rq, struct task_struct *p, u64 *enq_flags) { return &rq->scx.local_dsq; }
+static inline bool scx_task_reenq_on_cap_revoke(struct rq *rq, struct task_struct *p) { return false; }
+static inline void scx_reenq_reject(struct rq *rq) {}
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
@@ -69,12 +76,54 @@ static inline void scx_discard_stale_ecaps_syncs(void) {}
 
 #ifdef CONFIG_EXT_SUB_SCHED
 
+/**
+ * scx_missing_caps - The caps in @needed that @sch lacks on @cpu
+ * @sch: sched to test
+ * @cpu: cpu to test on
+ * @needed: bitmask of SCX_CAP_* values
+ *
+ * Return the caps in @needed that @sch lacks for @cpu, 0 if it holds them all.
+ */
+static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed)
+{
+	u64 ecaps;
+
+	/* root holds every cap on every cpu */
+	if (!sch->level)
+		return 0;
+
+	ecaps = READ_ONCE(per_cpu_ptr(sch->pcpu, cpu)->ecaps);
+
+	return needed & ~ecaps;
+}
+
+/*
+ * Cap semantics: which caps an action requires, and which caps a cap implies.
+ * Keep all such mappings collected here.
+ */
+
+/* map @enq_flags to the SCX_CAP_* bit required for the local-DSQ insert */
+static inline u64 scx_caps_for_enq(u64 enq_flags)
+{
+	return 0;
+}
+
+/* map queued @p to the SCX_CAP_* bit required to stay on its local DSQ */
+static inline u64 scx_caps_for_task(struct task_struct *p)
+{
+	return 0;
+}
+
 /* caps implied by holding @cap */
 static inline u64 scx_caps_implied(u64 cap)
 {
 	return 0;
 }
 
+#else	/* CONFIG_EXT_SUB_SCHED */
+
+static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed) { return 0; }
+
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e05dcdff3ace..8db6b09d91bf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -794,6 +794,9 @@ enum scx_rq_flags {
 
 struct scx_rq {
 	struct scx_dispatch_q	local_dsq;
+#ifdef CONFIG_EXT_SUB_SCHED
+	struct scx_dispatch_q	reject_dsq;		/* staging for cap-rejected tasks */
+#endif
 	struct list_head	runnable_list;		/* runnable tasks on this rq */
 	struct list_head	ddsp_deferred_locals;	/* deferred ddsps from enq */
 	unsigned long		ops_qseq;
-- 
2.54.0



* [PATCH sched_ext/for-7.3 22/32] sched_ext: Add the SCX_CAP_ENQ_IMMED cap
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (20 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 23/32] sched_ext: Assign a unique id to each scheduler instance Tejun Heo
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Replace the __SCX_CAP_DUMMY placeholder with SCX_CAP_ENQ_IMMED, which gates
inserting IMMED tasks onto a cid's local DSQ. An IMMED enqueue is guaranteed
to either get its task running on the cpu at once or hand it back to the
scheduler, so IMMED work can never pile up on the cpu's queue. A cpu can
therefore be shared across sub-scheds through IMMED access without any of
them swamping it.

That makes ENQ_IMMED the natural baseline, the minimal cap to make any use
of a cpu. SCX_CAP_BASE aliases it so gates on basic cpu access can state the
intention instead of naming ENQ_IMMED.

Enforcement covers inserts and queued tasks. An insert without the cap is
diverted to the reject DSQ, and queued tasks are reenqueued when the cap is
lost. scx_bpf_sub_dispatch() skips a child that lacks the cap on the cpu, as
its inserts would only be rejected. Vacating the running task on cap loss
lands in a later patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
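Note for readers: with this patch the cap-loss path added earlier becomes
live. The sketch below models, in userspace, how a batch of effective-cap
syncs decides whether the local DSQ needs a rescan, and how the dispatch gate
on the base cap behaves. The MODEL_* and model_* names are hypothetical; the
"old & ~new" test and the bit values follow the hunks in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the cap bits; values follow the enum in the patch. */
#define MODEL_CAP_ENQ_IMMED		(UINT64_C(1) << 0)
#define MODEL_CAP_BASE			MODEL_CAP_ENQ_IMMED
#define MODEL_CAPS_REENQ_ON_LOSS	MODEL_CAP_ENQ_IMMED

/* per-(sched, cpu) state; only the effective caps are modelled */
struct model_pcpu {
	uint64_t ecaps;
};

/* Install @new_ecaps and return the caps held before that are gone now. */
static uint64_t model_sync_ecaps(struct model_pcpu *pcpu, uint64_t new_ecaps)
{
	uint64_t old = pcpu->ecaps;

	pcpu->ecaps = new_ecaps;
	return old & ~new_ecaps;
}

/* Sync @nr entries and report whether a local-DSQ rescan is needed. */
static bool model_process_syncs(struct model_pcpu *pcpus,
				const uint64_t *new_ecaps, int nr)
{
	uint64_t lost_all = 0;
	int i;

	assert(pcpus && new_ecaps);

	for (i = 0; i < nr; i++)
		lost_all |= model_sync_ecaps(&pcpus[i], new_ecaps[i]);

	/* only caps whose loss strands queued tasks trigger the rescan */
	return (lost_all & MODEL_CAPS_REENQ_ON_LOSS) != 0;
}

/* Dispatch gate: skip a child that lacks the base cap on this cpu. */
static bool model_should_dispatch_child(const struct model_pcpu *child)
{
	return (MODEL_CAP_BASE & ~child->ecaps) == 0;
}
```

Accumulating the lost bits across the whole batch and scheduling at most one
rescan afterwards matches the comment in scx_process_sync_ecaps(): the scan
tests against the effective caps, so it has to run after they are updated.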
 kernel/sched/ext/ext.c      |  2 +-
 kernel/sched/ext/internal.h | 11 ++++++++---
 kernel/sched/ext/sub.c      |  8 ++++++++
 kernel/sched/ext/sub.h      |  4 ++--
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index b6d68a80a04f..9309d57e3f4f 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -4757,7 +4757,7 @@ SCX_ATTR(events);
 
 #ifdef CONFIG_EXT_SUB_SCHED
 static const char *scx_cap_names[__SCX_NR_CAPS] = {
-	[__SCX_CAP_DUMMY]	= "dummy",
+	[__SCX_CAP_ENQ_IMMED]	= "enq_immed",
 };
 
 static ssize_t scx_attr_caps_show(struct kobject *kobj,
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index ef6b4d0f7dee..20a1ffbe4c26 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1270,17 +1270,22 @@ struct scx_sched_pnode {
  * topology-aligned and likely to serve as the locality unit when cids are
  * allocated to schedulers, so per-shard lock granularity scales naturally with
  * the allocation pattern.
+ *
+ * ENQ_IMMED  insert an IMMED task onto the cid's local DSQ
  */
 enum scx_cap_flags {
-	__SCX_CAP_DUMMY			= 0,
+	__SCX_CAP_ENQ_IMMED		= 0,
 
 	__SCX_NR_CAPS,
 	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
 
-	SCX_CAP_DUMMY			= BIT_U64(__SCX_CAP_DUMMY),
+	SCX_CAP_ENQ_IMMED		= BIT_U64(__SCX_CAP_ENQ_IMMED),
+
+	/* alias for minimal cap to make any use of a cpu */
+	SCX_CAP_BASE			= SCX_CAP_ENQ_IMMED,
 
 	/* caps whose loss strands queued tasks, see scx_process_sync_ecaps() */
-	SCX_CAPS_REENQ_ON_LOSS		= 0,
+	SCX_CAPS_REENQ_ON_LOSS		= SCX_CAP_ENQ_IMMED,
 };
 
 #ifdef CONFIG_EXT_SUB_SCHED
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index aea63484edc5..2f1e19db8e72 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -1230,6 +1230,14 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *
 		return false;
 	}
 
+	/*
+	 * Skip a child that does not effectively hold the base cap on this cpu:
+	 * its inserts would only be rejected. ecaps are synced at the top of
+	 * balance_one() before dispatch, so this reflects the in-effect state.
+	 */
+	if (scx_missing_caps(child, cpu_of(this_rq), SCX_CAP_BASE))
+		return false;
+
 	return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
 				  true);
 }
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 89d1458ff450..ea8bea347bb0 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -105,13 +105,13 @@ static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed)
 /* map @enq_flags to the SCX_CAP_* bit required for the local-DSQ insert */
 static inline u64 scx_caps_for_enq(u64 enq_flags)
 {
-	return 0;
+	return SCX_CAP_ENQ_IMMED;
 }
 
 /* map queued @p to the SCX_CAP_* bit required to stay on its local DSQ */
 static inline u64 scx_caps_for_task(struct task_struct *p)
 {
-	return 0;
+	return SCX_CAP_ENQ_IMMED;
 }
 
 /* caps implied by holding @cap */
-- 
2.54.0



* [PATCH sched_ext/for-7.3 23/32] sched_ext: Assign a unique id to each scheduler instance
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (21 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 22/32] sched_ext: Add the SCX_CAP_ENQ_IMMED cap Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 24/32] sched_ext: Route task slice writes through set_task_slice() Tejun Heo
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Neither a scx_sched pointer nor its cgroup id uniquely identifies a
scheduler instance. A freed sched's memory can be reallocated, and a cgroup
can detach one sched and attach another. Add a monotonic, never-reused u64
id. A later patch compares it to drop a slice request that outlived a change
of a task's owning scheduler.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      | 4 ++++
 kernel/sched/ext/internal.h | 3 +++
 2 files changed, 7 insertions(+)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 9309d57e3f4f..7d4fd4e9d5a4 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -108,6 +108,9 @@ static bool dsq_is_rq_owned(struct scx_dispatch_q *dsq)
 	}
 }
 
+/* Cursor for unique scx_sched instance ids. id 0 is reserved. */
+static atomic64_t scx_sched_id_cursor = ATOMIC64_INIT(0);
+
 #ifdef CONFIG_EXT_SUB_SCHED
 /*
  * The sub sched being enabled. Used by scx_disable_and_exit_task() to exit
@@ -6447,6 +6450,7 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 		       level * sizeof(parent->ancestors[0]));
 	sch->ancestors[level] = sch;
 	sch->level = level;
+	sch->id = atomic64_inc_return(&scx_sched_id_cursor);
 
 	if (ops->timeout_ms)
 		sch->watchdog_timeout = msecs_to_jiffies(ops->timeout_ms);
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 20a1ffbe4c26..323c88835698 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1443,6 +1443,9 @@ struct scx_sched {
 
 	struct list_head	all;
 
+	/* unique instance id, monotonic and never reused */
+	u64			id;
+
 #ifdef CONFIG_EXT_SUB_SCHED
 	struct rhash_head	hash_node;
 
-- 
2.54.0
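
The id scheme in the patch above can be pictured with a small userspace
sketch. This is not the kernel code: it assumes C11 <stdatomic.h> in place of
atomic64_t, and the names sched_id_cursor, alloc_sched_id(), sched_init() and
struct sched are hypothetical. It only illustrates that the first id handed
out is 1, so 0 stays reserved, and that reusing the same storage still yields
a different id.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for scx_sched_id_cursor; starts at 0 so id 0 is never issued. */
static _Atomic uint64_t sched_id_cursor = 0;

/* Hypothetical stand-in for struct scx_sched, reduced to the id field. */
struct sched {
	uint64_t id;
};

/* Mirrors the effect of atomic64_inc_return(): increment, return the new value. */
static uint64_t alloc_sched_id(void)
{
	return atomic_fetch_add(&sched_id_cursor, 1) + 1;
}

/* Initialise a (possibly recycled) instance with a fresh id. */
static void sched_init(struct sched *sch)
{
	assert(sch != NULL);
	sch->id = alloc_sched_id();
}
```

Calling sched_init() twice on the same struct gives two different non-zero
ids, which is the property a pointer comparison cannot provide once memory
is reallocated.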



* [PATCH sched_ext/for-7.3 24/32] sched_ext: Route task slice writes through set_task_slice()
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (22 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 23/32] sched_ext: Assign a unique id to each scheduler instance Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice Tejun Heo
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A later change makes every write to p->scx.slice also drop a pending
out-of-band slice request, which needs all such writes to go through one
place. Introduce set_task_slice() and route the existing direct writes
through it. No functional change - the helper only assigns p->scx.slice.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 7d4fd4e9d5a4..95ad8f37cc92 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -1158,6 +1158,12 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
 #endif
 }
 
+/* set @p's slice, every write to p->scx.slice goes through here */
+static void set_task_slice(struct task_struct *p, u64 slice)
+{
+	p->scx.slice = slice;
+}
+
 static void update_curr_scx(struct rq *rq)
 {
 	struct task_struct *curr = rq->curr;
@@ -1243,7 +1249,7 @@ static void dsq_dec_nr(struct scx_dispatch_q *dsq, struct task_struct *p)
 
 static void refill_task_slice_dfl(struct scx_sched *sch, struct task_struct *p)
 {
-	p->scx.slice = READ_ONCE(sch->slice_dfl);
+	set_task_slice(p, READ_ONCE(sch->slice_dfl));
 	__scx_add_event(sch, SCX_EV_REFILL_SLICE_DFL, 1);
 }
 
@@ -1338,7 +1344,7 @@ static void rq_owned_post_enq(struct scx_sched *sch, struct rq *rq,
 
 	if ((enq_flags & SCX_ENQ_PREEMPT) && p != rq->curr &&
 	    rq->curr->sched_class == &ext_sched_class) {
-		rq->curr->scx.slice = 0;
+		set_task_slice(rq->curr, 0);
 		resched_curr(rq);
 	}
 }
@@ -2044,7 +2050,7 @@ static void yield_task_scx(struct rq *rq)
 	if (SCX_HAS_OP(sch, yield))
 		SCX_CALL_OP_2TASKS_RET(sch, yield, rq, p, NULL);
 	else
-		p->scx.slice = 0;
+		set_task_slice(p, 0);
 }
 
 static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
@@ -3278,7 +3284,7 @@ static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
 	 * we can't trust the slice management or ops.core_sched_before().
 	 */
 	if (scx_bypassing(sch, cpu_of(rq))) {
-		curr->scx.slice = 0;
+		set_task_slice(curr, 0);
 		touch_core_sched(rq, curr);
 	} else if (SCX_HAS_OP(sch, tick)) {
 		SCX_CALL_OP_TASK(sch, tick, rq, curr);
@@ -3417,7 +3423,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
 	 * control, after ops.disable() has observed their final values.
 	 */
 	p->scx.dsq_vtime = 0;
-	p->scx.slice = 0;
+	set_task_slice(p, 0);
 
 	/*
 	 * Verify the task is not in BPF scheduler's custody. If flag
@@ -7038,7 +7044,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 			queue_flags |= DEQUEUE_CLASS;
 
 		scoped_guard (sched_change, p, queue_flags) {
-			p->scx.slice = READ_ONCE(sch->slice_dfl);
+			set_task_slice(p, READ_ONCE(sch->slice_dfl));
 			p->sched_class = new_class;
 		}
 	}
@@ -7676,7 +7682,7 @@ static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_r
 	    !sched_class_above(cur_class, &ext_sched_class)) {
 		if (cpumask_test_cpu(cpu, pcpu->cpus_to_preempt)) {
 			if (cur_class == &ext_sched_class)
-				rq->curr->scx.slice = 0;
+				set_task_slice(rq->curr, 0);
 			cpumask_clear_cpu(cpu, pcpu->cpus_to_preempt);
 		}
 
@@ -8026,9 +8032,9 @@ __bpf_kfunc bool scx_bpf_dsq_insert___v2(struct task_struct *p, u64 dsq_id,
 		return false;
 
 	if (slice)
-		p->scx.slice = slice;
+		set_task_slice(p, slice);
 	else
-		p->scx.slice = p->scx.slice ?: 1;
+		set_task_slice(p, p->scx.slice ?: 1);
 
 	scx_dsq_insert_commit(sch, p, dsq_id, enq_flags);
 
@@ -8052,9 +8058,9 @@ static bool scx_dsq_insert_vtime(struct scx_sched *sch, struct task_struct *p,
 		return false;
 
 	if (slice)
-		p->scx.slice = slice;
+		set_task_slice(p, slice);
 	else
-		p->scx.slice = p->scx.slice ?: 1;
+		set_task_slice(p, p->scx.slice ?: 1);
 
 	p->scx.dsq_vtime = vtime;
 
@@ -8238,7 +8244,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 	if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_VTIME)
 		p->scx.dsq_vtime = kit->vtime;
 	if (kit->cursor.flags & __SCX_DSQ_ITER_HAS_SLICE)
-		p->scx.slice = kit->slice;
+		set_task_slice(p, kit->slice);
 
 	/* execute move */
 	locked_rq = move_task_between_dsqs(sch, p, enq_flags, src_dsq, dst_dsq);
@@ -8633,7 +8639,7 @@ __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice,
 	if (unlikely(!sch || !scx_task_on_sched(sch, p)))
 		return false;
 
-	p->scx.slice = slice;
+	set_task_slice(p, slice);
 	return true;
 }
 
-- 
2.54.0
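
The point of funnelling every write through one helper can be sketched in
userspace as follows. The reduced struct task, the slice_req field and the
two callers are hypothetical; the side effect shown (dropping a pending
request) is the one the commit message says a later change adds, written
here with C11 atomics rather than the kernel's atomic64_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced task: only the slice and a pending-request word. */
struct task {
	uint64_t slice;
	_Atomic uint64_t slice_req;	/* 0 means no pending request */
};

/* The single write path: a side effect added here covers every caller. */
static void set_task_slice(struct task *p, uint64_t slice)
{
	assert(p != NULL);
	p->slice = slice;
	/* side effect a later change could add: drop a pending request */
	atomic_store(&p->slice_req, 0);
}

/* Two former direct writers, now routed through the helper. */
static void yield_task(struct task *p)
{
	set_task_slice(p, 0);
}

static void refill_task(struct task *p, uint64_t dfl)
{
	set_task_slice(p, dfl);
}
```

With direct assignments scattered across callers, the same side effect would
have to be repeated at each site, and a missed site would leave a stale
request pending.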



* [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (23 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 24/32] sched_ext: Route task slice writes through set_task_slice() Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 13:34   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 26/32] sched_ext: Add the SCX_CAP_ENQ cap Tejun Heo
                   ` (6 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A task's slice grants it cpu occupancy - how long it holds its cpu. In a
sub-scheduler hierarchy cpu access is delegated through revocable
capabilities, so a task's occupancy must follow them. Only its own scheduler
sets its slice, and extending the slice is allowed only while that scheduler
holds baseline cpu access (SCX_CAP_BASE) on the cpu. Otherwise a scheduler
could keep occupying a cpu it has been denied simply by handing out long
slices.

The cap check reads effective caps, which are coherent only under the task's
rq lock, and the kernel decrements the slice under that lock as the task
runs. A running task's slice can therefore be changed only under the rq
lock, while a queued task's slice can be set directly. Make
scx_bpf_task_set_slice() apply the slice under the rq lock: synchronously
when the caller already holds it, otherwise by stashing it in the new
p->scx.slice_oob, tagged with the scheduler's id so that a request that
outlived a reassignment is dropped.

Revocation is enforced through the same grant. When a cpu's effective caps
lose SCX_CAP_BASE, the cap-revoke reenq scan also checks the running task
and zeroes its slice to evict it. The scan runs as a balance callback after
the pick, so this catches both the task that was running when the revoke
landed and a capless task the pick just promoted off the local DSQ. The
paths that keep a task on its cpu - holding on to the last runnable task in
balance, the ENQ_LAST reinsertion and the slice refill on pick - skip tasks
lacking baseline access. A migration-disabled task is exempt, mirroring its
capless admission on insert.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/linux/sched/ext.h   |  17 +++-
 kernel/sched/ext/ext.c      | 174 ++++++++++++++++++++++++++++++++++--
 kernel/sched/ext/internal.h |  19 +++-
 kernel/sched/ext/sub.h      |  11 +++
 4 files changed, 208 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
index 7e3f6b33f4a8..7f3f8a26b0b4 100644
--- a/include/linux/sched/ext.h
+++ b/include/linux/sched/ext.h
@@ -222,10 +222,11 @@ struct sched_ext_entity {
 	/* BPF scheduler modifiable fields */
 
 	/*
-	 * Runtime budget in nsecs. This is usually set through
-	 * scx_bpf_dsq_insert() but can also be modified directly by the BPF
-	 * scheduler. Automatically decreased by SCX as the task executes. On
-	 * depletion, a scheduling event is triggered.
+	 * Runtime budget in nsecs - how long the task may hold its cpu. Owned
+	 * by the task's scheduler. Set it when enqueuing via
+	 * scx_bpf_dsq_insert(), or otherwise via scx_bpf_task_set_slice().
+	 * Automatically decreased as the task executes. On depletion a
+	 * scheduling event is triggered.
 	 *
 	 * This value is cleared to zero if the task is preempted by
 	 * %SCX_KICK_PREEMPT and shouldn't be used to determine how long the
@@ -242,6 +243,14 @@ struct sched_ext_entity {
 	 */
 	u64			dsq_vtime;
 
+	/*
+	 * Out-of-band slice request from scx_bpf_task_set_slice() when the
+	 * caller does not hold the rq lock, applied under the rq lock at the
+	 * next slice consideration. One atomic64 packs the pending flag, the
+	 * issuing sch's id, and the requested slice. See scx_slice_oob_consts.
+	 */
+	atomic64_t		slice_oob;
+
 	/*
 	 * Sub-sched cap rejected reenq context, valid only while
 	 * %SCX_TASK_REENQ_CAP is set. @reenq_reason_caps is the SCX_CAP_* bits
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 95ad8f37cc92..dfae05ce3e81 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -1158,10 +1158,125 @@ static void touch_core_sched_dispatch(struct rq *rq, struct task_struct *p)
 #endif
 }
 
-/* set @p's slice, every write to p->scx.slice goes through here */
+/*
+ * p->scx.slice_oob packs an out-of-band slice request into one atomic64. A zero
+ * word means no request. Otherwise the fields are:
+ *
+ *   63      SCX_SLICE_OOB_PENDING, set on every request
+ *   62-43   lower bits of issuing scheduler's id
+ *   42-0    requested slice duration in nsecs
+ *
+ * A duration of SCX_SLICE_OOB_DUR_MASK means SCX_SLICE_INF. A finite dur
+ * saturates at SCX_SLICE_OOB_DUR_MASK - 1. The id is used to detect and ignore
+ * a request that outlived a task ownership change.
+ *
+ * Only the low 20 bits of sch->id are packed, which is enough to make
+ * collisions practically impossible. A theoretical collision just lets a stale
+ * request through once.
+ */
+enum scx_slice_oob_consts {
+	SCX_SLICE_OOB_DUR_BITS	= 43,
+	SCX_SLICE_OOB_ID_BITS	= 64 - SCX_SLICE_OOB_DUR_BITS - 1,
+
+	SCX_SLICE_OOB_DUR_MASK	= (1LLU << SCX_SLICE_OOB_DUR_BITS) - 1,
+	SCX_SLICE_OOB_ID_SHIFT	= SCX_SLICE_OOB_DUR_BITS,
+	SCX_SLICE_OOB_ID_MASK	= (1LLU << SCX_SLICE_OOB_ID_BITS) - 1,
+	SCX_SLICE_OOB_PENDING	= 1LLU << 63,
+};
+
+/*
+ * Slice write rules
+ *
+ * A task's slice - how long it may hold its cpu - is an occupancy grant owned
+ * by the task's scheduler. How it may be written depends on whether the task is
+ * running.
+ *
+ * Queued, not running: the slice grants no occupancy yet and nothing consumes
+ * it, so the owner writes it directly - via scx_bpf_dsq_insert(), the dsq move
+ * kfuncs, or scx_bpf_task_set_slice(). Serializing its own writers is then the
+ * scheduler's job, not the kernel's.
+ *
+ * Running: the slice must be changed under the task's rq lock, because:
+ *
+ * - Raising it extends occupancy, allowed only with %SCX_CAP_BASE on the cpu,
+ *   and that cap check is coherent only under the rq lock. Shortening is always
+ *   allowed.
+ *
+ * - The kernel decrements it there as the task runs. The decrement is a
+ *   read-modify-write, so a racing write can be clobbered.
+ *
+ * scx_bpf_task_set_slice() writes directly when it holds the rq lock, and
+ * otherwise stashes the value in p->scx.slice_oob for the kernel to apply under
+ * the lock. A later in-band write supersedes a stash, and a stash whose
+ * scheduler id no longer matches the task's owner is dropped.
+ */
+
+/* clear a pending slice request */
+static void clear_task_slice_oob(struct task_struct *p)
+{
+	if (unlikely(atomic64_read(&p->scx.slice_oob)))
+		atomic64_set(&p->scx.slice_oob, 0);
+}
+
+/* set @p's slice, superseding any pending out-of-band request */
 static void set_task_slice(struct task_struct *p, u64 slice)
 {
 	p->scx.slice = slice;
+	clear_task_slice_oob(p);
+}
+
+/* request @p's slice to be set to @slice, see the slice write rules above */
+static void set_task_slice_oob(struct scx_sched *sch, struct task_struct *p, u64 slice)
+{
+	u64 dur;
+
+	if (slice == SCX_SLICE_INF) {
+		dur = SCX_SLICE_OOB_DUR_MASK;
+	} else if (unlikely(slice >= SCX_SLICE_OOB_DUR_MASK)) {
+		dur = SCX_SLICE_OOB_DUR_MASK - 1;
+		scx_add_event(sch, SCX_EV_SLICE_CLAMPED, 1);
+	} else {
+		dur = slice;
+	}
+
+	atomic64_set(&p->scx.slice_oob, SCX_SLICE_OOB_PENDING |
+		     ((sch->id & SCX_SLICE_OOB_ID_MASK) << SCX_SLICE_OOB_ID_SHIFT) | dur);
+}
+
+/*
+ * Apply a pending out-of-band slice request under @rq's lock. A request whose
+ * packed id no longer matches @p's current owner is dropped. An extension needs
+ * baseline cpu access on @p's cid. %SCX_EV_SLICE_DENIED counts the denials.
+ * Shortening is always allowed. See the slice write rules above.
+ */
+static void apply_task_slice_oob(struct rq *rq, struct task_struct *p)
+{
+	u64 oob, dur, slice;
+
+	lockdep_assert_rq_held(rq);
+
+	if (likely(!atomic64_read(&p->scx.slice_oob)))
+		return;
+
+	oob = atomic64_xchg(&p->scx.slice_oob, 0);
+	if (unlikely(!oob))
+		return;
+
+	/* the issuing scheduler no longer owns @p, drop the request */
+	if (unlikely(((oob >> SCX_SLICE_OOB_ID_SHIFT) & SCX_SLICE_OOB_ID_MASK) !=
+		     (scx_task_sched(p)->id & SCX_SLICE_OOB_ID_MASK)))
+		return;
+
+	dur = oob & SCX_SLICE_OOB_DUR_MASK;
+	slice = dur == SCX_SLICE_OOB_DUR_MASK ? SCX_SLICE_INF : dur;
+
+	if (slice > p->scx.slice &&
+	    unlikely(scx_missing_caps(scx_task_sched(p), cpu_of(rq), SCX_CAP_BASE))) {
+		__scx_add_event(scx_task_sched(p), SCX_EV_SLICE_DENIED, 1);
+		return;
+	}
+
+	p->scx.slice = slice;
 }
 
 static void update_curr_scx(struct rq *rq)
@@ -1169,6 +1284,9 @@ static void update_curr_scx(struct rq *rq)
 	struct task_struct *curr = rq->curr;
 	s64 delta_exec;
 
+	/* apply even on 0 delta_exec, callers may still act on the slice */
+	apply_task_slice_oob(rq, curr);
+
 	delta_exec = update_curr_common(rq);
 	if (unlikely(delta_exec <= 0))
 		return;
@@ -2682,7 +2800,8 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	 * %SCX_OPS_ENQ_LAST is in effect.
 	 */
 	if ((prev->scx.flags & SCX_TASK_QUEUED) &&
-	    (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_bypassing(sch, cpu))) {
+	    (!(sch->ops.flags & SCX_OPS_ENQ_LAST) || scx_bypassing(sch, cpu)) &&
+	    scx_task_can_stay_on_cpu(rq, prev)) {
 		rq->scx.flags |= SCX_RQ_BAL_KEEP;
 		__scx_add_event(sch, SCX_EV_DISPATCH_KEEP_LAST, 1);
 		goto has_tasks;
@@ -2843,12 +2962,14 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		 * sched_class, %SCX_OPS_ENQ_LAST must be set. Tell
 		 * ops.enqueue() that @p is the only one available for this cpu,
 		 * which should trigger an explicit follow-up scheduling event.
+		 * This doesn't apply if the baseline access on the CPU is lost.
 		 *
 		 * Core scheduling can force this CPU idle while @p stays
 		 * runnable. @p's cookie then won't match the core's, so skip
 		 * the warning in that case.
 		 */
-		if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
+		if (next && sched_class_above(&ext_sched_class, next->sched_class) &&
+		    scx_task_can_stay_on_cpu(rq, p)) {
 			WARN_ON_ONCE(sched_cpu_cookie_match(rq, p) &&
 				     !(sch->ops.flags & SCX_OPS_ENQ_LAST));
 			scx_do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
@@ -2970,7 +3091,7 @@ do_pick_task_scx(struct rq *rq, struct rq_flags *rf, bool force_scx)
 		if (!p)
 			return NULL;
 
-		if (unlikely(!p->scx.slice)) {
+		if (unlikely(!p->scx.slice) && scx_task_can_stay_on_cpu(rq, p)) {
 			struct scx_sched *sch = scx_task_sched(p);
 
 			if (!scx_bypassing(sch, cpu_of(rq)) &&
@@ -3898,6 +4019,20 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 		nr_enqueued++;
 	}
 
+	/*
+	 * The revoke that scheduled this scan may have raced the pick: curr
+	 * may be a now-capless task, either one that kept running or one
+	 * promoted off the local DSQ between the ecaps sync and this scan.
+	 * Zero the slice to evict it. The enqueue gate blocks new capless
+	 * inserts, so no later pick can slip through after the scan.
+	 */
+	if ((reenq_flags & SCX_REENQ_CAP_REVOKE) &&
+	    rq->curr->sched_class == &ext_sched_class &&
+	    scx_task_reenq_on_cap_revoke(rq, rq->curr)) {
+		set_task_slice(rq->curr, 0);
+		resched_curr(rq);
+	}
+
 	return nr_enqueued;
 }
 
@@ -8626,20 +8761,45 @@ __bpf_kfunc_start_defs();
  * @slice: time slice to set in nsecs
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
- * Set @p's time slice to @slice. Returns %true on success, %false if the
- * calling scheduler doesn't have authority over @p.
+ * Set @p's time slice. @p must be on the calling scheduler. The value is
+ * applied whether or not the caller holds @p's rq lock - see the slice write
+ * rules above for the ownership model.
+ *
+ * Raising the slice is honored only while the scheduler holds %SCX_CAP_BASE on
+ * @p's cpu, otherwise it is counted in %SCX_EV_SLICE_DENIED. Shortening is
+ * always allowed. On the stashed path the slice is packed into an atomic64_t
+ * with the scheduler id and a flag bit, so a slice too large to fit is clamped
+ * and counted in %SCX_EV_SLICE_CLAMPED. %SCX_SLICE_INF is preserved.
+ *
+ * Return %true on success, %false if @p is not on the calling scheduler.
  */
 __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice,
 					const struct bpf_prog_aux *aux)
 {
 	struct scx_sched *sch;
+	struct rq *rq = task_rq(p);
 
 	guard(rcu)();
 	sch = scx_prog_sched(aux);
 	if (unlikely(!sch || !scx_task_on_sched(sch, p)))
 		return false;
 
-	set_task_slice(p, slice);
+	/*
+	 * Out of band: stash and apply under the rq lock at the next drain,
+	 * where it is re-validated against @p's current owner.
+	 */
+	if (scx_locked_rq() != rq) {
+		set_task_slice_oob(sch, p, slice);
+		return true;
+	}
+
+	/* under the rq lock: apply now, extensions gated on baseline access */
+	if (slice > p->scx.slice &&
+	    unlikely(scx_missing_caps(sch, cpu_of(rq), SCX_CAP_BASE)))
+		__scx_add_event(sch, SCX_EV_SLICE_DENIED, 1);
+	else
+		set_task_slice(p, slice);
+
 	return true;
 }
 
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 323c88835698..48d975a457ca 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -400,8 +400,9 @@ struct sched_ext_ops {
 	 * @p: task running currently
 	 *
 	 * This operation is called every 1/HZ seconds on CPUs which are
-	 * executing an SCX task. Setting @p->scx.slice to 0 will trigger an
-	 * immediate dispatch cycle on the CPU.
+	 * executing an SCX task. Setting a slice of 0 for @p with
+	 * scx_bpf_task_set_slice() will trigger an immediate dispatch cycle on
+	 * the CPU.
 	 */
 	void (*tick)(struct task_struct *p);
 
@@ -1103,6 +1104,18 @@ struct scx_event_stats {
 	 */
 	s64		SCX_EV_REFILL_SLICE_DFL;
 
+	/*
+	 * The number of times an out-of-band slice request exceeded the maximum
+	 * representable value and was clamped.
+	 */
+	s64		SCX_EV_SLICE_CLAMPED;
+
+	/*
+	 * The number of times a slice extension was denied because the
+	 * scheduler lacked baseline cpu access on the task's cpu.
+	 */
+	s64		SCX_EV_SLICE_DENIED;
+
 	/*
 	 * The total duration of bypass modes in nanoseconds.
 	 */
@@ -1153,6 +1166,8 @@ struct scx_event_stats {
 	SCX_EVENT(SCX_EV_REENQ_IMMED);					\
 	SCX_EVENT(SCX_EV_REENQ_LOCAL_REPEAT);				\
 	SCX_EVENT(SCX_EV_REFILL_SLICE_DFL);				\
+	SCX_EVENT(SCX_EV_SLICE_CLAMPED);				\
+	SCX_EVENT(SCX_EV_SLICE_DENIED);					\
 	SCX_EVENT(SCX_EV_BYPASS_DURATION);				\
 	SCX_EVENT(SCX_EV_BYPASS_DISPATCH);				\
 	SCX_EVENT(SCX_EV_BYPASS_ACTIVATE);				\
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index ea8bea347bb0..ce626a29b33b 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -120,9 +120,20 @@ static inline u64 scx_caps_implied(u64 cap)
 	return 0;
 }
 
+/* may @p keep running on @rq's cpu? requires baseline cpu access */
+static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p)
+{
+	/* a migration-disabled task is let in without caps, keep it likewise */
+	if (unlikely(is_migration_disabled(p)))
+		return true;
+
+	return likely(!scx_missing_caps(scx_task_sched(p), cpu_of(rq), SCX_CAP_BASE));
+}
+
 #else	/* CONFIG_EXT_SUB_SCHED */
 
 static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed) { return 0; }
+static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p) { return true; }
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
-- 
2.54.0
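
The slice_oob word layout and the drain-time rules from the patch above can
be exercised in a userspace sketch. The field widths come from
scx_slice_oob_consts in the patch. Everything else is an assumption made for
illustration: the OOB_* macro names, the oob_pack() and oob_apply() helpers,
passing the cap check in as a plain bool, and SLICE_INF being the all-ones
u64 (which is how SCX_SLICE_INF is defined in current kernels). Event
counting and locking are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field widths as in the patch's scx_slice_oob_consts. */
#define OOB_DUR_BITS	43
#define OOB_ID_BITS	(64 - OOB_DUR_BITS - 1)
#define OOB_DUR_MASK	((1ULL << OOB_DUR_BITS) - 1)
#define OOB_ID_SHIFT	OOB_DUR_BITS
#define OOB_ID_MASK	((1ULL << OOB_ID_BITS) - 1)
#define OOB_PENDING	(1ULL << 63)

/* Assumption: the infinite slice is the all-ones u64. */
#define SLICE_INF	UINT64_MAX

static_assert(OOB_ID_BITS == 20, "id field is 20 bits wide");

/* Pack a request: pending flag, low id bits, saturated duration. */
static uint64_t oob_pack(uint64_t sched_id, uint64_t slice)
{
	uint64_t dur;

	if (slice == SLICE_INF)
		dur = OOB_DUR_MASK;		/* reserved encoding for "infinite" */
	else if (slice >= OOB_DUR_MASK)
		dur = OOB_DUR_MASK - 1;		/* clamp a too-large finite slice */
	else
		dur = slice;

	return OOB_PENDING | ((sched_id & OOB_ID_MASK) << OOB_ID_SHIFT) | dur;
}

/*
 * Decide what the slice becomes when a packed request is drained.
 * @owner_id: id of the task's current scheduler
 * @has_base: whether that scheduler holds baseline access on the cpu
 * Returns the new slice, or @cur if the request is dropped or denied.
 */
static uint64_t oob_apply(uint64_t oob, uint64_t owner_id, bool has_base,
			  uint64_t cur)
{
	uint64_t dur, slice;

	if (!oob)
		return cur;			/* nothing pending */
	if (((oob >> OOB_ID_SHIFT) & OOB_ID_MASK) != (owner_id & OOB_ID_MASK))
		return cur;			/* stale: owner changed */

	dur = oob & OOB_DUR_MASK;
	slice = dur == OOB_DUR_MASK ? SLICE_INF : dur;

	if (slice > cur && !has_base)
		return cur;			/* extension denied */
	return slice;				/* shortening is always allowed */
}
```

Because only the low 20 id bits are compared, two ids that differ by a
multiple of 2^20 are indistinguishable here, which is the theoretical
collision the patch comment mentions.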



* [PATCH sched_ext/for-7.3 26/32] sched_ext: Add the SCX_CAP_ENQ cap
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (24 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT Tejun Heo
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Add SCX_CAP_ENQ, which gates inserting tasks onto a cid's local DSQ. Unlike
IMMED enqueues, plain enqueues can pile up, so ENQ is the stronger cap and
implies ENQ_IMMED. Losing ENQ also triggers the reenq scan. The scan tests
each queued task and the running task against the cap that task needs, as
reported by scx_caps_for_task(). Losing only ENQ therefore reenqueues plain
tasks and evicts a running plain task, while IMMED tasks, which need only
ENQ_IMMED, stay put.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      |  1 +
 kernel/sched/ext/internal.h | 14 +++++++++++---
 kernel/sched/ext/sub.h      | 12 ++++++++++--
 3 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index dfae05ce3e81..7e521dc7e1b7 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -4902,6 +4902,7 @@ SCX_ATTR(events);
 #ifdef CONFIG_EXT_SUB_SCHED
 static const char *scx_cap_names[__SCX_NR_CAPS] = {
 	[__SCX_CAP_ENQ_IMMED]	= "enq_immed",
+	[__SCX_CAP_ENQ]		= "enq",
 };
 
 static ssize_t scx_attr_caps_show(struct kobject *kobj,
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 48d975a457ca..80913365e19a 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1224,8 +1224,8 @@ struct scx_sched_pcpu {
 	/*
 	 * pshard->caps[cap_bit] is the set of cids the sched holds that one
 	 * cap on. ecaps is its transpose: the set of SCX_CAP_* bits the sched
-	 * holds on this cpu, collected so that the hot-path check is a single
-	 * read.
+	 * effectively holds on this cpu, with implied caps folded in, so that
+	 * the hot-path check is a single read.
 	 *
 	 * While pshard->caps[] under pshard->lock is the target configuration,
 	 * ecaps is the effective copy owned by the cpu. It is written under the
@@ -1287,20 +1287,28 @@ struct scx_sched_pnode {
  * the allocation pattern.
  *
  * ENQ_IMMED  insert an IMMED task onto the cid's local DSQ
+ *
+ * ENQ        insert any task onto the cid's local DSQ (implies ENQ_IMMED)
+ *
+ * Implied caps apply to the holder's own use of a cid, not to delegation.
+ * scx_bpf_sub_grant() delegates literally-held caps, so a cap held only through
+ * implication is usable but cannot be re-delegated to a child.
  */
 enum scx_cap_flags {
 	__SCX_CAP_ENQ_IMMED		= 0,
+	__SCX_CAP_ENQ			= 1,
 
 	__SCX_NR_CAPS,
 	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
 
 	SCX_CAP_ENQ_IMMED		= BIT_U64(__SCX_CAP_ENQ_IMMED),
+	SCX_CAP_ENQ			= BIT_U64(__SCX_CAP_ENQ),
 
 	/* alias for minimal cap to make any use of a cpu */
 	SCX_CAP_BASE			= SCX_CAP_ENQ_IMMED,
 
 	/* caps whose loss strands queued tasks, see scx_process_sync_ecaps() */
-	SCX_CAPS_REENQ_ON_LOSS		= SCX_CAP_ENQ_IMMED,
+	SCX_CAPS_REENQ_ON_LOSS		= SCX_CAP_ENQ_IMMED | SCX_CAP_ENQ,
 };
 
 #ifdef CONFIG_EXT_SUB_SCHED
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index ce626a29b33b..7d8c1632f58f 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -105,18 +105,26 @@ static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed)
 /* map @enq_flags to the SCX_CAP_* bit required for the local-DSQ insert */
 static inline u64 scx_caps_for_enq(u64 enq_flags)
 {
-	return SCX_CAP_ENQ_IMMED;
+	if (enq_flags & SCX_ENQ_IMMED)
+		return SCX_CAP_ENQ_IMMED;
+	return SCX_CAP_ENQ;
 }
 
 /* map queued @p to the SCX_CAP_* bit required to stay on its local DSQ */
 static inline u64 scx_caps_for_task(struct task_struct *p)
 {
-	return SCX_CAP_ENQ_IMMED;
+	if (p->scx.flags & SCX_TASK_IMMED)
+		return SCX_CAP_ENQ_IMMED;
+	return SCX_CAP_ENQ;
 }
 
 /* caps implied by holding @cap */
 static inline u64 scx_caps_implied(u64 cap)
 {
+	switch (cap) {
+	case SCX_CAP_ENQ:
+		return SCX_CAP_ENQ_IMMED;
+	}
 	return 0;
 }
 
-- 
2.54.0
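
How the two enqueue caps relate can be checked with a small userspace
sketch. The bit positions and the ENQ-implies-ENQ_IMMED rule come from the
patch above. The helper names, the caps_effective() folding loop and the
"return the missing bits" reading of the cap check are assumptions for
illustration; the kernel's scx_missing_caps() takes a scheduler and a cpu,
and its body is not shown in this part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as in the patch's enum scx_cap_flags. */
#define CAP_ENQ_IMMED	(1ULL << 0)
#define CAP_ENQ		(1ULL << 1)

static_assert((CAP_ENQ & CAP_ENQ_IMMED) == 0, "caps are distinct bits");

/* Caps implied by holding a single @cap: ENQ implies ENQ_IMMED. */
static uint64_t caps_implied(uint64_t cap)
{
	return cap == CAP_ENQ ? CAP_ENQ_IMMED : 0;
}

/*
 * Hypothetical helper: fold implied caps into a literally-held set to get
 * the effective set. One pass is enough for the two caps modelled here.
 */
static uint64_t caps_effective(uint64_t held)
{
	uint64_t eff = held;

	for (int bit = 0; bit < 64; bit++)
		if (held & (1ULL << bit))
			eff |= caps_implied(1ULL << bit);
	return eff;
}

/* Cap needed for an insert: IMMED inserts need only the weaker cap. */
static uint64_t caps_for_enq(bool immed)
{
	return immed ? CAP_ENQ_IMMED : CAP_ENQ;
}

/* Bits of @needed that the effective set lacks; 0 means allowed. */
static uint64_t missing_caps(uint64_t effective, uint64_t needed)
{
	return needed & ~effective;
}
```

A holder of ENQ alone passes both the plain and the IMMED check, while a
holder of ENQ_IMMED alone passes only the IMMED check.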



* [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (25 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 26/32] sched_ext: Add the SCX_CAP_ENQ cap Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 14:01   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds Tejun Heo
                   ` (4 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A kick forces a scheduling event on the target cpu, and a preemption also
evicts the running task. Gate both on caps. Any kick requires baseline
access on the cid, and preempting a task the sub-sched does not own -
whether by a SCX_ENQ_PREEMPT insert or a SCX_KICK_PREEMPT kick - requires
the new SCX_CAP_PREEMPT. Gating only one of the two would leave a hole:
either the weakest cap would authorize preempting kicks, or plain kicks
could disturb cpus the kicker has no access to.

Preempting the sched's own subtree is always allowed, and the cap extends
the right to any task on the cid. PREEMPT implies ENQ and, through it,
ENQ_IMMED.

A preempting insert tests the running task under the target rq lock and is
rejected and reenqueued unless the victim is in the inserter's subtree or
the inserter holds PREEMPT. A migration-disabled task is admitted
regardless, but with SCX_ENQ_PREEMPT stripped.

Kicks are enforced on the delivery path, where the effective caps can be
read coherently under the target rq's lock. A kick from a sub-sched lacking
SCX_CAP_BASE on the cid is dropped, and a SCX_KICK_PREEMPT kick without
PREEMPT for a task outside the kicker's subtree degrades to a plain
reschedule.

Unlike the enqueue caps, PREEMPT is checked only at the instant of the
insert or kick, never as a standing property of a queued task.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      | 34 +++++++++++++++++++++++++---------
 kernel/sched/ext/internal.h | 24 ++++++++++++++++++++++--
 kernel/sched/ext/sub.c      | 20 +++++++++++++++-----
 kernel/sched/ext/sub.h      | 15 +++++++++++++++
 4 files changed, 77 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 7e521dc7e1b7..5a2c96bf8aa9 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -294,7 +294,7 @@ static bool u32_before(u32 a, u32 b)
  *
  * Test whether @sch is a descendant of @ancestor.
  */
-static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
+bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
 {
 	if (sch->level < ancestor->level)
 		return false;
@@ -4903,6 +4903,7 @@ SCX_ATTR(events);
 static const char *scx_cap_names[__SCX_NR_CAPS] = {
 	[__SCX_CAP_ENQ_IMMED]	= "enq_immed",
 	[__SCX_CAP_ENQ]		= "enq",
+	[__SCX_CAP_PREEMPT]	= "preempt",
 };
 
 static ssize_t scx_attr_caps_show(struct kobject *kobj,
@@ -7812,13 +7813,22 @@ static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_r
 	 * During CPU hotplug, a CPU may depend on kicking itself to make
 	 * forward progress. Allow kicking self regardless of online state. If
 	 * @cpu is running a higher class task, we have no control over @cpu.
-	 * Skip kicking.
+	 * Skip kicking. A sub-sched lacking baseline access on @cid has no
+	 * business forcing a reschedule there - skip. This is the authoritative
+	 * cap check: ecaps is read here under @rq's lock.
 	 */
 	if ((cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
-	    !sched_class_above(cur_class, &ext_sched_class)) {
+	    !sched_class_above(cur_class, &ext_sched_class) &&
+	    !scx_missing_caps(pcpu->sch, cpu, SCX_CAP_BASE)) {
 		if (cpumask_test_cpu(cpu, pcpu->cpus_to_preempt)) {
-			if (cur_class == &ext_sched_class)
-				set_task_slice(rq->curr, 0);
+			if (cur_class == &ext_sched_class) {
+				if (likely(!scx_missing_caps(pcpu->sch, cpu,
+							     scx_caps_for_preempt(pcpu->sch, rq))))
+					set_task_slice(rq->curr, 0);
+				else
+					__scx_add_event(pcpu->sch,
+							SCX_EV_SUB_PREEMPT_DENIED, 1);
+			}
 			cpumask_clear_cpu(cpu, pcpu->cpus_to_preempt);
 		}
 
@@ -7842,15 +7852,18 @@ static bool kick_one_cpu(s32 cpu, struct scx_sched_pcpu *pcpu, struct rq *this_r
 	return should_wait;
 }
 
-static void kick_one_cpu_if_idle(s32 cpu, struct rq *this_rq)
+static void kick_one_cpu_if_idle(s32 cpu, struct scx_sched_pcpu *pcpu,
+				 struct rq *this_rq)
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
 
 	raw_spin_rq_lock_irqsave(rq, flags);
 
+	/* idle kicks need baseline access too, see kick_one_cpu() */
 	if (!can_skip_idle_kick(rq) &&
-	    (cpu_online(cpu) || cpu == cpu_of(this_rq)))
+	    (cpu_online(cpu) || cpu == cpu_of(this_rq)) &&
+	    !scx_missing_caps(pcpu->sch, cpu, SCX_CAP_BASE))
 		resched_curr(rq);
 
 	raw_spin_rq_unlock_irqrestore(rq, flags);
@@ -7887,7 +7900,7 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
 		}
 
 		for_each_cpu(cpu, pcpu->cpus_to_kick_if_idle) {
-			kick_one_cpu_if_idle(cpu, this_rq);
+			kick_one_cpu_if_idle(cpu, pcpu, this_rq);
 			cpumask_clear_cpu(cpu, pcpu->cpus_to_kick_if_idle);
 		}
 	}
@@ -8912,7 +8925,10 @@ __bpf_kfunc void scx_bpf_kick_cpu(s32 cpu, u64 flags, const struct bpf_prog_aux
  * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
  *
  * cid-addressed equivalent of scx_bpf_kick_cpu(). An invalid @cid aborts the
- * scheduler via scx_cid_to_cpu().
+ * scheduler via scx_cid_to_cpu(). Caps are enforced on the delivery path: a
+ * kick is dropped if the caller lacks baseline access on @cid, and a
+ * %SCX_KICK_PREEMPT degrades to a plain reschedule if the caller lacks
+ * %SCX_CAP_PREEMPT for a task outside its subtree.
  */
 __bpf_kfunc void scx_bpf_kick_cid(s32 cid, u64 flags, const struct bpf_prog_aux *aux)
 {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 80913365e19a..6e2daf90a4ac 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1155,6 +1155,13 @@ struct scx_event_stats {
 	 * it can't be rejected. The violation is counted here.
 	 */
 	s64		SCX_EV_SUB_FORCED_ADMIT;
+
+	/*
+	 * The number of times a preempting kick was refused because the
+	 * sub-sched lacked SCX_CAP_PREEMPT for a task outside its subtree. The
+	 * kick degrades to a plain reschedule.
+	 */
+	s64		SCX_EV_SUB_PREEMPT_DENIED;
 };
 
 #define SCX_EVENTS_LIST(SCX_EVENT)					\
@@ -1173,7 +1180,8 @@ struct scx_event_stats {
 	SCX_EVENT(SCX_EV_BYPASS_ACTIVATE);				\
 	SCX_EVENT(SCX_EV_INSERT_NOT_OWNED);				\
 	SCX_EVENT(SCX_EV_SUB_BYPASS_DISPATCH);				\
-	SCX_EVENT(SCX_EV_SUB_FORCED_ADMIT)
+	SCX_EVENT(SCX_EV_SUB_FORCED_ADMIT);				\
+	SCX_EVENT(SCX_EV_SUB_PREEMPT_DENIED)
 
 struct scx_sched;
 
@@ -1287,22 +1295,33 @@ struct scx_sched_pnode {
  * the allocation pattern.
  *
  * ENQ_IMMED  insert an IMMED task onto the cid's local DSQ
+ *            - kick the cid's cpu (except SCX_KICK_PREEMPT)
  *
  * ENQ        insert any task onto the cid's local DSQ (implies ENQ_IMMED)
  *
+ * PREEMPT    preempt any task running on the cid regardless of the owning
+ *            sched (implies ENQ). Preempting a task in the sched's own subtree
+ *            doesn't require any cap.
+ *            - SCX_ENQ_PREEMPT inserts
+ *            - SCX_KICK_PREEMPT kicks
+ *
  * Implied caps apply to the holder's own use of a cid, not to delegation.
  * scx_bpf_sub_grant() delegates literally-held caps, so a cap held only through
- * implication is usable but cannot be re-delegated to a child.
+ * implication is usable but cannot be re-delegated to a child. When granting a
+ * cap, it usually makes sense to delegate its implied caps explicitly alongside
+ * it.
  */
 enum scx_cap_flags {
 	__SCX_CAP_ENQ_IMMED		= 0,
 	__SCX_CAP_ENQ			= 1,
+	__SCX_CAP_PREEMPT		= 2,
 
 	__SCX_NR_CAPS,
 	__SCX_CAP_ALL			= BIT_U64(__SCX_NR_CAPS) - 1,
 
 	SCX_CAP_ENQ_IMMED		= BIT_U64(__SCX_CAP_ENQ_IMMED),
 	SCX_CAP_ENQ			= BIT_U64(__SCX_CAP_ENQ),
+	SCX_CAP_PREEMPT			= BIT_U64(__SCX_CAP_PREEMPT),
 
 	/* alias for minimal cap to make any use of a cpu */
 	SCX_CAP_BASE			= SCX_CAP_ENQ_IMMED,
@@ -1911,6 +1930,7 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 					  struct scx_sched *parent);
 int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
 int scx_sched_sysfs_add(struct scx_sched *sch);
+bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor);
 
 extern raw_spinlock_t scx_sched_lock;
 extern struct mutex scx_enable_mutex;
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 2f1e19db8e72..67ba352828e0 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -209,11 +209,14 @@ void scx_init_root_caps(struct scx_sched *sch)
  * @sch: enqueuing sub-sched
  * @rq: rq whose local DSQ @p targets
  * @p: task being inserted
- * @enq_flags: in/out; %SCX_ENQ_IMMED is cleared when diverting to reject
+ * @enq_flags: in/out, unhonored flags are cleared
  *
  * Return @rq's local DSQ if @sch holds the required caps on @rq's cid,
  * otherwise @rq's reject DSQ after recording the reenq reason on @p.
  *
+ * %SCX_ENQ_IMMED and %SCX_ENQ_PREEMPT are cleared when diverting to reject.
+ * %SCX_ENQ_PREEMPT is also cleared on a fallback migration-disabled admission.
+ *
  * Bypass doesn't need special-casing as a bypassing sched's tasks are enqueued
  * to and run by its nearest non-bypassing ancestor. If root is bypassing, it
  * always holds all caps.
@@ -222,7 +225,12 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 					       struct task_struct *p, u64 *enq_flags)
 {
 	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
-	u64 missing = scx_missing_caps(sch, cpu_of(rq), scx_caps_for_enq(*enq_flags));
+	u64 needed = scx_caps_for_enq(*enq_flags);
+	u64 missing;
+
+	if (*enq_flags & SCX_ENQ_PREEMPT)
+		needed |= scx_caps_for_preempt(sch, rq);
+	missing = scx_missing_caps(sch, cpu_of(rq), needed);
 
 	/* requirements met */
 	if (likely(!missing))
@@ -230,10 +238,11 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 
 	/*
 	 * A migration-disabled task must run on this CPU. Let it run and count
-	 * the violation.
+	 * the violation. Refuse preemptions.
 	 */
 	if (unlikely(is_migration_disabled(p))) {
 		__scx_add_event(sch, SCX_EV_SUB_FORCED_ADMIT, 1);
+		*enq_flags &= ~SCX_ENQ_PREEMPT;
 		return &rq->scx.local_dsq;
 	}
 
@@ -243,9 +252,10 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
 	/*
 	 * Only local DSQ can honor IMMED and dsq_inc_nr() WARNs on IMMED into
 	 * others. Strip both the enq flag and the sticky task flag - the
-	 * latter can carry in from an earlier admitted IMMED insert.
+	 * latter can carry in from an earlier admitted IMMED insert. Strip
+	 * PREEMPT too.
 	 */
-	*enq_flags &= ~SCX_ENQ_IMMED;
+	*enq_flags &= ~(SCX_ENQ_IMMED | SCX_ENQ_PREEMPT);
 	p->scx.flags &= ~SCX_TASK_IMMED;
 
 	return &rq->scx.reject_dsq;
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 7d8c1632f58f..9f74c142b73f 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -118,10 +118,24 @@ static inline u64 scx_caps_for_task(struct task_struct *p)
 	return SCX_CAP_ENQ;
 }
 
+/* the cap @sch needs to preempt @rq's current task, 0 if none */
+static inline u64 scx_caps_for_preempt(struct scx_sched *sch, struct rq *rq)
+{
+	struct task_struct *curr = rq->curr;
+
+	/* a non-ext task can't be preempted by ext, own-subtree needs no cap */
+	if (curr->sched_class != &ext_sched_class ||
+	    scx_is_descendant(scx_task_sched(curr), sch))
+		return 0;
+	return SCX_CAP_PREEMPT;
+}
+
 /* caps implied by holding @cap */
 static inline u64 scx_caps_implied(u64 cap)
 {
 	switch (cap) {
+	case SCX_CAP_PREEMPT:
+		return SCX_CAP_ENQ | SCX_CAP_ENQ_IMMED;
 	case SCX_CAP_ENQ:
 		return SCX_CAP_ENQ_IMMED;
 	}
@@ -141,6 +155,7 @@ static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p
 #else	/* CONFIG_EXT_SUB_SCHED */
 
 static inline u64 scx_missing_caps(struct scx_sched *sch, s32 cpu, u64 needed) { return 0; }
+static inline u64 scx_caps_for_preempt(struct scx_sched *sch, struct rq *rq) { return 0; }
 static inline bool scx_task_can_stay_on_cpu(struct rq *rq, struct task_struct *p) { return true; }
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
-- 
2.54.0



* [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (26 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 14:14   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass Tejun Heo
                   ` (3 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

__scx_update_idle() notifies only the root scheduler. A sub-scheduler that
holds a cid needs that cid's idle state to place tasks on it and kick it.

Deliver ops.update_idle() to every scheduler that holds SCX_CAP_BASE on the
transitioning cid. The root holds every cap, so a real transition always
reaches it.

Real transitions are not enough on their own. A cid that is already idle
when a sub-sched gains baseline access produces no transition, so the new
holder would never learn it is idle. The ecaps sync arms a re-notify on the
gain, and the next idle pick delivers ops.update_idle() to just that sched,
leaving holders that already track the cpu untouched. A matching loss of
baseline access drops any pending re-notify.

Bypass suppresses ops.update_idle() too, so a cpu that goes idle during a
bypass window and stays idle yields no transition to re-deliver on
un-bypass. Arm the same re-notify for every sched leaving bypass. The acute
case is a child granted cids during its own ops.sub_attach(). The grant
lands while the child is bypassed and the notify walk skips it, so on
un-bypass it holds cids it never saw go idle. The root is owed the same and
is armed through a separate per-rq flag, which keeps this working when
sub-schedulers are compiled out.
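
For readers who want to see the arm/flush protocol in isolation, here is a
minimal userspace sketch. It is a model under stated assumptions, not the
kernel code: every name in it (model_sched, model_rq, model_gain_base(),
model_lose_base(), model_update_idle()) is hypothetical, the hierarchy is
flattened into an array instead of a pre-order walk, and the root's separate
per-rq flag is folded into the single gate.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SCHEDS 3	/* hypothetical: index 0 stands in for the root */

struct model_sched {
	bool has_base;		/* holds baseline access on this cpu */
	bool idle_renotify;	/* owed a forced update_idle() */
	int notified;		/* update_idle() deliveries so far */
};

struct model_rq {
	bool renotify_gate;	/* at least one sched is owed a re-notify */
	struct model_sched scheds[NR_SCHEDS];
};

/* A sched gains baseline access: arm its marker and the per-rq gate. */
static void model_gain_base(struct model_rq *rq, int idx)
{
	assert(idx >= 0 && idx < NR_SCHEDS);
	rq->scheds[idx].has_base = true;
	rq->scheds[idx].idle_renotify = true;
	rq->renotify_gate = true;
}

/* A sched loses baseline access: drop any pending re-notify. */
static void model_lose_base(struct model_rq *rq, int idx)
{
	assert(idx >= 0 && idx < NR_SCHEDS);
	rq->scheds[idx].has_base = false;
	rq->scheds[idx].idle_renotify = false;
}

/*
 * A real transition (do_notify) reaches every holder. An idle pick with the
 * gate armed reaches only the scheds whose marker is set.
 */
static void model_update_idle(struct model_rq *rq, bool idle, bool do_notify)
{
	if (!do_notify && !(idle && rq->renotify_gate))
		return;

	rq->renotify_gate = false;

	for (int i = 0; i < NR_SCHEDS; i++) {
		struct model_sched *s = &rq->scheds[i];
		bool forced = false;

		if (!s->has_base)
			continue;
		if (s->idle_renotify) {
			s->idle_renotify = false;
			forced = true;
		}
		if (do_notify || forced)
			s->notified++;
	}
}
```

In this model a gain followed by an idle pick with do_notify clear bumps only
the new holder's counter, and a gain that is revoked before the next idle pick
delivers nothing.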

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c      | 40 ++++++++++++++++++++--
 kernel/sched/ext/idle.c     | 68 +++++++++++++++++++++++++++++++------
 kernel/sched/ext/internal.h |  2 ++
 kernel/sched/ext/sub.c      | 24 ++++++++++++-
 kernel/sched/sched.h        |  2 ++
 5 files changed, 122 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 5a2c96bf8aa9..bd934928d31d 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -5476,6 +5476,38 @@ void scx_disable_bypass_dsp(struct scx_sched *sch)
 	}
 }
 
+/**
+ * unbypass_renotify_idle - Arm an idle re-notify for a sched leaving bypass
+ * @rq: rq of the cpu leaving bypass
+ * @pos: scheduler that just left bypass on @rq's cpu
+ * @pcpu: @pos's per-cpu state for @rq's cpu
+ *
+ * A sched leaving bypass is owed the ops.update_idle() calls suppressed while
+ * bypassing. A cpu that goes idle during the bypass window and stays idle won't
+ * produce a notification. Arm a re-notify that scx_bypass()'s resched flushes
+ * on the next idle pick.
+ *
+ * An acute case is ops.sub_attach(). If the parent grants the child cids while
+ * attaching, when attach is complete and bypass is lifted, the child may hold
+ * idle cids it never saw go idle.
+ *
+ * The root is no exception as bypass suppresses its notifications the same way.
+ * However, the root uses a separate per-rq flag so its re-notify keeps working
+ * even when !CONFIG_EXT_SUB_SCHED.
+ */
+static void unbypass_renotify_idle(struct rq *rq, struct scx_sched *pos,
+				   struct scx_sched_pcpu *pcpu)
+{
+	if (pos == scx_root) {
+		rq->scx.flags |= SCX_RQ_ROOT_IDLE_RENOTIFY;
+		return;
+	}
+#ifdef CONFIG_EXT_SUB_SCHED
+	pcpu->idle_renotify = true;
+	rq->scx.flags |= SCX_RQ_SUB_IDLE_RENOTIFY;
+#endif
+}
+
 /**
  * scx_bypass - [Un]bypass scx_ops and guarantee forward progress
  * @sch: sched to bypass
@@ -5559,11 +5591,15 @@ void scx_bypass(struct scx_sched *sch, bool bypass)
 
 		scx_for_each_descendant_pre(pos, sch) {
 			struct scx_sched_pcpu *pcpu = per_cpu_ptr(pos->pcpu, cpu);
+			bool was_bypassing = pcpu->flags & SCX_SCHED_PCPU_BYPASSING;
 
-			if (pos->bypass_depth)
+			if (pos->bypass_depth) {
 				pcpu->flags |= SCX_SCHED_PCPU_BYPASSING;
-			else
+			} else {
 				pcpu->flags &= ~SCX_SCHED_PCPU_BYPASSING;
+				if (was_bypassing)
+					unbypass_renotify_idle(rq, pos, pcpu);
+			}
 		}
 
 		raw_spin_unlock(&scx_sched_lock);
diff --git a/kernel/sched/ext/idle.c b/kernel/sched/ext/idle.c
index 8e8c6201b7df..04b320f89b6f 100644
--- a/kernel/sched/ext/idle.c
+++ b/kernel/sched/ext/idle.c
@@ -12,6 +12,7 @@
 #include "internal.h"
 #include "cid.h"
 #include "idle.h"
+#include "sub.h"
 
 /* Enable/disable built-in idle CPU selection policy */
 static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
@@ -730,6 +731,46 @@ static void update_builtin_idle(int cpu, bool idle)
 	}
 }
 
+/*
+ * Notify schedulers of an idle transition on @cpu's cid, delivering to every
+ * sched that holds %SCX_CAP_BASE on the cid (the root holds every cap). A real
+ * transition (@do_notify) reaches all holders. A forced one (@root_renotify for
+ * the root, a sub-sched's idle_renotify marker for a sub) reaches only the owed
+ * scheds.
+ */
+static void scx_idle_notify(struct rq *rq, bool idle, bool do_notify, bool root_renotify)
+{
+	s32 cpu = cpu_of(rq);
+	s32 cid = scx_cpu_arg(cpu);
+	struct scx_sched *pos;
+
+	lockdep_assert_rq_held(rq);
+
+	pos = scx_next_descendant_pre(NULL, scx_root);
+	while (pos) {
+		bool forced = false;
+
+		if (unlikely(scx_missing_caps(pos, cpu, SCX_CAP_BASE))) {
+			pos = scx_skip_subtree_pre(pos, scx_root);
+			continue;
+		}
+
+		if (pos == scx_root) {
+			forced = root_renotify;
+		}
+#ifdef CONFIG_EXT_SUB_SCHED
+		else if (per_cpu_ptr(pos->pcpu, cpu)->idle_renotify) {
+			per_cpu_ptr(pos->pcpu, cpu)->idle_renotify = false;
+			forced = true;
+		}
+#endif
+		if ((do_notify || forced) && SCX_HAS_OP(pos, update_idle) &&
+		    !scx_bypassing(pos, cpu))
+			SCX_CALL_OP(pos, update_idle, rq, cid, idle);
+		pos = scx_next_descendant_pre(pos, scx_root);
+	}
+}
+
 /*
  * Update the idle state of a CPU to @idle.
  *
@@ -748,7 +789,6 @@ static void update_builtin_idle(int cpu, bool idle)
  */
 void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 {
-	struct scx_sched *sch = scx_root;
 	int cpu = cpu_of(rq);
 
 	lockdep_assert_rq_held(rq);
@@ -772,20 +812,26 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
 			update_builtin_idle(cpu, idle);
 
 	/*
-	 * Trigger ops.update_idle() only when transitioning from a task to
-	 * the idle thread and vice versa.
-	 *
-	 * Idle transitions are indicated by do_notify being set to true,
-	 * managed by put_prev_task_idle()/set_next_task_idle().
+	 * ops.update_idle() fires on real idle transitions, indicated by
+	 * @do_notify and managed by put_prev_task_idle()/set_next_task_idle().
+	 * An idle pick also fires it to flush a forced notify owed to a sched
+	 * that missed transitions while bypassed or on a cid it just gained.
+	 * unbypass_renotify_idle() and scx_process_sync_ecaps() arm the per-rq
+	 * gates, and scx_idle_notify() targets the owed scheds.
 	 *
-	 * This must come after builtin idle update so that BPF schedulers can
-	 * create interlocking between ops.update_idle() and ops.enqueue() -
+	 * This must come after the builtin idle update so that BPF schedulers
+	 * can create interlocking between ops.update_idle() and ops.enqueue() -
 	 * either enqueue() sees the idle bit or update_idle() sees the task
 	 * that enqueue() queued.
 	 */
-	if (SCX_HAS_OP(sch, update_idle) && do_notify &&
-	    !scx_bypassing(sch, cpu_of(rq)))
-		SCX_CALL_OP(sch, update_idle, rq, scx_cpu_arg(cpu_of(rq)), idle);
+	if (do_notify ||
+	    (idle && (rq->scx.flags &
+		      (SCX_RQ_SUB_IDLE_RENOTIFY | SCX_RQ_ROOT_IDLE_RENOTIFY)))) {
+		bool root_renotify = rq->scx.flags & SCX_RQ_ROOT_IDLE_RENOTIFY;
+
+		rq->scx.flags &= ~(SCX_RQ_SUB_IDLE_RENOTIFY | SCX_RQ_ROOT_IDLE_RENOTIFY);
+		scx_idle_notify(rq, idle, do_notify, root_renotify);
+	}
 }
 
 static void reset_idle_masks(struct sched_ext_ops *ops)
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 6e2daf90a4ac..272639255e0d 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1244,6 +1244,8 @@ struct scx_sched_pcpu {
 	 */
 	u64			ecaps;
 	struct llist_node	ecaps_to_sync_node;
+	/* owed a forced update_idle() re-notify on this cpu */
+	bool			idle_renotify;
 	/* effective caps as of the last sub_ecaps_updated() delivery */
 	u64			reported_ecaps;
 #endif
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 67ba352828e0..90caf76db8bf 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -461,6 +461,10 @@ static void discard_queued_syncs(struct rq *rq)
  * pshard->caps[] is the target configuration. pcpu->ecaps is the effective
  * transposed copy owned by the cid's cpu and written only here under @rq's
  * lock.
+ *
+ * A sched that newly gains baseline access here is owed an update_idle() so it
+ * learns the cid's idle state. Such a gain arms the per-rq
+ * %SCX_RQ_SUB_IDLE_RENOTIFY gate so the next idle pick delivers it.
  */
 void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 {
@@ -493,7 +497,7 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 		struct scx_sched_pcpu *pcpu =
 			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
 		struct scx_pshard *ps = pcpu->sch->pshard[shard];
-		u64 old, ecaps, lost;
+		u64 old, ecaps, lost, gained;
 
 		init_llist_node(pos);
 
@@ -505,6 +509,7 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 		WRITE_ONCE(pcpu->ecaps, ecaps);
 
 		lost = old & ~ecaps;
+		gained = ecaps & ~old;
 		lost_all |= lost;
 
 		/* tell the sched its effective caps on this cid changed */
@@ -522,6 +527,18 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 			scx_flush_dispatch_buf(pcpu->sch, rq);
 			pcpu->reported_ecaps = ecaps;
 		}
+
+		/*
+		 * Gaining baseline access owes an update_idle() so the sched
+		 * learns the cpu's idle state. Arm the per-rq gate so the next
+		 * idle pick flushes it. Losing access drops any pending notify.
+		 */
+		if (gained & SCX_CAP_BASE) {
+			pcpu->idle_renotify = true;
+			rq->scx.flags |= SCX_RQ_SUB_IDLE_RENOTIFY;
+		} else if (lost & SCX_CAP_BASE) {
+			pcpu->idle_renotify = false;
+		}
 	}
 
 	/*
@@ -1386,6 +1403,11 @@ __bpf_kfunc s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps,
 
 				caps_updated_record(cps, changed_cids, granted_caps,
 						    &to_deliver);
+				/*
+				 * The sync arms an update_idle() re-notify if
+				 * the cid gains baseline access, so the holder
+				 * learns of an already-idle cid.
+				 */
 				scx_cmask_for_each_cid(cid, changed_cids)
 					queue_sync_ecaps(child, cid);
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8db6b09d91bf..2f9a6a98a3c9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -787,6 +787,8 @@ enum scx_rq_flags {
 	SCX_RQ_BAL_KEEP		= 1 << 3, /* balance decided to keep current */
 	SCX_RQ_CLK_VALID	= 1 << 5, /* RQ clock is fresh and valid */
 	SCX_RQ_BAL_CB_PENDING	= 1 << 6, /* must queue a cb after dispatching */
+	SCX_RQ_SUB_IDLE_RENOTIFY	= 1 << 7, /* sub-scheds are owed update_idle() */
+	SCX_RQ_ROOT_IDLE_RENOTIFY	= 1 << 8, /* the root is owed update_idle() */
 
 	SCX_RQ_IN_WAKEUP	= 1 << 16,
 	SCX_RQ_IN_BALANCE	= 1 << 17,
-- 
2.54.0



* [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (27 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 14:28   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler Tejun Heo
                   ` (2 subsequent siblings)
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

scx_process_sync_ecaps() consumes ecaps syncs while the sched is bypassing
without delivering ops.sub_ecaps_updated(), leaving reported_ecaps stale.
Nothing re-queues a sync when bypass lifts, so a sched is never notified
about a cid whose caps do not change again. Attach-time initial grants hit
this every time: they are consumed during the enable bypass window, so a
sched never learns its initial effective caps through the callback.

Re-queue a sync for every (sched, cpu) with an undelivered delta at the
per-cpu bypass exit in scx_bypass(), next to the idle renotify catch-up. The
next balance on the cpu then delivers the pending delta with proper dispatch
context.
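
The lost-notification window and the replay can be illustrated with a small
userspace model. This is a sketch under assumptions, not the kernel
implementation: model_pcpu and the model_*() helpers are hypothetical names,
target_caps stands in for the pshard configuration, and the model assumes the
callback is delivered only when ecaps differs from reported_ecaps.

```c
#include <stdbool.h>
#include <stdint.h>

struct model_pcpu {
	uint64_t target_caps;	 /* stand-in for the pshard target config */
	uint64_t ecaps;		 /* effective caps on this cpu */
	uint64_t reported_ecaps; /* caps as of the last delivered callback */
	bool bypassing;
	bool sync_queued;
	int callbacks;		 /* sub_ecaps_updated() deliveries so far */
};

/* A parent grants caps: update the target and queue a sync. */
static void model_grant(struct model_pcpu *pcpu, uint64_t caps)
{
	pcpu->target_caps |= caps;
	pcpu->sync_queued = true;
}

/*
 * Balance consumes a queued sync. The effective caps are always updated, but
 * the callback is suppressed while bypassing, which leaves reported_ecaps
 * stale.
 */
static void model_balance(struct model_pcpu *pcpu)
{
	if (!pcpu->sync_queued)
		return;

	pcpu->sync_queued = false;
	pcpu->ecaps = pcpu->target_caps;

	if (pcpu->bypassing)
		return;

	if (pcpu->ecaps != pcpu->reported_ecaps) {
		pcpu->callbacks++;
		pcpu->reported_ecaps = pcpu->ecaps;
	}
}

/* Bypass exit: re-queue a sync only if a delta is still undelivered. */
static void model_unbypass(struct model_pcpu *pcpu)
{
	pcpu->bypassing = false;
	if (pcpu->ecaps != pcpu->reported_ecaps)
		pcpu->sync_queued = true;
}
```

Without the re-queue in model_unbypass(), a grant consumed during bypass would
leave callbacks at zero until some later, unrelated caps change.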

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c |  4 +++-
 kernel/sched/ext/sub.c | 35 +++++++++++++++++++++++++++++++++++
 kernel/sched/ext/sub.h |  2 ++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index bd934928d31d..4c5c80393c2d 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -5597,8 +5597,10 @@ void scx_bypass(struct scx_sched *sch, bool bypass)
 				pcpu->flags |= SCX_SCHED_PCPU_BYPASSING;
 			} else {
 				pcpu->flags &= ~SCX_SCHED_PCPU_BYPASSING;
-				if (was_bypassing)
+				if (was_bypassing) {
 					unbypass_renotify_idle(rq, pos, pcpu);
+					scx_unbypass_replay_ecaps(rq, pos);
+				}
 			}
 		}
 
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 90caf76db8bf..15edcf4f81ee 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -550,6 +550,41 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
 		scx_schedule_reenq_local(rq, SCX_REENQ_CAP_REVOKE);
 }
 
+/**
+ * scx_unbypass_replay_ecaps - Replay a bypass-suppressed ecaps notification
+ * @rq: rq of the cpu leaving bypass
+ * @sch: scheduler that just left bypass on @rq's cpu
+ *
+ * scx_process_sync_ecaps() consumes syncs while bypassing without delivering
+ * ops.sub_ecaps_updated(), leaving reported_ecaps stale. Nothing re-queues a
+ * sync when bypass lifts, so without a replay a cid that never changes again
+ * would never be notified. The attach-time initial grants are the acute case
+ * as they are consumed during the enable bypass window. Re-queue a sync for
+ * any undelivered delta so the next balance delivers it.
+ */
+void scx_unbypass_replay_ecaps(struct rq *rq, struct scx_sched *sch)
+{
+	s32 cpu = cpu_of(rq);
+	struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+	struct scx_pshard *ps;
+	s32 cid;
+
+	lockdep_assert_rq_held(rq);
+
+	/* root holds every cap and never uses ecaps */
+	if (!sch->level)
+		return;
+
+	if (READ_ONCE(pcpu->ecaps) == pcpu->reported_ecaps)
+		return;
+
+	cid = __scx_cpu_to_cid(cpu);
+	ps = sch->pshard[scx_cid_to_shard[cid]];
+
+	guard(raw_spinlock)(&ps->lock);
+	queue_sync_ecaps(sch, cid);
+}
+
 /*
  * A cpu came back. Re-seed each sub-sched's ecaps on the cpu's cid. The sync
  * recomputes effective caps from the pshard and fires ops.sub_ecaps_updated()
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
index 9f74c142b73f..dd33472fadd4 100644
--- a/kernel/sched/ext/sub.h
+++ b/kernel/sched/ext/sub.h
@@ -29,6 +29,7 @@ void scx_free_pshards(struct scx_sched *sch);
 s32 scx_alloc_pshards(struct scx_sched *sch);
 void scx_init_root_caps(struct scx_sched *sch);
 void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev);
+void scx_unbypass_replay_ecaps(struct rq *rq, struct scx_sched *sch);
 void scx_online_ecaps(struct rq *rq);
 void scx_offline_ecaps(struct rq *rq);
 void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu);
@@ -51,6 +52,7 @@ static inline void scx_free_pshards(struct scx_sched *sch) {}
 static inline s32 scx_alloc_pshards(struct scx_sched *sch) { return 0; }
 static inline void scx_init_root_caps(struct scx_sched *sch) {}
 static inline void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev) {}
+static inline void scx_unbypass_replay_ecaps(struct rq *rq, struct scx_sched *sch) {}
 static inline void scx_online_ecaps(struct rq *rq) {}
 static inline void scx_offline_ecaps(struct rq *rq) {}
 static inline void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu) {}
-- 
2.54.0



* [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (28 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 14:45   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling Tejun Heo
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 32/32] tools/sched_ext: scx_qmap - Add sub-sched cap fault injection Tejun Heo
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A cid-form scheduler can grant caps to and revoke them from its child
sub-schedulers but has no way to tear one down. Add scx_bpf_sub_kill() to
evict a direct child with a printf-style reason that reaches the child's
scx_exit_info. No exit code is taken because the child is a separate
scheduler whose exit-code semantics the parent cannot know. The child and
its subtree are disabled through the usual async path under a new exit kind,
SCX_EXIT_PARENT_KILL.

The bstr formatting infrastructure in ext.c is exposed through internal.h
with scx_ prefixes so the kfunc, which lives in sub.c, can format the
reason.
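
As a reading aid, the outcomes a caller can see from the kill path can be
modeled in userspace as below. The sketch mirrors the checks in the kfunc
added by this patch (no such sub-scheduler, not a direct child, success) but
is not the kernel code: model_sched and model_sub_kill() are hypothetical
names, a NULL target stands in for a failed cgroup id lookup, and reason
formatting and the cid-form check are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_sched {
	struct model_sched *parent;	/* NULL for the root */
	bool killed;			/* disabled with the parent-kill exit kind */
	bool aborted;			/* aborted through a scheduler error */
};

/*
 * Returns 0 on success, -ENODEV if the target does not exist (which can race
 * with the child detaching on its own and is not treated as an error of the
 * caller), and -EINVAL after aborting the caller if the target exists but is
 * not a direct child.
 */
static int model_sub_kill(struct model_sched *caller, struct model_sched *target)
{
	if (!target)
		return -ENODEV;

	if (target->parent != caller) {
		caller->aborted = true;
		return -EINVAL;
	}

	target->killed = true;
	return 0;
}
```

A caller that only cares about eviction can treat both 0 and -ENODEV as "the
child is gone or going", since the lookup failure is the benign race described
in the kfunc's comment.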

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 kernel/sched/ext/ext.c                   | 28 +++++++-----
 kernel/sched/ext/internal.h              | 11 +++++
 kernel/sched/ext/sub.c                   | 57 ++++++++++++++++++++++++
 tools/sched_ext/include/scx/common.bpf.h | 18 ++++++++
 4 files changed, 102 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4c5c80393c2d..71081f4085c4 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -187,14 +187,8 @@ static const struct rhashtable_params dsq_hash_params = {
 
 static LLIST_HEAD(dsqs_to_free);
 
-/* string formatting from BPF */
-struct scx_bstr_buf {
-	u64			data[MAX_BPRINTF_VARARGS];
-	char			line[SCX_EXIT_MSG_LEN];
-};
-
-static DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock);
-static struct scx_bstr_buf scx_exit_bstr_buf;
+DEFINE_RAW_SPINLOCK(scx_exit_bstr_buf_lock);
+struct scx_bstr_buf scx_exit_bstr_buf;
 
 /* ops debug dump */
 static DEFINE_RAW_SPINLOCK(scx_dump_lock);
@@ -5690,6 +5684,8 @@ static const char *scx_exit_reason(enum scx_exit_kind kind)
 		return "disabled by sysrq-S";
 	case SCX_EXIT_PARENT:
 		return "parent exiting";
+	case SCX_EXIT_PARENT_KILL:
+		return "killed by parent scheduler";
 	case SCX_EXIT_ERROR:
 		return "runtime error";
 	case SCX_EXIT_ERROR_BPF:
@@ -9268,8 +9264,8 @@ static s32 __bstr_format(struct scx_sched *sch, u64 *data_buf, char *line_buf,
 }
 
 __printf(3, 0)
-static s32 bstr_format(struct scx_sched *sch, struct scx_bstr_buf *buf,
-		       char *fmt, unsigned long long *data, u32 data__sz)
+s32 scx_bstr_format(struct scx_sched *sch, struct scx_bstr_buf *buf,
+		    char *fmt, unsigned long long *data, u32 data__sz)
 {
 	return __bstr_format(sch, buf->data, buf->line, sizeof(buf->line),
 			     fmt, data, data__sz);
@@ -9299,7 +9295,7 @@ __bpf_kfunc void scx_bpf_exit_bstr(s64 exit_code, char *fmt,
 	raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags);
 	sch = scx_prog_sched(aux);
 	if (likely(sch) &&
-	    bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0)
+	    scx_bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0)
 		scx_exit(sch, SCX_EXIT_UNREG_BPF, exit_code, "%s", scx_exit_bstr_buf.line);
 	raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags);
 }
@@ -9324,7 +9320,7 @@ __bpf_kfunc void scx_bpf_error_bstr(char *fmt, unsigned long long *data,
 	raw_spin_lock_irqsave(&scx_exit_bstr_buf_lock, flags);
 	sch = scx_prog_sched(aux);
 	if (likely(sch) &&
-	    bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0)
+	    scx_bstr_format(sch, &scx_exit_bstr_buf, fmt, data, data__sz) >= 0)
 		scx_exit(sch, SCX_EXIT_ERROR_BPF, 0, "%s", scx_exit_bstr_buf.line);
 	raw_spin_unlock_irqrestore(&scx_exit_bstr_buf_lock, flags);
 }
@@ -9960,6 +9956,13 @@ __bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out_
 {
 	return -EOPNOTSUPP;
 }
+
+__bpf_kfunc s32 scx_bpf_sub_kill_bstr(u64 cgroup_id, char *fmt,
+				      unsigned long long *data, u32 data__sz,
+				      const struct bpf_prog_aux *aux)
+{
+	return -EOPNOTSUPP;
+}
 #endif	/* !CONFIG_EXT_SUB_SCHED */
 
 __bpf_kfunc_end_defs();
@@ -10009,6 +10012,7 @@ BTF_ID_FLAGS(func, scx_bpf_task_cgroup, KF_IMPLICIT_ARGS | KF_RCU | KF_ACQUIRE)
 BTF_ID_FLAGS(func, scx_bpf_sub_grant, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_sub_revoke, KF_IMPLICIT_ARGS)
 BTF_ID_FLAGS(func, scx_bpf_sub_caps, KF_IMPLICIT_ARGS)
+BTF_ID_FLAGS(func, scx_bpf_sub_kill_bstr, KF_IMPLICIT_ARGS)
 BTF_KFUNCS_END(scx_kfunc_ids_any)
 
 static const struct btf_kfunc_id_set scx_kfunc_set_any = {
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 272639255e0d..8935bc09ed3b 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -51,6 +51,7 @@ enum scx_exit_kind {
 	SCX_EXIT_UNREG_KERN,	/* kernel-initiated unregistration */
 	SCX_EXIT_SYSRQ,		/* requested by 'S' sysrq */
 	SCX_EXIT_PARENT,	/* parent exiting */
+	SCX_EXIT_PARENT_KILL,	/* killed by parent scheduler */
 
 	SCX_EXIT_ERROR = 1024,	/* runtime error, error msg contains details */
 	SCX_EXIT_ERROR_BPF,	/* ERROR but triggered through scx_bpf_error() */
@@ -1873,6 +1874,12 @@ struct scx_enable_cmd {
 	int			ret;
 };
 
+/* string formatting from BPF */
+struct scx_bstr_buf {
+	u64			data[MAX_BPRINTF_VARARGS];
+	char			line[SCX_EXIT_MSG_LEN];
+};
+
 extern struct scx_sched __rcu *scx_root;
 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
 
@@ -1933,10 +1940,14 @@ struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
 int scx_sched_sysfs_add(struct scx_sched *sch);
 bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor);
+__printf(3, 0) s32 scx_bstr_format(struct scx_sched *sch, struct scx_bstr_buf *buf,
+				   char *fmt, unsigned long long *data, u32 data__sz);
 
 extern raw_spinlock_t scx_sched_lock;
 extern struct mutex scx_enable_mutex;
 extern struct percpu_rw_semaphore scx_fork_rwsem;
+extern raw_spinlock_t scx_exit_bstr_buf_lock;
+extern struct scx_bstr_buf scx_exit_bstr_buf;
 #ifdef CONFIG_EXT_SUB_SCHED
 extern const struct rhashtable_params scx_sched_hash_params;
 extern struct rhashtable scx_sched_hash;
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
index 15edcf4f81ee..4baa6f7be71b 100644
--- a/kernel/sched/ext/sub.c
+++ b/kernel/sched/ext/sub.c
@@ -1636,6 +1636,63 @@ __bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out_
 	return 0;
 }
 
+/**
+ * scx_bpf_sub_kill_bstr - Kill a direct child sub-scheduler
+ * @cgroup_id: cgroup id of the direct child to kill
+ * @fmt: reason message format string
+ * @data: format string parameters packaged using ___bpf_fill() macro
+ * @data__sz: @data len, must end in '__sz' for the verifier
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Evict a direct child sub-scheduler, disabling it with the supplied reason.
+ * The child and its subtree are torn down asynchronously through the usual
+ * disable path.
+ *
+ * Unlike scx_bpf_exit(), no exit code is taken: the child is a separate
+ * scheduler with its own exit-code semantics, so a code chosen by the parent
+ * would have no defined meaning. The reason string carries the intent.
+ *
+ * Return 0 on success or -ENODEV if @cgroup_id names no sub-scheduler, which
+ * can race with the child detaching on its own and so is not a scheduler error.
+ * Naming a sched that exists but is not a direct child aborts the parent.
+ */
+__printf(2, 0)
+__bpf_kfunc s32 scx_bpf_sub_kill_bstr(u64 cgroup_id, char *fmt,
+				      unsigned long long *data, u32 data__sz,
+				      const struct bpf_prog_aux *aux)
+{
+	struct scx_sched *parent, *child;
+	s32 ret;
+
+	guard(rcu)();
+
+	parent = scx_prog_sched(aux);
+	if (unlikely(!parent))
+		return -ENODEV;
+
+	if (!scx_is_cid_type()) {
+		scx_error(parent, "sub-cap kfuncs require a cid-form scheduler");
+		return -EOPNOTSUPP;
+	}
+
+	child = scx_find_sub_sched(cgroup_id);
+	if (unlikely(!child))
+		return -ENODEV;
+
+	if (unlikely(scx_parent(child) != parent)) {
+		scx_error(parent, "%s: sub-%llu is not a direct child",
+			  parent->cgrp_path, cgroup_id);
+		return -EINVAL;
+	}
+
+	guard(raw_spinlock_irqsave)(&scx_exit_bstr_buf_lock);
+	ret = scx_bstr_format(parent, &scx_exit_bstr_buf, fmt, data, data__sz);
+	if (ret < 0)
+		return ret;
+	scx_exit(child, SCX_EXIT_PARENT_KILL, 0, "%s", scx_exit_bstr_buf.line);
+	return 0;
+}
+
 __bpf_kfunc_end_defs();
 
 #endif	/* CONFIG_EXT_SUB_SCHED */
diff --git a/tools/sched_ext/include/scx/common.bpf.h b/tools/sched_ext/include/scx/common.bpf.h
index 09c21602b2ed..acc2b131ea8f 100644
--- a/tools/sched_ext/include/scx/common.bpf.h
+++ b/tools/sched_ext/include/scx/common.bpf.h
@@ -119,6 +119,8 @@ s32 scx_bpf_sub_grant(u64 cgroup_id, u64 caps, const struct scx_cmask *cmask,
 		      struct scx_cmask *denied) __ksym __weak;
 void scx_bpf_sub_revoke(u64 cgroup_id, u64 caps, const struct scx_cmask *cmask) __ksym __weak;
 s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out) __ksym __weak;
+s32 scx_bpf_sub_kill_bstr(u64 cgroup_id, char *fmt,
+			  unsigned long long *data, u32 data__sz) __ksym __weak;
 
 /*
  * Use the following as @it__iter when calling scx_bpf_dsq_move[_vtime]() from
@@ -165,6 +167,22 @@ void ___scx_bpf_bstr_format_checker(const char *fmt, ...) {}
 	___scx_bpf_bstr_format_checker(fmt, ##args);				\
 })
 
+/*
+ * scx_bpf_sub_kill() wraps the scx_bpf_sub_kill_bstr() kfunc with variadic
+ * arguments instead of an array of u64. It kills the direct child sub-scheduler
+ * @cgid, passing the formatted reason to its user space, and evaluates to the
+ * kfunc's return value. On a kernel without sub-scheduler support, the kfunc
+ * is absent and the macro evaluates to -EOPNOTSUPP.
+ */
+#define scx_bpf_sub_kill(cgid, fmt, args...)					\
+({										\
+	scx_bpf_bstr_preamble(fmt, args)					\
+	___scx_bpf_bstr_format_checker(fmt, ##args);				\
+	bpf_ksym_exists(scx_bpf_sub_kill_bstr) ?				\
+		scx_bpf_sub_kill_bstr((cgid), ___fmt, ___param,			\
+				      sizeof(___param)) : -EOPNOTSUPP;		\
+})
+
 /*
  * scx_bpf_error() wraps the scx_bpf_error_bstr() kfunc with variadic arguments
  * instead of an array of u64. Invoking this macro will cause the scheduler to
-- 
2.54.0


* [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (29 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  2026-07-03 14:57   ` sashiko-bot
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 32/32] tools/sched_ext: scx_qmap - Add sub-sched cap fault injection Tejun Heo
  31 siblings, 1 reply; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

sched_ext sub-scheduling began as dispatch delegation only: a parent could
call into a child cgroup sub-scheduler's ops.dispatch() from its own
dispatch path, but could not delegate cpus to the child for enqueue and the
other paths. sched_ext has since gained cap-based cid delegation, where a
parent grants and revokes a child's per-cid caps. Expand scx_qmap to
demonstrate it.

scx_qmap can now delegate the cids it holds exclusively. It splits them
among itself and its children in proportion to cpu.weight, and each
participant gets the floor of its share as dedicated cids. The leftover
from rounding forms a shared pool, which is round-robined among the
participants as an ENQ_IMMED time-share.
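
As a rough illustration of the split arithmetic, here is a standalone
userspace sketch. It is not code from the patch: split_by_weight() and
DFL_WEIGHT are hypothetical names, and plain arrays stand in for the
arena state. It assumes, as the patch does, that an unset weight falls
back to 100.

```c
#include <assert.h>
#include <stdint.h>

#define DFL_WEIGHT 100	/* hypothetical name: fallback for an unset weight */

/*
 * Split nr_excl cids among n participants in proportion to their weights.
 * shares[i] receives the floor of participant i's proportional share.
 * Returns the rounding leftover, i.e. how many cids would form the shared
 * pool.
 */
static int32_t split_by_weight(int32_t nr_excl, const uint32_t *weights,
			       int32_t n, int32_t *shares)
{
	uint64_t sum_w = 0;
	int32_t total = 0, i;

	assert(nr_excl >= 0 && n >= 0);

	/* total weight over every participant */
	for (i = 0; i < n; i++)
		sum_w += weights[i] ? weights[i] : DFL_WEIGHT;

	/* integer division yields the floor of each share */
	for (i = 0; i < n; i++) {
		uint32_t w = weights[i] ? weights[i] : DFL_WEIGHT;

		shares[i] = (int32_t)((uint64_t)nr_excl * w / sum_w);
		total += shares[i];
	}

	return nr_excl - total;
}
```

With 7 cids and three participants of weight 100 each, every participant
gets 2 dedicated cids and 1 cid is left over for the shared pool.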

This shape is deliberate. Exclusive cids exercise the basic grant and revoke
of ownership, and the shared pool exercises time-sharing one cid across
several schedulers. The implemented policy is impractical, but it covers
most of what a practical sub-scheduler would need without overcomplicating
qmap.
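
The time-share handoff can be pictured with the following userspace
sketch. It is an assumption-laden model, not the patch's code: struct
rr_handoff and rr_next() are hypothetical names, and the slot array
mirrors the convention that slot value 0 stands for the node itself, so
no grant or revoke is needed on its turn.

```c
#include <assert.h>
#include <stdint.h>

/* hypothetical result type: what one rotation step has to do */
struct rr_handoff {
	uint64_t revoke_cgid;	/* 0: outgoing holder was self, nothing to revoke */
	uint64_t grant_cgid;	/* 0: incoming holder is self, nothing to grant */
	uint32_t new_pos;	/* rotation position after the step */
};

/*
 * Advance the rotation by one. slots[] lists the participants' cgroup ids
 * in rotation order, with 0 standing for the node itself.
 */
static struct rr_handoff rr_next(const uint64_t *slots, uint32_t nr_rr,
				 uint32_t pos)
{
	struct rr_handoff h;

	assert(nr_rr > 0 && pos < nr_rr);

	h.new_pos = (pos + 1) % nr_rr;
	h.revoke_cgid = slots[pos];		/* loses ENQ_IMMED on the pool */
	h.grant_cgid = slots[h.new_pos];	/* gains ENQ_IMMED on the pool */
	return h;
}
```

In this model a rotation with fewer than two participants is a no-op,
which matches the patch skipping the handoff in that case.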

Delegation nests. A cid a node receives from its parent only as a
round-robin share stays self-local and is never re-delegated. A node left
with no exclusive cid, e.g. after its cpus went offline, evicts its
children.
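
The nesting rule can be sketched as below, using a single 64-bit word
per mask as a stand-in for struct scx_cmask. The names struct held_caps,
classify_held() and must_evict_children() are hypothetical and exist
only for illustration. The sketch assumes that a cid held with ENQ
counts as exclusive and that a cid held with ENQ_IMMED but not ENQ is a
received time-share.

```c
#include <stdbool.h>
#include <stdint.h>

/* hypothetical: one bit per cid, standing in for struct scx_cmask */
struct held_caps {
	uint64_t excl;		/* held with ENQ: may be re-delegated */
	uint64_t shared;	/* held as ENQ_IMMED only: stays self-local */
};

/* Classify what a node holds from the two cap queries on itself. */
static struct held_caps classify_held(uint64_t enq_mask,
				      uint64_t enq_immed_mask)
{
	struct held_caps h = {
		.excl = enq_mask,
		.shared = enq_immed_mask & ~enq_mask,
	};

	return h;
}

/* A node with no exclusive cid has nothing to hand out to children. */
static bool must_evict_children(const struct held_caps *h)
{
	return h->excl == 0;
}
```

A node that holds cids 0-3 exclusively and cids 4-5 as a time-share can
re-delegate only cids 0-3. If its exclusive cids all go away, only the
time-shared cids remain and its children are evicted.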

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 727 ++++++++++++++++++++++++++++++---
 tools/sched_ext/scx_qmap.c     | 296 +++++++++++++-
 tools/sched_ext/scx_qmap.h     |  98 ++++-
 3 files changed, 1061 insertions(+), 60 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index f6cfe63425d3..938a32514b2f 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -1,21 +1,39 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 /*
- * A simple five-level FIFO queue scheduler.
+ * scx_qmap: a demonstration and testing scheduler for sched_ext features.
  *
- * There are five FIFOs implemented as arena-backed doubly-linked lists
- * threaded through per-task context. A task gets assigned to one depending on
- * its compound weight. Each CPU round robins through the FIFOs and dispatches
- * more from FIFOs with higher indices - 1 from queue0, 2 from queue1, 4 from
- * queue2 and so on.
- *
- * This scheduler demonstrates:
+ * A simple scheduler that exercises a broad set of sched_ext features. Unlikely
+ * to be useful for real workloads. It demonstrates:
  *
  * - BPF-side queueing using TIDs.
  * - BPF arena for scheduler state.
  * - Core-sched support.
+ * - Hierarchical sub-scheduling: delegating cpus to child cgroup schedulers.
+ *
+ * Base design: Five FIFOs (arena-backed doubly-linked lists through per-task
+ * context). A task is assigned to a FIFO by its compound weight. Each cpu
+ * round-robins the FIFOs, dispatching more from higher ones.
+ *
+ * Sub-scheduling: Any qmap sched can delegate cpus to its own child cgroup
+ * schedulers and keep the rest for its tasks. Terminology:
+ *
+ *   excl   - A cpu the delegatee owns wholly (ENQ_IMMED|ENQ|PREEMPT).
+ *   shared - A cpu delegated as ENQ_IMMED only. Time-shared.
+ *   held_excl / held_shared - What this node was handed by its parent.
+ *            held-excl cpus are re-delegatable. A held-shared cpu is a
+ *            time-share that stays self-local.
+ *   self   - The excl cpus the node kept for itself, plus all of held_shared.
+ *   owner  - Who holds a cid - a child slot, CID_SELF, or CID_NONE.
  *
- * This scheduler is primarily for demonstration and testing of sched_ext
- * features and unlikely to be useful for actual workloads.
+ * The scheduler splits its held-excl cpus among self and the children in
+ * proportion to each node's cpu.weight, handing each the floor of its share as
+ * excl cpus. The leftover from rounding forms a shared pool the round-robin
+ * timer hands around. With no excl cpu to delegate, the node evicts its
+ * children.
+ *
+ * This policy is a demonstration only, not a practical one. The split
+ * considers only direct children and is not work-conserving. It only exists to
+ * drive sub-sched primitives with as simple logic as possible.
  *
  * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
@@ -52,6 +70,9 @@ const volatile bool always_enq_immed;
 const volatile u32 immed_stress_nth;
 const volatile u32 max_tasks;
 
+/* sub-sched: period for handing the round-robin cid pool to the next child */
+const volatile u64 round_robin_ns;
+
 /*
  * Optional cid-override test harness. When cid_override_mode is non-zero,
  * qmap_init_cids() calls scx_bpf_cid_override() with the caller-supplied arrays
@@ -89,12 +110,12 @@ struct {
 
 struct qmap_arena __arena_global qa;
 
-/*
- * Global idle-cid tracking, maintained via update_idle / cpu_offline and
- * scanned by the direct-dispatch path. Allocated in qmap_init() from one
- * arena page, sized to the full cid space.
- */
-struct scx_cmask __arena *qa_idle_cids;
+/* ensure that BPF and userspace are seeing the same size for qmap_cmask */
+_Static_assert(QMAP_CMASK_WORDS == CMASK_NR_WORDS(SCX_QMAP_MAX_CPUS),
+	       "QMAP_CMASK_WORDS must equal CMASK_NR_WORDS(SCX_QMAP_MAX_CPUS)");
+_Static_assert(sizeof(struct qmap_cmask) ==
+	       struct_size_t(struct scx_cmask, bits, QMAP_CMASK_WORDS),
+	       "qmap_cmask must be exactly sized to back a full scx_cmask");
 
 /* Per-queue locks. Each in its own .data section as bpf_res_spin_lock requires. */
 __hidden struct bpf_res_spin_lock qa_q_lock0 SEC(".data.qa_q_lock0");
@@ -196,7 +217,7 @@ static int qmap_spin_lock(struct bpf_res_spin_lock *lock)
 }
 
 /*
- * Try prev_cid, then scan taskc->cpus_allowed AND qa_idle_cids round-robin
+ * Try prev_cid, then scan taskc->cpus_allowed AND qa.idle_cids round-robin
  * from prev_cid + 1. Atomic claim retries on race; bounded by
  * IDLE_PICK_RETRIES to keep the verifier's insn budget in check.
  */
@@ -212,17 +233,17 @@ static s32 pick_direct_dispatch_cid(struct task_struct *p, s32 prev_cid,
 	if (!always_enq_immed && p->nr_cpus_allowed == 1)
 		return prev_cid;
 
-	if (cmask_test_and_clear(prev_cid, qa_idle_cids))
+	if (cmask_test_and_clear(prev_cid, &qa.idle_cids.mask))
 		return prev_cid;
 
 	cid = prev_cid;
 	bpf_for(i, 0, IDLE_PICK_RETRIES) {
 		cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
-					      qa_idle_cids, cid + 1);
+					      &qa.idle_cids.mask, cid + 1);
 		barrier_var(cid);
 		if (cid >= nr_cids)
 			return -1;
-		if (cmask_test_and_clear(cid, qa_idle_cids))
+		if (cmask_test_and_clear(cid, &qa.idle_cids.mask))
 			return cid;
 	}
 	return -1;
@@ -346,6 +367,15 @@ s32 BPF_STRUCT_OPS(qmap_select_cid, struct task_struct *p,
 	}
 }
 
+/*
+ * A received time-shared cid is held ENQ_IMMED-only, so inserts must set
+ * SCX_ENQ_IMMED.
+ */
+static u64 needs_immed(s32 cid)
+{
+	return qa.cid_shared[cid] ? SCX_ENQ_IMMED : 0;
+}
+
 static int weight_to_idx(u32 weight)
 {
 	/* Coarsely map the compound weight to a FIFO. */
@@ -369,9 +399,16 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	s32 cid;
 
 	if (enq_flags & SCX_ENQ_REENQ) {
+		u64 reason = p->scx.flags & SCX_TASK_REENQ_REASON_MASK;
+
 		__sync_fetch_and_add(&qa.nr_reenqueued, 1);
 		if (scx_bpf_task_cid(p) == 0)
 			__sync_fetch_and_add(&qa.nr_reenqueued_cid0, 1);
+		/* cap-loss and IMMED-handback bounces, relocated below */
+		if (reason == SCX_TASK_REENQ_CAP)
+			__sync_fetch_and_add(&qa.nr_reenq_cap, 1);
+		else if (reason == SCX_TASK_REENQ_IMMED)
+			__sync_fetch_and_add(&qa.nr_reenq_immed, 1);
 	}
 
 	if (p->flags & PF_KTHREAD) {
@@ -394,6 +431,27 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	 */
 	taskc->core_sched_seq = qa.core_sched_tail_seqs[idx]++;
 
+	/*
+	 * A node with children delegates most cids. A task of ours that can run
+	 * on none of our self cids (e.g. a per-NUMA kthread pinned to delegated
+	 * cids) would starve in SHARED/FIFO since we never pull those on a
+	 * delegated cid. Force it onto its first allowed cid's local DSQ with
+	 * needs_immed(): if we hold access there it runs, else the kernel
+	 * rejects and bounces it back via REENQ_CAP. Best-effort
+	 * anti-starvation nudge.
+	 */
+	if (qa.nr_sub_scheds && !(enq_flags & SCX_ENQ_REENQ) &&
+	    !cmask_intersects(&taskc->cpus_allowed, &qa.self_cids.mask)) {
+		s32 c = cmask_next_set_wrap(&taskc->cpus_allowed, 0);
+
+		if (c >= 0 && c < scx_bpf_nr_cids()) {
+			taskc->force_local = false;
+			scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | c, slice_ns,
+					   enq_flags | needs_immed(c));
+			return;
+		}
+	}
+
 	/*
 	 * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch
 	 * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1
@@ -416,7 +474,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	 */
 	if (taskc->force_local) {
 		taskc->force_local = false;
-		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns, enq_flags);
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, slice_ns,
+				   enq_flags | needs_immed(scx_bpf_task_cid(p)));
 		return;
 	}
 
@@ -431,7 +490,8 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 	if (!__COMPAT_is_enq_cpu_selected(enq_flags) &&
 	    (cid = pick_direct_dispatch_cid(p, scx_bpf_task_cid(p), taskc)) >= 0) {
 		__sync_fetch_and_add(&qa.nr_ddsp_from_enq, 1);
-		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cid, slice_ns, enq_flags);
+		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cid, slice_ns,
+				   enq_flags | needs_immed(cid));
 		return;
 	}
 
@@ -446,7 +506,7 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 
 		scx_bpf_dsq_insert(p, SHARED_DSQ, 0, enq_flags);
 		cid = cmask_next_and_set_wrap(&taskc->cpus_allowed,
-					      qa_idle_cids, 0);
+					      &qa.idle_cids.mask, 0);
 		if (cid < scx_bpf_nr_cids())
 			scx_bpf_kick_cid(cid, SCX_KICK_IDLE);
 		return;
@@ -567,11 +627,34 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cid, struct task_struct *prev)
 	struct cpu_ctx __arena *cpuc;
 	task_ctx_t *taskc;
 	u32 batch = dsp_batch ?: 1;
+	s32 owner;
 	s32 i;
 
 	if (dispatch_highpri(false))
 		return;
 
+	/*
+	 * Sub-sched routing: a child-owned cid goes to its owner. Never run
+	 * this node's own tasks on a delegated cid. Read without the guard.
+	 */
+	owner = qa.part.cid_owner[cid];
+	if (owner == CID_SHARED) {
+		/* route to the live rr holder (0 = self, runs below) */
+		s32 pos = qa.part.rr_pos;
+		u64 holder_cgid = (pos >= 0 && pos < MAX_PARTS) ?
+				  qa.part.rr_slots[pos] : 0;
+
+		if (holder_cgid) {
+			scx_bpf_sub_dispatch(holder_cgid);
+			return;
+		}
+	} else if (owner >= 0 && owner < MAX_SUB_SCHEDS &&
+		   qa.sub_sched_ctxs[owner].cgroup_id) {
+		if (scx_bpf_sub_dispatch(qa.sub_sched_ctxs[owner].cgroup_id))
+			qa.sub_sched_ctxs[owner].nr_dsps++;
+		return;
+	}
+
 	if (!qa.nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ, 0))
 		return;
 
@@ -672,12 +755,6 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cid, struct task_struct *prev)
 		cpuc->dsp_cnt = 0;
 	}
 
-	for (i = 0; i < MAX_SUB_SCHEDS; i++) {
-		if (qa.sub_sched_cgroup_ids[i] &&
-		    scx_bpf_sub_dispatch(qa.sub_sched_cgroup_ids[i]))
-			return;
-	}
-
 	/*
 	 * No other tasks. @prev will keep running. Update its core_sched_seq as
 	 * if the task were enqueued and dispatched immediately.
@@ -910,10 +987,17 @@ void BPF_STRUCT_OPS(qmap_cgroup_set_bandwidth, struct cgroup *cgrp,
 void BPF_STRUCT_OPS(qmap_update_idle, s32 cid, bool idle)
 {
 	QMAP_TOUCH_ARENA();
-	if (idle)
-		cmask_set(cid, qa_idle_cids);
+
+	/*
+	 * The kernel delivers update_idle() for every cid this node holds
+	 * SCX_CAP_BASE on, which includes cids delegated to children. Track
+	 * idle only on self_cids so the direct-dispatch path doesn't land a
+	 * task on a delegated cid.
+	 */
+	if (idle && cmask_test(cid, &qa.self_cids.mask))
+		cmask_set(cid, &qa.idle_cids.mask);
 	else
-		cmask_clear(cid, qa_idle_cids);
+		cmask_clear(cid, &qa.idle_cids.mask);
 }
 
 void BPF_STRUCT_OPS(qmap_set_cmask, struct task_struct *p,
@@ -1065,6 +1149,476 @@ static int lowpri_timerfn(void *map, int *key, struct bpf_timer *timer)
 	return 0;
 }
 
+struct round_robin_timer {
+	struct bpf_timer timer;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, u32);
+	__type(value, struct round_robin_timer);
+} round_robin_timer SEC(".maps");
+
+/*
+ * Partition update synchronization. qa.part can be written from concurrent
+ * contexts. This single-runner guard admits one writer at a time without
+ * holding a lock across the grant/revoke kfuncs. part_pending coalesces
+ * repartition requests that arrive while it is held.
+ *
+ * They live in .bss, not the arena: rr_advance() runs from a bpf_timer
+ * callback, where the verifier rejects atomic ops on arena memory.
+ */
+static u64 part_busy;
+static u64 part_pending;
+
+static bool part_try_start(void)
+{
+	/* set busy, report whether it was previously clear (we acquired it) */
+	return !__sync_fetch_and_or(&part_busy, 1);
+}
+
+static void part_end(void)
+{
+	__sync_fetch_and_and(&part_busy, 0);
+}
+
+/*
+ * compute_partition() scratch.
+ *
+ * The excl-held cids are handed out in cid order: position 0..nr_excl-1 over
+ * the held cids is split into contiguous ranges, one per participant that gets
+ * at least one excl cid. Range k is owned by cp_range_owner[k] and ends at the
+ * cumulative position cp_range_end[k].
+ */
+static s32 cp_range_owner[MAX_PARTS];	/* exclusive range k: its owner id ... */
+static s32 cp_range_end[MAX_PARTS];	/* ... and the cumulative position it ends at */
+
+/* a participant in the partition: self or an attached child */
+struct participant {
+	s32 slot;	/* child slot, or CID_SELF */
+	u32 weight;	/* cpu.weight */
+};
+
+/**
+ * place_one - assign one excl-held cid to its owner
+ * @cid: the excl-held cid to place
+ * @n: its position among the excl-held cids, in [0, nr_excl)
+ * @total_excl:	how many positions are owned exclusively (the rest are shared)
+ *
+ * Position @n below @total_excl is owned exclusively. It falls in the range
+ * whose cumulative end it is under, owned by cp_range_owner[]. A position at or
+ * above @total_excl is the rounding leftover which joins the shared pool.
+ *
+ * A separate __noinline function to help verification.
+ */
+__noinline int place_one(s32 cid, s32 n, s32 total_excl)
+{
+	s32 owner = CID_SELF, i, s;
+
+	if (cid < 0 || cid >= SCX_QMAP_MAX_CPUS || n < 0 || n >= SCX_QMAP_MAX_CPUS ||
+	    total_excl < 0) {
+		scx_bpf_error("-ERANGE");
+		return 0;
+	}
+
+	if (n < total_excl) {
+		for (i = 0; i < MAX_PARTS; i++) {
+			if (n < cp_range_end[i]) {
+				owner = cp_range_owner[i];
+				break;
+			}
+		}
+		qa.part.cid_owner[cid] = owner;
+	} else {
+		s = n - total_excl;
+		if (s < 0 || s >= MAX_PARTS) {
+			scx_bpf_error("-ERANGE");
+			return 0;
+		}
+		qa.part.shared_cids[s] = cid;
+		/* time-shared: dispatch resolves the live holder via rr_pos */
+		qa.part.cid_owner[cid] = CID_SHARED;
+	}
+	return 0;
+}
+
+/**
+ * compute_partition - build the cid partition from this node's held caps
+ *
+ * Decide each cid's owner, the shared pool and the rr rotation. __noinline to
+ * help verification. See the comment at the top of the file.
+ */
+__noinline void compute_partition(void)
+{
+	s32 nr_cids = qa.nr_cids;
+	s32 nr_excl, total_excl = 0, nr_rr = 0;
+	s32 sum_w, i, cid, n = 0, share;
+
+	if (nr_cids > SCX_QMAP_MAX_CPUS) {
+		scx_bpf_error("-ERANGE");
+		return;
+	}
+
+	/* find out the cids we hold */
+	scx_bpf_sub_caps(0, SCX_CAP_ENQ, (void *)(long)&qa.held_excl.mask);
+	scx_bpf_sub_caps(0, SCX_CAP_ENQ_IMMED, (void *)(long)&qa.held_shared.mask);
+	cmask_andnot(&qa.held_shared.mask, &qa.held_excl.mask);	/* held only as ENQ_IMMED */
+
+	qa.part.nr_shared = 0;
+	qa.part.nr_rr = 0;
+	qa.part.rr_pos = 0;
+
+	nr_excl = cmask_weight(&qa.held_excl.mask);
+	qa.part.nr_excl = nr_excl;
+
+	/* no excl cid: held_shared stays self-local, the rest unheld */
+	if (!nr_excl) {
+		bpf_for(cid, 0, nr_cids) {
+			if (cmask_test(cid, &qa.held_shared.mask))
+				qa.part.cid_owner[cid] = CID_SELF;
+			else
+				qa.part.cid_owner[cid] = CID_NONE;
+		}
+		return;
+	}
+
+	/*
+	 * Participants are self plus each child. Give each a fixed range/rr
+	 * slot: self at slot 0, child i at slot i+1.
+	 *
+	 * sum_w totals every participant's weight.
+	 */
+	sum_w = qa.self_weight ?: 100;
+	bpf_for(i, 0, MAX_SUB_SCHEDS) {
+		barrier_var(sum_w);
+		if (qa.sub_sched_ctxs[i].cgroup_id)
+			sum_w += qa.sub_sched_ctxs[i].weight ?: 100;
+	}
+
+	/*
+	 * Split [0, nr_excl) into one contiguous range per participant, each
+	 * the floor of its weight share. cp_range_owner[]/cp_range_end[] record
+	 * each range's owner and cumulative end, total_excl counts the
+	 * exclusive slots, and the rest (nr_excl - total_excl) are shared.
+	 * rr_slots[] lists every participant for the round-robin.
+	 */
+	share = (u64)nr_excl * (qa.self_weight ?: 100) / sum_w;
+	total_excl += share;
+	cp_range_owner[0] = CID_SELF;
+	cp_range_end[0] = total_excl;
+	qa.part.rr_slots[nr_rr++] = 0;		/* self holds slot 0 (cgid 0 = no grant) */
+
+	bpf_for(i, 0, MAX_SUB_SCHEDS) {
+		u64 cgid = qa.sub_sched_ctxs[i].cgroup_id;
+		s32 w = cgid ? (qa.sub_sched_ctxs[i].weight ?: 100) : 0;
+
+		barrier_var(total_excl);
+		share = (u64)nr_excl * w / sum_w;
+		total_excl += share;
+		cp_range_owner[i + 1] = cgid ? i : CID_NONE;
+		cp_range_end[i + 1] = total_excl;
+
+		if (cgid) {
+			barrier_var(nr_rr);
+			if (nr_rr < 0 || nr_rr >= MAX_PARTS) {
+				scx_bpf_error("-ERANGE");
+				return;
+			}
+			qa.part.rr_slots[nr_rr++] = cgid;
+		}
+	}
+
+	/* assign each cid: held-excl by position, the rest self/none */
+	bpf_for(cid, 0, nr_cids) {
+		if (cmask_test(cid, &qa.held_excl.mask)) {
+			place_one(cid, n, total_excl);
+			n++;
+			barrier_var(n);
+		} else if (cmask_test(cid, &qa.held_shared.mask)) {
+			qa.part.cid_owner[cid] = CID_SELF;	/* time-share, self-local */
+		} else {
+			qa.part.cid_owner[cid] = CID_NONE;	/* not held */
+		}
+	}
+
+	qa.part.nr_shared = nr_excl - total_excl;
+	qa.part.nr_rr = nr_rr;
+}
+
+/*
+ * Charge elapsed wall time to each cid's current owner. Runs under the
+ * partition guard before every ownership change and from the stats flush, so
+ * alloc_ns[] reflects the layout that was in effect. Shared-pool time is
+ * charged to the live round-robin holder.
+ */
+static __noinline void account_alloc(void)
+{
+	u64 now = bpf_ktime_get_ns();
+	s32 rr_owner = CID_SELF;
+	s32 nr_cids = qa.nr_cids;
+	u64 delta;
+	s32 cid, i;
+
+	if (nr_cids < 0 || nr_cids > SCX_QMAP_MAX_CPUS) {
+		scx_bpf_error("-ERANGE");
+		return;
+	}
+
+	/* first call starts the clock */
+	if (!qa.alloc_ts) {
+		qa.alloc_ts = now;
+		return;
+	}
+	delta = now - qa.alloc_ts;
+	qa.alloc_ts = now;
+	qa.alloc_window_ns += delta;
+
+	/* resolve the live shared-pool holder to an owner id */
+	if (qa.part.nr_shared && qa.part.nr_rr) {
+		u32 pos = qa.part.rr_pos;
+		u64 cgid = pos < MAX_PARTS ? qa.part.rr_slots[pos] : 0;
+
+		if (cgid) {
+			rr_owner = CID_NONE;
+			bpf_for(i, 0, MAX_SUB_SCHEDS)
+				if (qa.sub_sched_ctxs[i].cgroup_id == cgid)
+					rr_owner = i;
+		}
+	}
+
+	bpf_for(cid, 0, nr_cids) {
+		s32 owner = qa.part.cid_owner[cid];
+
+		if (owner == CID_SHARED)
+			owner = rr_owner;
+		if (owner >= 0 && owner < MAX_SUB_SCHEDS)
+			qa.alloc_ns[owner] += delta;
+		else if (owner == CID_SELF)
+			qa.self_alloc_ns += delta;
+	}
+}
+
+/*
+ * apply_partition - execute the plan compute_partition() built
+ *
+ * Turn the owner map into the per-child, shared and self cmasks and issue the
+ * grant/revoke kfuncs as a delta against each child's previous grant. If no
+ * excl cid, evict every child.
+ */
+__noinline void apply_partition(void)
+{
+	s32 nr_cids = qa.nr_cids;
+	s32 nr_shared = qa.part.nr_shared;
+	s32 i, cid;
+
+	/* close out the outgoing layout before ownership changes */
+	account_alloc();
+
+	if (nr_cids < 0 || nr_cids > SCX_QMAP_MAX_CPUS ||
+	    nr_shared < 0 || nr_shared > MAX_PARTS) {
+		scx_bpf_error("-ERANGE");
+		return;
+	}
+
+	/* no excl cpu: run own tasks on the held shares, evict children */
+	if (!qa.part.nr_excl) {
+		cmask_copy(&qa.self_cids.mask, &qa.held_shared.mask);
+		bpf_for(i, 0, MAX_SUB_SCHEDS)
+			if (qa.sub_sched_ctxs[i].cgroup_id)
+				scx_bpf_sub_kill(qa.sub_sched_ctxs[i].cgroup_id,
+						 "parent holds no excl cpu to distribute");
+		return;
+	}
+
+	/*
+	 * Snapshot the old pool. The per-child revoke below clears ENQ_IMMED on
+	 * the previously-granted pool, so a cid that left the pool (now a
+	 * sibling's excl) doesn't keep a stale ENQ_IMMED on its last holder.
+	 */
+	cmask_copy(&qa.prev_rr_cids.mask, &qa.rr_cids.mask);
+
+	/* turn the owner map into the rr pool, per-child excl, and self sets */
+	cmask_init(&qa.rr_cids.mask, 0, nr_cids);
+	cmask_init(&qa.self_cids.mask, 0, nr_cids);
+
+	/* snapshot each child's grant, then rebuild the new sets below */
+	bpf_for(i, 0, MAX_SUB_SCHEDS) {
+		cmask_copy(&qa.sub_sched_ctxs[i].prev_granted.mask,
+			   &qa.sub_sched_ctxs[i].granted_cids.mask);
+		cmask_init(&qa.sub_sched_ctxs[i].granted_cids.mask, 0, nr_cids);
+	}
+
+	bpf_for(i, 0, nr_shared)
+		cmask_set(qa.part.shared_cids[i], &qa.rr_cids.mask);
+	bpf_for(cid, 0, nr_cids) {
+		s32 o = qa.part.cid_owner[cid];
+
+		if (cmask_test(cid, &qa.rr_cids.mask))
+			continue;
+		if (o >= 0 && o < MAX_SUB_SCHEDS)
+			cmask_set(cid, &qa.sub_sched_ctxs[o].granted_cids.mask);
+		else if (o == CID_SELF)
+			cmask_set(cid, &qa.self_cids.mask);
+	}
+
+	/*
+	 * Apply each child's exclusive cids as a delta against its previous
+	 * grant. Separately clear the previous shared grant (ENQ_IMMED on the
+	 * old pool), covering cids still pooled and cids that left for a
+	 * sibling's excl. The current holder is granted the new pool below.
+	 */
+	bpf_for(i, 0, MAX_SUB_SCHEDS) {
+		struct sub_sched_ctx __arena *ssc = &qa.sub_sched_ctxs[i];
+		u64 cgid = ssc->cgroup_id;
+
+		if (!cgid)
+			continue;
+
+		cmask_copy(&qa.to_revoke_cids.mask, &ssc->prev_granted.mask);
+		cmask_andnot(&qa.to_revoke_cids.mask, &ssc->granted_cids.mask);
+		cmask_copy(&qa.to_grant_cids.mask, &ssc->granted_cids.mask);
+		cmask_andnot(&qa.to_grant_cids.mask, &ssc->prev_granted.mask);
+
+		scx_bpf_sub_revoke(cgid, SCX_CAP_ENQ_IMMED,
+				   (void *)(long)&qa.prev_rr_cids.mask);
+		scx_bpf_sub_revoke(cgid, SCX_CAP_ENQ | SCX_CAP_PREEMPT | SCX_CAP_ENQ_IMMED,
+				   (void *)(long)&qa.to_revoke_cids.mask);
+		scx_bpf_sub_grant(cgid, SCX_CAP_ENQ | SCX_CAP_PREEMPT | SCX_CAP_ENQ_IMMED,
+				  (void *)(long)&qa.to_grant_cids.mask, NULL);
+	}
+
+	/* the current holder of the shared pool gets ENQ_IMMED on all of it */
+	if (nr_shared) {
+		s32 pos = qa.part.rr_pos;
+		u64 holder_cgid;
+
+		if (pos < 0 || pos >= MAX_PARTS) {
+			scx_bpf_error("-ERANGE");
+			return;
+		}
+
+		holder_cgid = qa.part.rr_slots[pos];	/* 0 = self, nothing to grant */
+		if (holder_cgid)
+			scx_bpf_sub_grant(holder_cgid, SCX_CAP_ENQ_IMMED,
+					  (void *)(long)&qa.rr_cids.mask, NULL);
+	}
+}
+
+/*
+ * Recompute the split off the node's held caps and apply it. The contexts this
+ * runs from (the sub-sched callbacks, the userspace poke, the rr timer) are not
+ * serialized by the kernel, so a single runner does the work. A caller that
+ * finds the guard held leaves part_pending set; the holder drains it before
+ * releasing, with the rr timer as a backstop.
+ */
+static void redistribute(void)
+{
+	s32 i;
+
+	__sync_fetch_and_or(&part_pending, 1);
+
+	if (!part_try_start())
+		return;
+
+	bpf_for(i, 0, 1024) {
+		__sync_fetch_and_and(&part_pending, 0);
+		compute_partition();
+		apply_partition();
+		if (!__sync_fetch_and_or(&part_pending, 0))
+			break;
+	}
+
+	part_end();
+}
+
+/* userspace pokes this (PROG_RUN) to resplit after a cpu.weight change */
+SEC("syscall")
+int repartition(void *ctx)
+{
+	redistribute();
+	return 0;
+}
+
+/*
+ * Userspace pokes this (PROG_RUN) to bring alloc_ns[] current before reading
+ * it for the stats display. Skipping when the partition guard is held is
+ * fine - alloc_ts is untouched, so the elapsed time is charged next time.
+ */
+SEC("syscall")
+int flush_alloc(void *ctx)
+{
+	if (part_try_start()) {
+		account_alloc();
+		part_end();
+	}
+	return 0;
+}
+
+/*
+ * Hand the shared pool to the next participant in the rotation. Self's turn
+ * just revokes the pool back to this sched. A child's turn grants it ENQ_IMMED
+ * on the entire pool. As only excl-held cids are time-shared, a wall-clock
+ * rotation works. Driven by the round-robin timer.
+ */
+static void rr_advance(void)
+{
+	s32 nr_shared, old_pos, new_pos;
+	u64 old_cgid, new_cgid;
+	u32 nr_rr;		/* unsigned for % */
+
+	/* a redistribute holds the partition and rebuilds the pool, so skip */
+	if (!part_try_start())
+		return;
+
+	nr_rr = qa.part.nr_rr;
+	nr_shared = qa.part.nr_shared;
+
+	if (nr_shared < 0 || nr_shared > MAX_PARTS) {
+		scx_bpf_error("-ERANGE");
+		return;
+	}
+
+	if (nr_shared && nr_rr >= 2) {
+		/* close out the outgoing holder's pool time */
+		account_alloc();
+
+		old_pos = qa.part.rr_pos;
+		new_pos = (old_pos + 1) % nr_rr;
+		old_cgid = qa.part.rr_slots[old_pos];
+		new_cgid = qa.part.rr_slots[new_pos];
+		qa.part.rr_pos = new_pos;
+
+		/*
+		 * Move the ENQ_IMMED cap to the next participant. The shared
+		 * cids stay marked CID_SHARED. qmap_dispatch() resolves the
+		 * live holder via rr_pos without the guard, so a dispatch
+		 * racing this handoff may reenqueue a task once. Harmless for a
+		 * time-share.
+		 */
+		if (old_cgid)
+			scx_bpf_sub_revoke(old_cgid, SCX_CAP_ENQ_IMMED,
+					   (void *)(long)&qa.rr_cids.mask);
+		if (new_cgid)
+			scx_bpf_sub_grant(new_cgid, SCX_CAP_ENQ_IMMED,
+					  (void *)(long)&qa.rr_cids.mask, NULL);
+	}
+
+	part_end();
+
+	/* a resplit queued while we held the guard supersedes this rotation */
+	if (__sync_fetch_and_or(&part_pending, 0))
+		redistribute();
+}
+
+/* advance the time-shared cid pool every round_robin_ns */
+static int round_robin_timerfn(void *map, int *key, struct bpf_timer *timer)
+{
+	rr_advance();
+	bpf_timer_start(timer, round_robin_ns, 0);
+	return 0;
+}
+
 /*
  * Custom cid layout for the cid-override test. On invalid input the kfunc
  * scx_error()s and aborts the scheduler.
@@ -1142,16 +1696,43 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 	}
 	qa.task_free_head = (task_ctx_t *)slab;
 
+	/* cache the cid count, trusted to be <= SCX_QMAP_MAX_CPUS hereafter */
+	qa.nr_cids = nr_cids;
+
+	/* cmasks are embedded in qa, so they only need initializing */
+	cmask_init(&qa.idle_cids.mask, 0, nr_cids);
+	cmask_init(&qa.rr_cids.mask, 0, nr_cids);
+	cmask_init(&qa.prev_rr_cids.mask, 0, nr_cids);
+	cmask_init(&qa.self_cids.mask, 0, nr_cids);
+	cmask_init(&qa.to_revoke_cids.mask, 0, nr_cids);
+	cmask_init(&qa.to_grant_cids.mask, 0, nr_cids);
+	cmask_init(&qa.held_excl.mask, 0, nr_cids);
+	cmask_init(&qa.held_shared.mask, 0, nr_cids);
+
+	scx_bpf_sub_caps(0, SCX_CAP_ENQ, (void *)(long)&qa.held_excl.mask);
+	scx_bpf_sub_caps(0, SCX_CAP_ENQ_IMMED, (void *)(long)&qa.held_shared.mask);
+	cmask_andnot(&qa.held_shared.mask, &qa.held_excl.mask);
+
+	bpf_for(i, 0, MAX_SUB_SCHEDS) {
+		cmask_init(&qa.sub_sched_ctxs[i].granted_cids.mask, 0, nr_cids);
+		cmask_init(&qa.sub_sched_ctxs[i].prev_granted.mask, 0, nr_cids);
+	}
+
 	/*
-	 * Allocate and initialize the idle cmask. Starts empty - update_idle
-	 * fills it as cpus enter idle.
+	 * The root starts holding every cid. qmap_sub_ecaps_updated() maintains
+	 * per-cid shared state as effective caps settle, and redistribute()
+	 * rebuilds owner and self from held caps. A non-root node starts with
+	 * nothing.
 	 */
-	qa_idle_cids = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
-	if (!qa_idle_cids) {
-		scx_bpf_error("failed to allocate idle cmask");
-		return -ENOMEM;
+	bpf_for(i, 0, nr_cids) {
+		if (!sub_cgroup_id) {
+			cmask_set(i, &qa.self_cids.mask);
+			qa.part.cid_owner[i] = CID_SELF;
+		} else {
+			qa.part.cid_owner[i] = CID_NONE;
+		}
 	}
-	cmask_init(qa_idle_cids, 0, nr_cids);
+	qa.part.nr_shared = 0;
 
 	ret = scx_bpf_create_dsq(SHARED_DSQ, -1);
 	if (ret) {
@@ -1190,6 +1771,16 @@ s32 BPF_STRUCT_OPS_SLEEPABLE(qmap_init)
 			return ret;
 	}
 
+	/* sub-sched: drive the boundary-cid round-robin from a bpf timer */
+	timer = bpf_map_lookup_elem(&round_robin_timer, &key);
+	if (!timer)
+		return -ESRCH;
+	bpf_timer_init(timer, &round_robin_timer, CLOCK_MONOTONIC);
+	bpf_timer_set_callback(timer, round_robin_timerfn);
+	ret = bpf_timer_start(timer, round_robin_ns, 0);
+	if (ret)
+		return ret;
+
 	return 0;
 }
 
@@ -1202,13 +1793,20 @@ s32 BPF_STRUCT_OPS(qmap_sub_attach, struct scx_sub_attach_args *args)
 {
 	s32 i;
 
+	/* as long as there is at least one excl cpu, children can attach */
+	if (!cmask_weight(&qa.held_excl.mask))
+		return -ENOSPC;
+
 	for (i = 0; i < MAX_SUB_SCHEDS; i++) {
-		if (!qa.sub_sched_cgroup_ids[i]) {
-			qa.sub_sched_cgroup_ids[i] = args->ops->sub_cgroup_id;
-			bpf_printk("attaching sub-sched[%d] on %s",
-				   i, args->cgroup_path);
-			return 0;
-		}
+		if (qa.sub_sched_ctxs[i].cgroup_id)
+			continue;
+
+		qa.sub_sched_ctxs[i].cgroup_id = args->ops->sub_cgroup_id;
+		qa.sub_sched_ctxs[i].weight = 100;	/* until userspace feeds it */
+		qa.nr_sub_scheds++;
+		bpf_printk("attaching sub-sched[%d] on %s", i, args->cgroup_path);
+		redistribute();
+		return 0;
 	}
 
 	return -ENOSPC;
@@ -1219,12 +1817,37 @@ void BPF_STRUCT_OPS(qmap_sub_detach, struct scx_sub_detach_args *args)
 	s32 i;
 
 	for (i = 0; i < MAX_SUB_SCHEDS; i++) {
-		if (qa.sub_sched_cgroup_ids[i] == args->ops->sub_cgroup_id) {
-			qa.sub_sched_cgroup_ids[i] = 0;
-			bpf_printk("detaching sub-sched[%d] on %s",
-				   i, args->cgroup_path);
-			break;
-		}
+		if (qa.sub_sched_ctxs[i].cgroup_id != args->ops->sub_cgroup_id)
+			continue;
+
+		qa.sub_sched_ctxs[i].cgroup_id = 0;
+		qa.sub_sched_ctxs[i].weight = 100;
+		cmask_init(&qa.sub_sched_ctxs[i].granted_cids.mask, 0, qa.nr_cids);
+		qa.nr_sub_scheds--;
+		bpf_printk("detaching sub-sched[%d] on %s", i, args->cgroup_path);
+		redistribute();
+		break;
+	}
+}
+
+void BPF_STRUCT_OPS(qmap_sub_caps_updated, const struct scx_cmask *cmask, u64 caps)
+{
+	/* our held caps changed, redistribute */
+	redistribute();
+}
+
+void BPF_STRUCT_OPS(qmap_sub_ecaps_updated, s32 cid, u64 before, u64 after)
+{
+	/*
+	 * Effective caps updated. Track which cids it holds shared, so a self
+	 * task placed there enqueues IMMED, and drop a lost cid from idle
+	 * tracking.
+	 */
+	if (after & SCX_CAP_ENQ_IMMED) {
+		qa.cid_shared[cid] = (after & SCX_CAP_ENQ) ? 0 : 1;
+	} else {
+		qa.cid_shared[cid] = 0;
+		cmask_clear(cid, &qa.idle_cids.mask);
 	}
 }
 
@@ -1248,6 +1871,8 @@ SCX_OPS_CID_DEFINE(qmap_ops,
 	       .cgroup_set_bandwidth	= (void *)qmap_cgroup_set_bandwidth,
 	       .sub_attach		= (void *)qmap_sub_attach,
 	       .sub_detach		= (void *)qmap_sub_detach,
+	       .sub_caps_updated	= (void *)qmap_sub_caps_updated,
+	       .sub_ecaps_updated	= (void *)qmap_sub_ecaps_updated,
 	       .init_cids		= (void *)qmap_init_cids,
 	       .init			= (void *)qmap_init,
 	       .exit			= (void *)qmap_exit,
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 9124183bffec..1efffaaa8fe8 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -4,6 +4,9 @@
  * Copyright (c) 2022 Tejun Heo <tj@kernel.org>
  * Copyright (c) 2022 David Vernet <dvernet@meta.com>
  */
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
@@ -12,19 +15,44 @@
 #include <libgen.h>
 #include <sys/mman.h>
 #include <sys/stat.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <string.h>
+#include <time.h>
 #include <bpf/bpf.h>
 #include <scx/common.h>
 #include "scx_qmap.h"
 #include "scx_qmap.bpf.skel.h"
 
+/* kernfs file-handle type for open_by_handle_at(), from linux/exportfs.h */
+#ifndef FILEID_KERNFS
+#define FILEID_KERNFS	0xfe
+#endif
+
 const char help_fmt[] =
 "A simple five-level FIFO queue sched_ext scheduler.\n"
 "\n"
-"See the top-level comment in .bpf.c for more details.\n"
+"It also demonstrates hierarchical sub-scheduling: a scheduler can hand some\n"
+"of its cpus to a child cgroup that runs its own scheduler. Run one qmap as\n"
+"the parent, then run another qmap on a child cgroup with -c to attach it\n"
+"beneath the parent.\n"
+"\n"
+"The policy below is deliberately simplistic and the resulting behavior can\n"
+"look odd. qmap is a demo: it exists to exercise every sub-scheduling\n"
+"primitive the kernel offers with as little code as possible, not to schedule\n"
+"well.\n"
+"\n"
+"A parent divides the full cpus it holds among itself and its children in\n"
+"proportion to cpu.weight. The cpus left over by rounding are time-shared,\n"
+"handed to each participant in turn every -R ms. A cpu a scheduler only\n"
+"holds a time-share of is never handed further down, and a parent left with\n"
+"no full cpu of its own shuts its children down.\n"
+"\n"
+"See the top-of-file comment in .bpf.c for the design.\n"
 "\n"
 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
 "       [-N COUNT] [-P] [-M] [-H] [-c CG_PATH] [-d PID] [-D LEN] [-S] [-p] [-I]\n"
-"       [-F COUNT] [-v]\n"
+"       [-F COUNT] [-i SEC] [-R MS] [-v]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -44,6 +72,8 @@ const char help_fmt[] =
 "  -I            Turn on SCX_OPS_ALWAYS_ENQ_IMMED\n"
 "  -F COUNT      IMMED stress: force every COUNT'th enqueue to a busy local DSQ (use with -I)\n"
 "  -C MODE       cid-override test (shuffle|bad-dup|bad-range|bad-mono)\n"
+"  -i SEC        Stats and weight-refresh interval, seconds (default 5)\n"
+"  -R MS         Round-robin period for time-shared cpus, ms (default 200)\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 
@@ -62,6 +92,213 @@ static void sigint_handler(int dummy)
 	exit_req = 1;
 }
 
+/*
+ * Open a cgroup directory directly from its id. In cgroup2 the cgroup id is the
+ * kernfs node id, so a FILEID_KERNFS handle built from the id resolves to the
+ * directory via open_by_handle_at() against the cgroup mount.
+ */
+static int open_cgroup_by_id(u64 cgid)
+{
+	static int mnt_fd = -1;
+	struct {
+		struct file_handle fh;
+		u64 id;
+	} h;
+
+	if (mnt_fd < 0) {
+		mnt_fd = open("/sys/fs/cgroup", O_RDONLY | O_DIRECTORY);
+		if (mnt_fd < 0)
+			return -1;
+	}
+	h.fh.handle_bytes = sizeof(h.id);
+	h.fh.handle_type = FILEID_KERNFS;
+	h.id = cgid;
+	return open_by_handle_at(mnt_fd, &h.fh, O_RDONLY | O_DIRECTORY);
+}
+
+/* read a cgroup's cpu.weight (1-10000) by id, 0 if unavailable */
+static u32 read_cgroup_weight(u64 cgid)
+{
+	char buf[32];
+	int dfd, wfd;
+	u32 w = 0;
+	ssize_t n;
+
+	dfd = open_cgroup_by_id(cgid);
+	if (dfd < 0)
+		return 0;
+	wfd = openat(dfd, "cpu.weight", O_RDONLY);
+	close(dfd);
+	if (wfd < 0)
+		return 0;
+	n = read(wfd, buf, sizeof(buf) - 1);
+	close(wfd);
+	if (n > 0) {
+		buf[n] = '\0';
+		w = strtoul(buf, NULL, 10);
+	}
+	return w;
+}
+
+/* read each direct child's cpu.weight into the arena, true if any changed */
+static bool feed_weights(struct qmap_arena *qa)
+{
+	bool changed = false;
+	int i;
+
+	for (i = 0; i < MAX_SUB_SCHEDS; i++) {
+		u64 cgid = qa->sub_sched_ctxs[i].cgroup_id;
+		u32 w;
+
+		if (!cgid)
+			continue;
+		w = read_cgroup_weight(cgid);
+		if (w && w != qa->sub_sched_ctxs[i].weight) {
+			qa->sub_sched_ctxs[i].weight = w;
+			changed = true;
+		}
+	}
+	return changed;
+}
+
+static void invoke_repartition(struct scx_qmap *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+
+	bpf_prog_test_run_opts(bpf_program__fd(skel->progs.repartition), &opts);
+}
+
+static void invoke_flush_alloc(struct scx_qmap *skel)
+{
+	LIBBPF_OPTS(bpf_test_run_opts, opts);
+
+	bpf_prog_test_run_opts(bpf_program__fd(skel->progs.flush_alloc), &opts);
+}
+
+/* previous counter snapshots for the per-interval hier stats */
+struct hier_prev {
+	u64 alloc_ns[MAX_SUB_SCHEDS];
+	u64 self_alloc_ns;
+	u64 alloc_window_ns;
+	u64 nr_dsps[MAX_SUB_SCHEDS];
+	u64 nr_reenq_cap;
+	u64 nr_reenq_immed;
+};
+
+/* current wall-clock time as "HH:MM:SS" for the startup and interval headers */
+static const char *tstamp(char *buf, size_t sz)
+{
+	time_t now = time(NULL);
+
+	strftime(buf, sz, "%H:%M:%S", localtime(&now));
+	return buf;
+}
+
+/* format the cids whose cid_owner[] matches @owner as "0-3,8", "-" if none */
+static void format_cid_ranges(struct qmap_arena *qa, s32 owner, char *buf, size_t sz)
+{
+	u32 nr = qa->nr_cids, cid;
+	size_t off = 0;
+	s32 start = -1;
+
+	buf[0] = '\0';
+	for (cid = 0; cid <= nr; cid++) {
+		bool match = cid < nr && qa->part.cid_owner[cid] == owner;
+		int n;
+
+		if (match) {
+			if (start < 0)
+				start = cid;
+			continue;
+		}
+		if (start < 0)
+			continue;
+
+		if (start == (s32)cid - 1)
+			n = snprintf(buf + off, sz - off, "%s%d",
+				     off ? "," : "", start);
+		else
+			n = snprintf(buf + off, sz - off, "%s%d-%d",
+				     off ? "," : "", start, cid - 1);
+		if (n < 0 || (size_t)n >= sz - off) {
+			strcpy(&buf[sz - 4], "...");
+			return;
+		}
+		off += n;
+		start = -1;
+	}
+	if (!off)
+		strcpy(buf, "-");
+}
+
+/* partition summary + one row per sched: weight, cpus, dispatch rate, cids */
+static void print_hier(struct qmap_arena *qa, struct hier_prev *prev, u64 own_cgid)
+{
+	char ranges[128], who[16];
+	const char *rr = "-";
+	double secs;
+	u32 i;
+
+	/*
+	 * account_alloc() bumps alloc_window_ns together with the per-owner
+	 * counters, so dividing by the same window yields exact cid counts.
+	 */
+	secs = (qa->alloc_window_ns - prev->alloc_window_ns) / 1e9;
+	prev->alloc_window_ns = qa->alloc_window_ns;
+
+	/* resolve the live shared-pool holder */
+	if (qa->part.nr_shared && qa->part.nr_rr) {
+		u64 cgid = qa->part.rr_slots[qa->part.rr_pos];
+
+		rr = "self";
+		if (cgid) {
+			rr = "?";
+			for (i = 0; i < MAX_SUB_SCHEDS; i++) {
+				if (qa->sub_sched_ctxs[i].cgroup_id == cgid) {
+					snprintf(who, sizeof(who), "sub%u", i);
+					rr = who;
+					break;
+				}
+			}
+		}
+	}
+
+	format_cid_ranges(qa, CID_SHARED, ranges, sizeof(ranges));
+	printf("hier   : nsub=%llu excl=%u shared=%s rr=%s reenq cap/immed +%llu/+%llu\n",
+	       (unsigned long long)qa->nr_sub_scheds, qa->part.nr_excl, ranges, rr,
+	       (unsigned long long)(qa->nr_reenq_cap - prev->nr_reenq_cap),
+	       (unsigned long long)(qa->nr_reenq_immed - prev->nr_reenq_immed));
+	prev->nr_reenq_cap = qa->nr_reenq_cap;
+	prev->nr_reenq_immed = qa->nr_reenq_immed;
+
+	printf("hier   : %-4s %10s %4s %6s %8s  %s\n",
+	       "", "cgroup", "w", "alloc", "disp/s", "cids");
+
+	format_cid_ranges(qa, CID_SELF, ranges, sizeof(ranges));
+	printf("hier   : %-4s %10llu %4u %6.2f %8s  %s\n", "self",
+	       (unsigned long long)own_cgid, qa->self_weight,
+	       secs > 0 ? (qa->self_alloc_ns - prev->self_alloc_ns) / (secs * 1e9) : 0.0,
+	       "-", ranges);
+	prev->self_alloc_ns = qa->self_alloc_ns;
+
+	for (i = 0; i < MAX_SUB_SCHEDS; i++) {
+		struct sub_sched_ctx *sc = &qa->sub_sched_ctxs[i];
+
+		if (!sc->cgroup_id)
+			continue;
+
+		snprintf(who, sizeof(who), "sub%u", i);
+		format_cid_ranges(qa, i, ranges, sizeof(ranges));
+		printf("hier   : %-4s %10llu %4u %6.2f %8.1f  %s\n", who,
+		       (unsigned long long)sc->cgroup_id, sc->weight,
+		       secs > 0 ? (qa->alloc_ns[i] - prev->alloc_ns[i]) / (secs * 1e9) : 0.0,
+		       secs > 0 ? (sc->nr_dsps - prev->nr_dsps[i]) / secs : 0.0,
+		       ranges);
+		prev->alloc_ns[i] = qa->alloc_ns[i];
+		prev->nr_dsps[i] = sc->nr_dsps;
+	}
+}
+
 int main(int argc, char **argv)
 {
 	struct scx_qmap *skel;
@@ -69,7 +306,11 @@ int main(int argc, char **argv)
 	struct qmap_arena *qa;
 	u32 test_error_cnt = 0;
 	u64 ecode;
-	int opt;
+	int opt, stats_intv = 5, i, round_robin_ms = 200;
+	struct hier_prev hprev = {};
+	const char *sub_cg_path = NULL;
+	char tbuf[32];
+	u64 own_cgid = 0;
 
 	libbpf_set_print(libbpf_print_fn);
 	signal(SIGINT, sigint_handler);
@@ -89,7 +330,7 @@ int main(int argc, char **argv)
 	skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
 	skel->rodata->max_tasks = 16384;
 
-	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:C:vh")) != -1) {
+	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:C:i:R:vh")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -129,6 +370,8 @@ int main(int argc, char **argv)
 			}
 			skel->struct_ops.qmap_ops->sub_cgroup_id = st.st_ino;
 			skel->rodata->sub_cgroup_id = st.st_ino;
+			own_cgid = st.st_ino;
+			sub_cg_path = optarg;
 			break;
 		}
 		case 'd':
@@ -211,6 +454,16 @@ int main(int argc, char **argv)
 			}
 			break;
 		}
+		case 'i':
+			stats_intv = atoi(optarg);
+			if (stats_intv < 1)
+				stats_intv = 1;
+			break;
+		case 'R':
+			round_robin_ms = atoi(optarg);
+			if (round_robin_ms < 10)
+				round_robin_ms = 10;
+			break;
 		case 'v':
 			verbose = true;
 			break;
@@ -220,16 +473,30 @@ int main(int argc, char **argv)
 		}
 	}
 
+	skel->rodata->round_robin_ns = (u64)round_robin_ms * 1000000;
+
 	SCX_OPS_LOAD(skel, qmap_ops, scx_qmap, uei);
 	link = SCX_OPS_ATTACH(skel, qmap_ops, scx_qmap);
 
 	qa = &skel->arena->qa;
 	qa->test_error_cnt = test_error_cnt;
 
+	if (sub_cg_path)
+		printf("%s scx_qmap started: sub-scheduler on %s, stats every %ds\n",
+		       tstamp(tbuf, sizeof(tbuf)), sub_cg_path, stats_intv);
+	else
+		printf("%s scx_qmap started: root scheduler, stats every %ds\n",
+		       tstamp(tbuf, sizeof(tbuf)), stats_intv);
+	fflush(stdout);
+
 	while (!exit_req && !UEI_EXITED(skel, uei)) {
 		long nr_enqueued = qa->nr_enqueued;
 		long nr_dispatched = qa->nr_dispatched;
+		u32 self_weight;
+		bool repart;
 
+		printf("---- %s ----\n",
+		       tstamp(tbuf, sizeof(tbuf)));
 		printf("stats  : enq=%lu dsp=%lu delta=%ld reenq/cid0=%llu/%llu deq=%llu core=%llu enq_ddsp=%llu\n",
 		       nr_enqueued, nr_dispatched, nr_enqueued - nr_dispatched,
 		       (unsigned long long)qa->nr_reenqueued,
@@ -250,8 +517,27 @@ int main(int argc, char **argv)
 			       qa->cpuperf_target_min,
 			       qa->cpuperf_target_avg,
 			       qa->cpuperf_target_max);
+
+		self_weight = own_cgid ? read_cgroup_weight(own_cgid) : 100;
+		if (!self_weight)
+			self_weight = 100;
+
+		repart = feed_weights(qa);
+
+		if (self_weight != qa->self_weight) {
+			qa->self_weight = self_weight;
+			repart = true;
+		}
+
+		if (repart)
+			invoke_repartition(skel);
+
+		invoke_flush_alloc(skel);
+		print_hier(qa, &hprev, own_cgid);
 		fflush(stdout);
-		sleep(1);
+
+		for (i = 0; i < stats_intv && !exit_req && !UEI_EXITED(skel, uei); i++)
+			sleep(1);
 	}
 
 	bpf_link__destroy(link);
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index 6c3ea1fc74ed..c87d61c37fe1 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -20,6 +20,7 @@
 #endif
 
 #define MAX_SUB_SCHEDS		8
+#define MAX_PARTS		(MAX_SUB_SCHEDS + 1)	/* participants: children + self */
 
 /*
  * cpu_ctxs[] is sized to a fixed cap so the layout is shared between BPF and
@@ -27,6 +28,16 @@
  */
 #define SCX_QMAP_MAX_CPUS	1024
 
+/*
+ * An owner id identifies who holds a cid: a child slot in [0, MAX_SUB_SCHEDS),
+ * CID_SELF for this node, CID_NONE for a cid not currently held, or CID_SHARED
+ * for a cid in the round-robin pool (its live holder is rr_slots[rr_pos]). Used
+ * by the partition's cid_owner[].
+ */
+#define CID_SELF	(-1)
+#define CID_NONE	(-2)
+#define CID_SHARED	(-3)
+
 /* -C cid-override test modes. Selects cid_override_mode in scx_qmap.bpf.c. */
 enum qmap_cid_override {
 	QMAP_CID_OVR_OFF	= 0,	/* disabled */
@@ -43,15 +54,57 @@ struct cpu_ctx {
 	u32 cpuperf_target;
 };
 
-/* Opaque to userspace; defined in scx_qmap.bpf.c. */
-struct task_ctx;
-
 struct qmap_fifo {
 	struct task_ctx __arena *head;
 	struct task_ctx __arena *tail;
 	s32 idx;
 };
 
+/*
+ * scx_cmask's are embedded in struct qmap_arena with inline backing storage.
+ * The bpf side uses &field.mask with the normal cmask_* helpers. Userspace
+ * doesn't have access to the type definition and sees same-sized opaque words.
+ * _Static_assert()'s in .bpf.c ensure that they are in sync.
+ */
+#define QMAP_CMASK_WORDS	(((SCX_QMAP_MAX_CPUS) + 63) / 64 + 1)
+struct qmap_cmask {
+#ifdef __BPF__
+	union {
+		struct scx_cmask mask;
+		u64 words[QMAP_CMASK_WORDS + 2];
+	};
+#else
+	u64 words[QMAP_CMASK_WORDS + 2];
+#endif
+};
+
+/* Opaque to userspace; defined in scx_qmap.bpf.c. */
+struct task_ctx;
+
+/* per-direct-child state for the sub-scheduler */
+struct sub_sched_ctx {
+	u64 cgroup_id;
+	u32 weight;			/* cpu.weight, fed by userspace */
+	u64 nr_dsps;
+	struct qmap_cmask granted_cids;	/* cids granted excl to this child */
+	struct qmap_cmask prev_granted;	/* last grant, for delta calculation */
+};
+
+/*
+ * compute_partition() builds the following from this node's held caps, and
+ * apply_partition()/rr_advance() execute it. Userspace only reads for the
+ * hierarchy display.
+ */
+struct qmap_partition {
+	u32 nr_excl;			/* number of excl-held (delegatable) cids */
+	s32 cid_owner[SCX_QMAP_MAX_CPUS]; /* per cid: owner id, or CID_NONE */
+	s32 shared_cids[MAX_PARTS];	/* the round-robin cid pool */
+	u32 nr_shared;			/* number of shared_cids entries */
+	u64 rr_slots[MAX_PARTS];	/* rotation order: holder cgroup_id, 0 = self */
+	u32 nr_rr;			/* number of rr_slots entries */
+	u32 rr_pos;			/* current rotation index */
+};
+
 struct qmap_arena {
 	/* userspace-visible stats */
 	u64 nr_enqueued, nr_dispatched, nr_reenqueued, nr_reenqueued_cid0;
@@ -65,7 +118,6 @@ struct qmap_arena {
 	u32 cpuperf_target_min, cpuperf_target_avg, cpuperf_target_max;
 
 	/* kernel-side runtime state */
-	u64 sub_sched_cgroup_ids[MAX_SUB_SCHEDS];
 	u64 core_sched_head_seqs[5];
 	u64 core_sched_tail_seqs[5];
 
@@ -77,6 +129,44 @@ struct qmap_arena {
 
 	/* five priority FIFOs, each a doubly-linked list through task_ctx */
 	struct qmap_fifo fifos[5];
+
+	/*
+	 * Hierarchical sub-scheduling state. See the design comment at the top
+	 * of scx_qmap.bpf.c.
+	 */
+	u32 nr_cids;			/* cid count, cached at init */
+
+	/* bpf-owned partition: read by userspace for display */
+	struct qmap_partition part;
+
+	struct sub_sched_ctx sub_sched_ctxs[MAX_SUB_SCHEDS]; /* per-child context */
+	u64 nr_sub_scheds;		/* number of attached children */
+	u32 self_weight;		/* this node's cpu.weight, fed by userspace */
+
+	/* bpf-internal per-cid state */
+	u8 cid_shared[SCX_QMAP_MAX_CPUS]; /* per cid: 1 if held shared (ENQ_IMMED-only) */
+
+	/* allocated cid-time, charged per owner by account_alloc() */
+	u64 alloc_ns[MAX_SUB_SCHEDS];	/* per child slot */
+	u64 self_alloc_ns;
+	u64 alloc_ts;			/* last accounting timestamp */
+	u64 alloc_window_ns;		/* total accounted time, the alloc denominator */
+
+	/* bpf-internal cmasks (embedded, see struct qmap_cmask) */
+	struct qmap_cmask self_cids;	/* cids this node runs its own tasks on */
+	struct qmap_cmask idle_cids;	/* idle-cid tracking, scanned by direct dispatch */
+	struct qmap_cmask rr_cids;	/* the shared pool, as a mask for grant/revoke */
+
+	/* scratch cmasks */
+	struct qmap_cmask to_revoke_cids; /* delta cids to revoke */
+	struct qmap_cmask to_grant_cids; /* delta cids to grant */
+	struct qmap_cmask prev_rr_cids; /* previous shared pool, to clear stale grants */
+	struct qmap_cmask held_excl;	/* cids held excl (ENQ): delegatable */
+	struct qmap_cmask held_shared;	/* cids held shared (ENQ_IMMED only): self-local */
+
+	/* bpf -> userspace: stats */
+	u64 nr_reenq_cap;		/* SCX_TASK_REENQ_CAP bounces */
+	u64 nr_reenq_immed;		/* SCX_TASK_REENQ_IMMED bounces */
 };
 
 #endif /* __SCX_QMAP_H */
-- 
2.54.0
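
The help text in the patch above says a parent divides its full cpus in proportion to cpu.weight and time-shares whatever rounding leaves over. The sketch below illustrates that arithmetic only. It is not the patch's compute_partition(), which is not shown in this part of the thread; the name split_by_weight() and the choice of floor division are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the patch's compute_partition(): split nr_excl
 * whole cpus among participants in proportion to their weights using floor
 * division. Returns the number of cpus left over by rounding, which the
 * help text describes as the time-shared pool.
 */
static uint32_t split_by_weight(uint32_t nr_excl, const uint32_t *weights,
				size_t nr_parts, uint32_t *shares)
{
	uint64_t total = 0;
	uint32_t given = 0;
	size_t i;

	for (i = 0; i < nr_parts; i++)
		total += weights[i];

	for (i = 0; i < nr_parts; i++) {
		/* 64-bit intermediate so nr_excl * weight cannot overflow */
		shares[i] = total ?
			(uint32_t)((uint64_t)nr_excl * weights[i] / total) : 0;
		given += shares[i];
	}

	/* floor division never hands out more than nr_excl */
	return nr_excl - given;
}
```

With floor division the leftover is always smaller than the number of participants, so under this assumption the time-shared pool stays small relative to the exclusive shares.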



* [PATCH sched_ext/for-7.3 32/32] tools/sched_ext: scx_qmap - Add sub-sched cap fault injection
  2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
                   ` (30 preceding siblings ...)
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling Tejun Heo
@ 2026-07-03  8:01 ` Tejun Heo
  31 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-03  8:01 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Add a fault-injection mode (-J wrong-cid) to the scx_qmap sub-scheduler that
deliberately dispatches one of its own tasks to a cid it does not hold. The
kernel cap check must reject the dispatch and re-enqueue the task with
SCX_TASK_REENQ_CAP, so nr_inject_attempts should track nr_reenq_cap one to
one. This exercises delivery-time cap enforcement.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 tools/sched_ext/scx_qmap.bpf.c | 40 ++++++++++++++++++++++++++++++++++
 tools/sched_ext/scx_qmap.c     | 20 +++++++++++++----
 tools/sched_ext/scx_qmap.h     |  8 +++++++
 3 files changed, 64 insertions(+), 4 deletions(-)

diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
index 938a32514b2f..9a93a68c0d42 100644
--- a/tools/sched_ext/scx_qmap.bpf.c
+++ b/tools/sched_ext/scx_qmap.bpf.c
@@ -376,6 +376,24 @@ static u64 needs_immed(s32 cid)
 	return qa.cid_shared[cid] ? SCX_ENQ_IMMED : 0;
 }
 
+/* first cid this node does NOT hold for fault injection, -1 if none */
+static s32 first_unavail_cid(void)
+{
+	s32 nr_cids = qa.nr_cids, c;
+
+	if (nr_cids > SCX_QMAP_MAX_CPUS) {
+		scx_bpf_error("-ERANGE");
+		return -1;
+	}
+
+	bpf_for(c, 0, nr_cids) {
+		if (!cmask_test(c, &qa.held_excl.mask) &&
+		    !cmask_test(c, &qa.held_shared.mask))
+			return c;
+	}
+	return -1;
+}
+
 static int weight_to_idx(u32 weight)
 {
 	/* Coarsely map the compound weight to a FIFO. */
@@ -452,6 +470,28 @@ void BPF_STRUCT_OPS(qmap_enqueue, struct task_struct *p, u64 enq_flags)
 		}
 	}
 
+	/*
+	 * Fault injection: deliberately dispatch one of our own tasks to a cid
+	 * we don't hold. The kernel cap check must reject it and re-enqueue
+	 * with SCX_TASK_REENQ_CAP, so nr_inject_attempts tracks nr_reenq_cap
+	 * and proves delivery-time enforcement. Throttled.
+	 */
+	if (qa.inject_mode == QMAP_INJ_WRONG_CID && p->nr_cpus_allowed > 1 &&
+	    !(enq_flags & SCX_ENQ_REENQ)) {
+		static u32 inj_cnt;
+
+		if (!(++inj_cnt % 64)) {
+			s32 bad = first_unavail_cid();
+
+			if (bad >= 0 && cmask_test(bad, &taskc->cpus_allowed)) {
+				__sync_fetch_and_add(&qa.nr_inject_attempts, 1);
+				scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | bad,
+						   slice_ns, enq_flags);
+				return;
+			}
+		}
+	}
+
 	/*
 	 * IMMED stress testing: Every immed_stress_nth'th enqueue, dispatch
 	 * directly to prev_cpu's local DSQ even when busy to force dsq->nr > 1
diff --git a/tools/sched_ext/scx_qmap.c b/tools/sched_ext/scx_qmap.c
index 1efffaaa8fe8..d3f766b692b8 100644
--- a/tools/sched_ext/scx_qmap.c
+++ b/tools/sched_ext/scx_qmap.c
@@ -52,7 +52,7 @@ const char help_fmt[] =
 "\n"
 "Usage: %s [-s SLICE_US] [-e COUNT] [-t COUNT] [-T COUNT] [-l COUNT] [-b COUNT]\n"
 "       [-N COUNT] [-P] [-M] [-H] [-c CG_PATH] [-d PID] [-D LEN] [-S] [-p] [-I]\n"
-"       [-F COUNT] [-i SEC] [-R MS] [-v]\n"
+"       [-F COUNT] [-i SEC] [-R MS] [-J MODE] [-v]\n"
 "\n"
 "  -s SLICE_US   Override slice duration\n"
 "  -e COUNT      Trigger scx_bpf_error() after COUNT enqueues\n"
@@ -74,6 +74,7 @@ const char help_fmt[] =
 "  -C MODE       cid-override test (shuffle|bad-dup|bad-range|bad-mono)\n"
 "  -i SEC        Stats and weight-refresh interval, seconds (default 5)\n"
 "  -R MS         Round-robin period for time-shared cpus, ms (default 200)\n"
+"  -J MODE       Fault injection (wrong-cid: dispatch to a cid not held)\n"
 "  -v            Print libbpf debug messages\n"
 "  -h            Display this help and exit\n";
 
@@ -183,6 +184,7 @@ struct hier_prev {
 	u64 nr_dsps[MAX_SUB_SCHEDS];
 	u64 nr_reenq_cap;
 	u64 nr_reenq_immed;
+	u64 nr_inject_attempts;
 };
 
 /* current wall-clock time as "HH:MM:SS" for the startup and interval headers */
@@ -264,12 +266,14 @@ static void print_hier(struct qmap_arena *qa, struct hier_prev *prev, u64 own_cg
 	}
 
 	format_cid_ranges(qa, CID_SHARED, ranges, sizeof(ranges));
-	printf("hier   : nsub=%llu excl=%u shared=%s rr=%s reenq cap/immed +%llu/+%llu\n",
+	printf("hier   : nsub=%llu excl=%u shared=%s rr=%s reenq cap/immed +%llu/+%llu inj=+%llu\n",
 	       (unsigned long long)qa->nr_sub_scheds, qa->part.nr_excl, ranges, rr,
 	       (unsigned long long)(qa->nr_reenq_cap - prev->nr_reenq_cap),
-	       (unsigned long long)(qa->nr_reenq_immed - prev->nr_reenq_immed));
+	       (unsigned long long)(qa->nr_reenq_immed - prev->nr_reenq_immed),
+	       (unsigned long long)(qa->nr_inject_attempts - prev->nr_inject_attempts));
 	prev->nr_reenq_cap = qa->nr_reenq_cap;
 	prev->nr_reenq_immed = qa->nr_reenq_immed;
+	prev->nr_inject_attempts = qa->nr_inject_attempts;
 
 	printf("hier   : %-4s %10s %4s %6s %8s  %s\n",
 	       "", "cgroup", "w", "alloc", "disp/s", "cids");
@@ -310,6 +314,7 @@ int main(int argc, char **argv)
 	struct hier_prev hprev = {};
 	const char *sub_cg_path = NULL;
 	char tbuf[32];
+	u32 inject_mode = 0;
 	u64 own_cgid = 0;
 
 	libbpf_set_print(libbpf_print_fn);
@@ -330,7 +335,7 @@ int main(int argc, char **argv)
 	skel->rodata->slice_ns = __COMPAT_ENUM_OR_ZERO("scx_public_consts", "SCX_SLICE_DFL");
 	skel->rodata->max_tasks = 16384;
 
-	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:C:i:R:vh")) != -1) {
+	while ((opt = getopt(argc, argv, "s:e:t:T:l:b:N:PMHc:d:D:SpIF:C:i:R:J:vh")) != -1) {
 		switch (opt) {
 		case 's':
 			skel->rodata->slice_ns = strtoull(optarg, NULL, 0) * 1000;
@@ -464,6 +469,12 @@ int main(int argc, char **argv)
 			if (round_robin_ms < 10)
 				round_robin_ms = 10;
 			break;
+		case 'J':
+			if (!strcmp(optarg, "wrong-cid"))
+				inject_mode = QMAP_INJ_WRONG_CID;
+			else
+				inject_mode = strtoul(optarg, NULL, 0);
+			break;
 		case 'v':
 			verbose = true;
 			break;
@@ -480,6 +491,7 @@ int main(int argc, char **argv)
 
 	qa = &skel->arena->qa;
 	qa->test_error_cnt = test_error_cnt;
+	qa->inject_mode = inject_mode;
 
 	if (sub_cg_path)
 		printf("%s scx_qmap started: sub-scheduler on %s, stats every %ds\n",
diff --git a/tools/sched_ext/scx_qmap.h b/tools/sched_ext/scx_qmap.h
index c87d61c37fe1..7b22b0f9737f 100644
--- a/tools/sched_ext/scx_qmap.h
+++ b/tools/sched_ext/scx_qmap.h
@@ -60,6 +60,12 @@ struct qmap_fifo {
 	s32 idx;
 };
 
+/* -J fault-injection modes. Selects inject_mode in struct qmap_arena. */
+enum qmap_inject {
+	QMAP_INJ_OFF		= 0,
+	QMAP_INJ_WRONG_CID	= 1,	/* dispatch to a cid we don't hold */
+};
+
 /*
  * scx_cmask's are embedded in struct qmap_arena with inline backing storage.
  * The bpf side uses &field.mask with the normal cmask_* helpers. Userspace
@@ -167,6 +173,8 @@ struct qmap_arena {
 	/* bpf -> userspace: stats */
 	u64 nr_reenq_cap;		/* SCX_TASK_REENQ_CAP bounces */
 	u64 nr_reenq_immed;		/* SCX_TASK_REENQ_IMMED bounces */
+	u64 nr_inject_attempts;		/* fault-injection: dispatches to an unheld cid */
+	u32 inject_mode;		/* fault-injection mode (QMAP_INJ_*) */
 };
 
 #endif /* __SCX_QMAP_H */
-- 
2.54.0



* Re: [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants Tejun Heo
@ 2026-07-03  8:33   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03  8:33 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 2 potential issues to consider:

New issues:
- [Medium] Missing READ_ONCE() in cmask_word_op1 leaves scx_cmask_empty() and the boundary checks of scx_cmask_subset() vulnerable to data races.

Pre-existing issues:
- [Critical] The kernel trusts inline metadata (nr_cids, base) from BPF arena cmasks without bounds checking, allowing userspace or BPF to trigger massive out-of-bounds zeroing of kernel memory.
--

commit dff79ba0834e741aea85b2036409e261619d8a3c
Author: Tejun Heo <tj@kernel.org>

sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
    
The cmask ops can operate on BPF-arena cmasks which BPF programs may read
and write concurrently. The _RACY op variants existed to make such lockless
reads explicit but this turned out to be too restrictive. Mark the word
accesses in all the two-cmask ops with READ_ONCE/WRITE_ONCE instead and drop
the _RACY variants.

> diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
> index af83084ec740a..4b7279ca740f4 100644
> --- a/kernel/sched/ext/cid.c
> +++ b/kernel/sched/ext/cid.c
[ ... ]
> @@ -422,28 +420,22 @@ static __always_inline bool cmask_word_op2(u64 *av, const u64 *bp, u64 mask,
>  {
>  	switch (op) {
>  	case CMASK_OP2_AND:
> -		*av &= ~mask | *bp;
> +		WRITE_ONCE(*av, *av & (~mask | READ_ONCE(*bp)));
>  		return false;
>  	case CMASK_OP2_OR:
> -		*av |= *bp & mask;
> -		return false;
> -	case CMASK_OP2_OR_RACY:
> -		*av |= data_race(*bp) & mask;
> +		WRITE_ONCE(*av, *av | (READ_ONCE(*bp) & mask));
>  		return false;
>  	case CMASK_OP2_COPY:
> -		*av = (*av & ~mask) | (*bp & mask);
> -		return false;
> -	case CMASK_OP2_COPY_RACY:
> -		*av = (*av & ~mask) | (data_race(*bp) & mask);
> +		WRITE_ONCE(*av, (*av & ~mask) | (READ_ONCE(*bp) & mask));
>  		return false;
>  	case CMASK_OP2_ANDNOT:
> -		*av &= ~(*bp & mask);
> +		WRITE_ONCE(*av, *av & ~(READ_ONCE(*bp) & mask));
>  		return false;
>  	case CMASK_OP2_SUBSET:
>  		/* stop on the first bit in @sub not set in @super */
> -		return (*bp & ~*av) & mask;
> +		return (READ_ONCE(*bp) & ~READ_ONCE(*av)) & mask;
>  	case CMASK_OP2_INTERSECTS:
> -		return (*av & *bp) & mask;
> +		return (READ_ONCE(*av) & READ_ONCE(*bp)) & mask;
>  	}
>  	unreachable();
>  }

[Severity: Medium]
Since READ_ONCE() was added to cmask_word_op2() to prevent data races and load
tearing when reading BPF-arena cmasks locklessly, shouldn't it also be added to
cmask_word_op1() in kernel/sched/ext/cid.c?

For example, when scx_cmask_subset() checks the boundaries:

    if (sub->base < super->base &&
        cmask_any_set_in_range(sub, sub->base, min(super->base, sub_end)))
        return false;

This triggers a lockless cmask_word_op1() on sub, which can be an arena cmask
mutated by BPF:

    case CMASK_OP1_ANY_SET:
        return *ap & mask;

Could this unprotected read leave scx_cmask_empty() and the boundary checks of
scx_cmask_subset() vulnerable to data races?
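
The change being asked about can be sketched in isolation as follows. This is a userspace illustration under stated assumptions, not the kernel code: READ_ONCE_U64() is a stand-in for the kernel's READ_ONCE(), and word_any_set() is a hypothetical name mirroring the CMASK_OP1_ANY_SET case quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's READ_ONCE(): one volatile load. */
#define READ_ONCE_U64(x)	(*(const volatile uint64_t *)&(x))

/*
 * Hypothetical one-operand word op mirroring CMASK_OP1_ANY_SET. The word is
 * loaded through a volatile access, so the compiler is asked to emit a
 * single load rather than re-reading a word that a concurrent writer may be
 * changing.
 */
static bool word_any_set(const uint64_t *ap, uint64_t mask)
{
	return (READ_ONCE_U64(*ap) & mask) != 0;
}
```

The result is still only a snapshot: a concurrent writer can change the word right after the load, so callers have to tolerate stale answers either way.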

[ ... ]
> diff --git a/kernel/sched/ext/cid.h b/kernel/sched/ext/cid.h
> index 9c4f4b907f12d..54b10df32fd50 100644
> --- a/kernel/sched/ext/cid.h
> +++ b/kernel/sched/ext/cid.h
> @@ -57,9 +57,7 @@ void scx_cmask_clear(struct scx_cmask *m);

[Severity: Critical]
This is a pre-existing issue, but does scx_cmask_clear() blindly trust inline
metadata from BPF arena cmasks without bounds checking?

When the kernel allocates set_cmask_scratch in the BPF arena using
scx_arena_alloc(), and later calls scx_cpumask_to_cmask() in
kernel/sched/ext/ext.c:scx_call_op_set_cpumask():

    struct scx_cmask *kern_va = *this_cpu_ptr(sch->set_cmask_scratch);
    ...
    scx_cpumask_to_cmask(cpumask, kern_va);

This eventually calls scx_cmask_clear(kern_va) in kernel/sched/ext/cid.c, which
calculates the memset size based on the cmask's inline fields:

    nr_words = (m->base + m->nr_cids - 1) / 64 - m->base / 64 + 1;
    memset(m->bits, 0, nr_words * sizeof(u64));

If an untrusted BPF program or userspace process concurrently overwrites
m->nr_cids or m->base with a massive value, could this trigger an out-of-bounds
memset that zeroes out kernel memory adjacent to the arena allocation?
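
One possible shape for a defensive clear, as a userspace sketch rather than a
proposed fix. It assumes the allocator records the word capacity separately
from the cmask header; struct cmask, cap_words and cmask_clear_checked() are
hypothetical and the real struct scx_cmask layout may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layout; the real struct scx_cmask may differ. */
struct cmask {
	uint32_t base;
	uint32_t nr_cids;
	uint64_t bits[4];
};

/*
 * Clear @m from a single snapshot of its header fields. @cap_words is the
 * word capacity known to the allocator, not read from @m. Returns false and
 * touches nothing if the snapshot implies more words than the allocation has.
 */
static bool cmask_clear_checked(struct cmask *m, uint32_t cap_words)
{
	/* read each header field exactly once, widened to avoid overflow */
	uint64_t base = *(volatile uint32_t *)&m->base;
	uint64_t nr_cids = *(volatile uint32_t *)&m->nr_cids;
	uint64_t nr_words;

	/* an empty range has nothing to clear */
	if (nr_cids == 0)
		return true;

	nr_words = (base + nr_cids - 1) / 64 - base / 64 + 1;
	if (nr_words > cap_words)
		return false;

	memset(m->bits, 0, nr_words * sizeof(uint64_t));
	return true;
}
```

The word count is computed the same way as in the quoted code; the only
additions are the one-time snapshot and the comparison against a trusted bound.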

>  void scx_cmask_fill(struct scx_cmask *m);
>  void scx_cmask_and(struct scx_cmask *dst, const struct scx_cmask *src);
>  void scx_cmask_or(struct scx_cmask *dst, const struct scx_cmask *src);
> -void scx_cmask_or_racy(struct scx_cmask *dst, const struct scx_cmask *src);
>  void scx_cmask_copy(struct scx_cmask *dst, const struct scx_cmask *src);
> -void scx_cmask_copy_racy(struct scx_cmask *dst, const struct scx_cmask *src);
>  void scx_cmask_andnot(struct scx_cmask *dst, const struct scx_cmask *src);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=3


* Re: [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched Tejun Heo
@ 2026-07-03  9:02   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03  9:02 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] List corruption in `scx_kick_cpu()` and `kick_cpus_irq_workfn()` due to NMI context concurrency.
--

commit 9ac2a72a88974fa9fc65a32e794dca946c726004
Author: Tejun Heo <tj@kernel.org>

sched_ext: Make the kick machinery per-sched

This commit makes the preemption kick machinery per-scheduler instead of
sharing targets across every scheduler on the cpu. It moves the kick masks
into a per-cpu structure and updates the kick machinery to link the sched
onto a per-cpu list that is consumed by a local irq work function.

> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> index 13af43bca8507..8445e34e205fb 100644
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c

[ ... ]

> @@ -7543,15 +7574,24 @@ static void kick_cpus_irq_workfn(struct irq_work *irq_work)
>  
>  	ksyncs = rcu_dereference_bh(ksyncs_pcpu)->syncs;
>  
> -	for_each_cpu(cpu, this_scx->cpus_to_kick) {
> -		should_wait |= kick_one_cpu(cpu, this_rq, ksyncs);
> -		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick);
> -		cpumask_clear_cpu(cpu, this_scx->cpus_to_kick_if_idle);
> -	}
> +	/*
> +	 * Walk scheds with pending kicks on this cpu. scx_kick_cpu() adds to
> +	 * the list under local_irq_save() and only this irq_work consumes it.
> +	 * A plain list without locking is sufficient.
> +	 */
> +	list_for_each_entry_safe(pcpu, tmp, &this_scx->sched_pcpus_to_kick, to_kick_node) {
> +		list_del_init(&pcpu->to_kick_node);

[Severity: High]
Does this code safely handle preemption by non-maskable interrupts?

The comment notes that a plain list without locking is sufficient because
scx_kick_cpu() adds to the list under local_irq_save(). However,
local_irq_save() does not mask NMIs.

If an NMI fires while a CPU is in the middle of list_del_init() here in
kick_cpus_irq_workfn(), and the NMI handler invokes a BPF tracing program
that calls scx_bpf_kick_cpu(), could that concurrently execute
list_add_tail() on this same per-cpu list and corrupt the list pointers?

[ ... ]

> @@ -8499,16 +8541,18 @@ void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
>  			}
>  			raw_spin_rq_unlock(target_rq);
>  		}
> -		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick_if_idle);
> +		cpumask_set_cpu(cpu, pcpu->cpus_to_kick_if_idle);
>  	} else {
> -		cpumask_set_cpu(cpu, this_rq->scx.cpus_to_kick);
> +		cpumask_set_cpu(cpu, pcpu->cpus_to_kick);
>  
>  		if (flags & SCX_KICK_PREEMPT)
> -			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_preempt);
> +			cpumask_set_cpu(cpu, pcpu->cpus_to_preempt);
>  		if (flags & SCX_KICK_WAIT)
> -			cpumask_set_cpu(cpu, this_rq->scx.cpus_to_wait);
> +			cpumask_set_cpu(cpu, pcpu->cpus_to_wait);
>  	}
>  
> +	if (list_empty(&pcpu->to_kick_node))
> +		list_add_tail(&pcpu->to_kick_node, &this_rq->scx.sched_pcpus_to_kick);

[Severity: High]
Can a similar race happen here during concurrent NMI execution?

Since scx_kick_cpu() protects list_add_tail() only with local_irq_save(),
an NMI can interrupt it. If the NMI executes a BPF program that also
triggers scx_kick_cpu(), won't the nested list_add_tail() operations
corrupt the doubly-linked list?
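
For comparison, a lock-free singly-linked push tolerates a nested push because
the head is only ever updated by an atomic compare-and-swap. The sketch below
models that pattern in userspace C11; it is similar in spirit to the kernel's
llist, which this series already uses for ecaps_to_sync. The names lnode,
lhead, lpush() and ldel_all() are hypothetical, and this is not a claim about
how the kick list should be fixed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct lnode {
	struct lnode *next;
};

struct lhead {
	_Atomic(struct lnode *) first;
};

/*
 * Push @n at the head. If a nested push changes the head in between, the
 * compare-and-swap fails, @old is refreshed, and the loop retries.
 */
static void lpush(struct lhead *h, struct lnode *n)
{
	struct lnode *old = atomic_load_explicit(&h->first, memory_order_relaxed);

	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&h->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Detach the whole list in one atomic exchange, as a single consumer would. */
static struct lnode *ldel_all(struct lhead *h)
{
	return atomic_exchange_explicit(&h->first, (struct lnode *)NULL,
					memory_order_acquire);
}
```

A separate "already queued" test, like the list_empty() check in the quoted
hunk, would need its own atomic handling and is not covered by this sketch.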

>  	irq_work_queue(&this_rq->scx.kick_cpus_irq_work);
>  out:
>  	local_irq_restore(irq_flags);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=7


* Re: [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override()
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override() Tejun Heo
@ 2026-07-03  9:51   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03  9:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Concurrent modification of userspace/BPF-accessible arrays `cpu_to_cid` and `shard_start` bypasses validation and leads to out-of-bounds writes.
- [High] Changing the signature of `scx_bpf_cid_override` without renaming the kfunc breaks backward compatibility and `compat.bpf.h` versioning, unconditionally causing load failures on older kernels.
--

commit 8e4715e8e1c648121e31ca200c7493656e918c11
Author: Tejun Heo <tj@kernel.org>

sched_ext: Add shard boundaries to scx_bpf_cid_override()

This commit extends scx_bpf_cid_override() to accept an explicit shard layout
alongside the cpu-to-cid mapping. It updates the validation and assignment
logic to handle shard assignments and modifies the tools/sched_ext compat
wrapper and qmap tests to supply the new arguments.

> diff --git a/kernel/sched/ext/cid.c b/kernel/sched/ext/cid.c
> index 9d75b9311978c..bd0467e8a8d29 100644
> --- a/kernel/sched/ext/cid.c
> +++ b/kernel/sched/ext/cid.c
[ ... ]
> @@ -442,13 +511,56 @@ __bpf_kfunc void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
>  			scx_error(sch, "cid %d assigned to multiple cpus", c);
>  			return;
>  		}
> +	}
> +
> +	for_each_possible_cpu(cpu) {
> +		s32 c = cpu_to_cid[cpu];
> +
>  		scx_cpu_to_cid_tbl[cpu] = c;
>  		scx_cid_to_cpu_tbl[c] = cpu;
>  	}

[Severity: Critical]
Since cpu_to_cid and shard_start point to BPF map memory which can be mmap'ed
and modified concurrently by userspace, can this cause a TOCTOU race leading to
out-of-bounds writes?

If a buggy or malicious userspace program modifies cpu_to_cid[cpu] after the
first validation loop finishes but before this second loop reads it, could 'c'
exceed num_possible_cpus() and cause scx_cid_to_cpu_tbl[c] = cpu to write out
of bounds?

The shard_start array also appears to be read repeatedly across multiple loops
after validation. Would it be safer to kmemdup() these arrays before
validation to ensure consistent state?
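
A minimal sketch of the snapshot-then-validate pattern, assuming a userspace
model where malloc()/memcpy() stand in for kmemdup(). apply_cpu_to_cid() and
its parameters are hypothetical; the point is only that validation and use
both operate on the private copy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy @nr entries of caller-controlled @src into private memory, then
 * validate and consume only the copy. @cid_to_cpu must have room for @nr
 * entries and is written only if every cid is in range and unique.
 * Returns 0 on success, -1 on failure.
 */
static int apply_cpu_to_cid(const int32_t *src, uint32_t nr, int32_t *cid_to_cpu)
{
	int32_t *snap = malloc(nr * sizeof(*snap));
	uint8_t *seen = calloc(nr, 1);
	int ret = -1;
	uint32_t cpu;

	if (!snap || !seen)
		goto out;

	/* the only read of the shared array */
	memcpy(snap, src, nr * sizeof(*snap));

	/* pass 1: validate the snapshot */
	for (cpu = 0; cpu < nr; cpu++) {
		int32_t c = snap[cpu];

		if (c < 0 || (uint32_t)c >= nr || seen[c])
			goto out;
		seen[c] = 1;
	}

	/* pass 2: apply the same values that were validated */
	for (cpu = 0; cpu < nr; cpu++)
		cid_to_cpu[snap[cpu]] = (int32_t)cpu;
	ret = 0;
out:
	free(seen);
	free(snap);
	return ret;
}
```

Because the second pass reads snap[] rather than src[], a later change to the
shared array cannot produce an index that skipped validation.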

[ ... ]
> diff --git a/tools/sched_ext/include/scx/compat.bpf.h b/tools/sched_ext/include/scx/compat.bpf.h
> index 133058578668d..cf469d5ff9ca0 100644
> --- a/tools/sched_ext/include/scx/compat.bpf.h
> +++ b/tools/sched_ext/include/scx/compat.bpf.h
[ ... ]
> @@ -122,15 +122,18 @@ static inline bool scx_bpf_sub_dispatch(u64 cgroup_id)
>  }
>  
>  /*
> - * v7.2: scx_bpf_cid_override() for explicit cpu->cid mapping. Ignore if
> + * v7.3: scx_bpf_cid_override() for explicit cid and shard mapping. Ignore if
>   * missing.
>   */
> -void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz) __ksym __weak;
> +void scx_bpf_cid_override___compat(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
> +				    const s32 *shard_start, u32 shard_start__sz) __ksym __weak;
>  
> -static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz)
> +static inline void scx_bpf_cid_override(const s32 *cpu_to_cid, u32 cpu_to_cid__sz,
> +					 const s32 *shard_start, u32 shard_start__sz)
>  {
>  	if (bpf_ksym_exists(scx_bpf_cid_override___compat))
> -		return scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz);
> +		scx_bpf_cid_override___compat(cpu_to_cid, cpu_to_cid__sz,
> +					      shard_start, shard_start__sz);
>  }

[Severity: High]
Does changing the signature of scx_bpf_cid_override() without renaming the
kfunc break backward compatibility for BPF schedulers loaded on older kernels?

On older kernels that only have the 2-argument version of the kfunc,
bpf_ksym_exists() will still match by name and return true. However, because
the compatibility wrapper now makes a 4-argument call, the BPF verifier will
reject the program due to a BTF signature mismatch.

Should the kfunc be renamed (e.g., scx_bpf_cid_override_v2) so
bpf_ksym_exists() correctly returns false on kernels that don't support the
new signature?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=10


* Re: [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists Tejun Heo
@ 2026-07-03 10:49   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 10:49 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] RCU-001: Remove Before Reclaim Ordering violation during scheduler teardown
- [Medium] RCU publication order violation in scx_link_sched()
--

commit 2105b47ea9f8a129fabfbb3668aa7498dfc13622
Author: Tejun Heo <tj@kernel.org>

sched_ext: RCU-protect the sub-sched tree's children/sibling lists

Future kfuncs need to walk descendants without scx_sched_lock. Make the
walker RCU-safe so that they can. A sub-sched's fields are initialized
before it is linked, so a walk that observes a linked node also observes its
setup.

> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> index c0a3a1ead2833..1e38aaad43323 100644
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c

[ ... ]

> @@ -5492,7 +5493,10 @@ s32 scx_link_sched(struct scx_sched *sch)
>  				break;
>  			}
>  
> -			list_add_tail(&sch->sibling, &parent->children);
> +			list_add_tail_rcu(&sch->sibling, &parent->children);
> +
> +			/* inherit the ancestor bypass state */
> +			WRITE_ONCE(sch->bypass_depth, READ_ONCE(parent->bypass_depth));

[Severity: Medium]
Since the list is now RCU-protected and can be traversed locklessly, does
initializing sch->bypass_depth after list_add_tail_rcu() allow a concurrent
RCU reader (like scx_for_each_descendant_pre()) to observe the new node with
uninitialized or stale bypass state?

Should the payload be initialized before it is published to the list?
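
The ordering in question can be sketched in userspace C11, with a release
store standing in for the publication done by list_add_tail_rcu() and an
acquire load standing in for the RCU reader. struct sched_node, children,
link_child() and first_child_bypass_depth() are hypothetical, and a single
writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sched_node {
	int bypass_depth;
	struct sched_node *next;
};

/* Hypothetical list head standing in for parent->children. */
static _Atomic(struct sched_node *) children;

/* Writer: fill in every field first, then publish with release ordering. */
static void link_child(struct sched_node *child, int parent_bypass_depth)
{
	child->bypass_depth = parent_bypass_depth;	/* payload first */
	child->next = atomic_load_explicit(&children, memory_order_relaxed);
	atomic_store_explicit(&children, child, memory_order_release);
}

/* Reader: an acquire load that sees the node also sees its payload. */
static int first_child_bypass_depth(void)
{
	struct sched_node *n = atomic_load_explicit(&children, memory_order_acquire);

	return n ? n->bypass_depth : -1;
}
```

With the two writer steps swapped, a reader could obtain the node pointer
before bypass_depth is written, which is the window the question is about.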

>  		}
>  #endif	/* CONFIG_EXT_SUB_SCHED */
>  
> @@ -5519,7 +5523,7 @@ void scx_unlink_sched(struct scx_sched *sch)
>  		if (scx_parent(sch)) {
>  			rhashtable_remove_fast(&scx_sched_hash, &sch->hash_node,
>  					       scx_sched_hash_params);
> -			list_del_init(&sch->sibling);
> +			list_del_rcu(&sch->sibling);

[Severity: High]
With these lists now being traversed under RCU, do the teardown paths properly
unlink the scheduler before waiting for grace periods?

Looking at scx_root_disable() and scx_sub_disable(), it appears
synchronize_rcu() (or its expedited variant) is called before
scx_unlink_sched().

Because the scheduler is still in the RCU-protected tree during the grace
period, could new RCU readers discover the scheduler after the grace period
finishes, and access it while its resources are being destroyed?

Should the node be unlinked before calling synchronize_rcu() to prevent new
readers from finding it?

>  		}
>  #endif	/* CONFIG_EXT_SUB_SCHED */
>  		list_del_rcu(&sch->all);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=14


* Re: [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers Tejun Heo
@ 2026-07-03 11:17   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 11:17 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Partial allocation failure in `scx_alloc_pshards()` leaves the `pshard` array exposed, causing a NULL pointer dereference if the BPF `ops.exit` hook reads its own caps.
--

commit 80fa600c75235b9a81ec7604c35ea00c8688f882
Author: Tejun Heo <tj@kernel.org>

sched_ext: Add per-shard cap delegation for sub-schedulers

This patch introduces per-shard capability delegation for sub-schedulers in
sched_ext. It adds the framework and locking around shards.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 1e84f46201767..e7259623fa3c1 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> @@ -733,6 +782,281 @@ __bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *
>  				  true);
>  }
>  
[ ... ]
> +/**
> + * scx_bpf_sub_caps - Read self's or a direct child's cap cmasks
> + * @cgroup_id: 0 for self, or a direct child's cgroup id
> + * @caps: one or more SCX_CAP_* bits
> + * @out__ign: arena cmask to receive the union of @caps within its range
> + * @aux: implicit BPF argument
> + *
> + * Read the cap cmasks granted on each cid for self (@cgroup_id 0) or a direct
> + * child - the literal granted set. A sched can read only itself or a direct
> + * child.
> + *
> + * Return 0, -ENODEV if @cgroup_id names no direct child, or -EINVAL on bad
> + * inputs.
> + */
> +__bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out__ign,
> +				 const struct bpf_prog_aux *aux)
> +{
[ ... ]
> +	/*
> +	 * The target's caps storage may not be set up yet (e.g. a self-read
> +	 * during ops.init_cids()).
> +	 */
> +	if (unlikely(!target->pshard)) {
> +		scx_error(sch, "scx_bpf_sub_caps() called before caps storage is initialized");
> +		return -ENODEV;
> +	}
> +
> +	ret = scx_cmask_ref_init(sch, out__ign, &ref);
> +	if (ret) {
> +		scx_error(sch, "invalid out (%d)", ret);
> +		return ret;
> +	}
> +
> +	for (si = ref.shard_first; si < ref.shard_end; si++) {
> +		const struct scx_cid_shard *shard = &scx_cid_shard_ranges[si];
> +		SCX_CMASK_DEFINE_SHARD(local_out, shard->base_cid, shard->nr_cids);
> +		u32 cap_bit;
> +
> +		scx_for_each_cap_bit(cap_bit, caps)
> +			scx_cmask_or(local_out, &target->pshard[si]->caps[cap_bit].cmask);

[Severity: High]
If alloc_pshard() fails inside scx_alloc_pshards(), will this leave
sch->pshard populated with some NULL elements?

kernel/sched/ext/sub.c:scx_alloc_pshards() {
...
	sch->pshard = kzalloc_objs(sch->pshard[0], scx_nr_cid_shards, GFP_KERNEL);
	if (!sch->pshard)
		return -ENOMEM;

	sch->nr_pshards = scx_nr_cid_shards;

	for (si = 0; si < scx_nr_cid_shards; si++) {
		sch->pshard[si] = alloc_pshard(sch, si, scx_shard_node[si]);
		if (!sch->pshard[si])
			return -ENOMEM;
	}
...
}

If this returns -ENOMEM, the err_disable path in scx_sub_enable_workfn() is
triggered, which eventually calls the scheduler's BPF ops.exit() hook.

If the BPF program then calls scx_bpf_sub_caps(0, ...) to read its own
capabilities, the earlier check for !target->pshard passes because the array
itself was successfully allocated.

Will this loop then unconditionally dereference target->pshard[si] and
trigger a NULL pointer dereference on the unallocated elements?
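
One way to avoid exposing a half-built array is to populate a local array and
assign it to sch->pshard only after every element exists. The sketch below is
a userspace model with hypothetical struct layouts and a fail_at parameter
added purely to inject a failure; it does not mirror the real
scx_alloc_pshards() signature.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct pshard {
	int caps;
};

struct sched {
	struct pshard **pshard;	/* NULL until every element is allocated */
	size_t nr_pshards;
};

/* @fail_at injects an allocation failure at that index; pass @nr to disable. */
static int alloc_pshards(struct sched *sch, size_t nr, size_t fail_at)
{
	struct pshard **arr = calloc(nr, sizeof(*arr));
	size_t i;

	if (!arr)
		return -1;

	for (i = 0; i < nr; i++) {
		arr[i] = (i == fail_at) ? NULL : calloc(1, sizeof(*arr[i]));
		if (!arr[i])
			goto err;
	}

	/* publish only the fully populated array */
	sch->pshard = arr;
	sch->nr_pshards = nr;
	return 0;
err:
	/* free the elements allocated before the failure, then the array */
	while (i--)
		free(arr[i]);
	free(arr);
	return -1;
}
```

With this shape, the existing !target->pshard check would also cover the
partial-failure case, since the pointer stays NULL until allocation succeeds.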

> +		scx_cmask_ref_copy(&ref, local_out);
> +	}
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=16


* Re: [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks Tejun Heo
@ 2026-07-03 12:05   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 12:05 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Unconditional array indexing with a potentially negative CID in `scx_process_sync_ecaps()` causes out-of-bounds reads for CPUs without a valid CID.
--

commit 8b6bde8377767baa9501eb855a3f0e41f5b5ad02
Author: Tejun Heo <tj@kernel.org>

sched_ext: Maintain per-cpu effective cap copies for single-read checks

This commit introduces scx_sched_pcpu->ecaps to maintain per-cpu effective
capability copies. Grant and revoke operations locklessly queue sync requests
on the target CPU's runqueue, which are then processed under the rq lock.

[ ... ]
> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index c821d604ac9da..08d9367cf2188 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> @@ -283,6 +283,125 @@ static void scx_sub_seed_caps(struct scx_sched *sch)
>  	caps_updated_deliver(&to_deliver);
>  }
>  
[ ... ]
> +void scx_process_sync_ecaps(struct rq *rq)
> +{
> +	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
> +	s32 shard = scx_cid_to_shard[cid];

[Severity: High]
Can this cause an out-of-bounds array access if the CPU does not have a valid
CID?

If scx_cid_init() fails, it can leave scx_cpu_to_cid_tbl with -1 sentinels
for some CPUs. The error handling then disables the scheduler, triggering
scx_sched_free_rcu_work() which iterates over all possible CPUs and calls
scx_process_sync_ecaps() on them.

Because scx_cid_to_shard[cid] is evaluated unconditionally here before the
llist_empty() check, a CID of -1 will index outside the scx_cid_to_shard
array and potentially cause a page fault.
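
A minimal sketch of a guarded lookup, using small hypothetical tables in place
of scx_cpu_to_cid_tbl and scx_cid_to_shard. Moving the lookup below the
llist_empty() check would be another option; this only shows the range check.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4
#define NR_CIDS 3

/* Hypothetical tables; -1 marks a CPU that never received a cid. */
static const int32_t cpu_to_cid[NR_CPUS] = { 0, 1, -1, 2 };
static const int32_t cid_to_shard[NR_CIDS] = { 0, 0, 1 };

/* Return the shard for @cpu, or -1 without indexing if it has no valid cid. */
static int32_t shard_of_cpu(int cpu)
{
	int32_t cid;

	if (cpu < 0 || cpu >= NR_CPUS)
		return -1;

	cid = cpu_to_cid[cpu];
	if (cid < 0 || cid >= NR_CIDS)
		return -1;

	return cid_to_shard[cid];
}
```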

> +	struct llist_node *batch, *pos, *tmp;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (likely(llist_empty(&rq->scx.ecaps_to_sync)))
> +		return;
> +
> +	batch = llist_del_all(&rq->scx.ecaps_to_sync);
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=18


* Re: [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier Tejun Heo
@ 2026-07-03 12:25   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 12:25 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] Concurrent freeing of a sub-scheduler can permanently discard ecaps syncs for all active sub-schedulers on a CPU that is coming online.
- [High] scx_discard_ecaps_to_sync() can hang indefinitely in a worker thread if the target CPU is continuously executing higher-priority tasks, locking up the system RCU workqueue.
- [High] Early clearing of `ecaps_to_sync_node` in `scx_process_sync_ecaps()` causes a Use-After-Free of `pcpu` and `sch` during sub-scheduler teardown.
--

commit f00aae3acf0710a9423504a897d70bfb5d9571f6
Author: Tejun Heo <tj@kernel.org>

sched_ext: Add sub_ecaps_updated() effective-cap change notifier

The patch adds an ops.sub_ecaps_updated() callback fired when a
sub-scheduler's effective caps on a cid change. CPU hotplug uses this
mechanism to zero caps when a CPU goes down, and to resync when coming
back online.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 08d9367cf2188..55437f1d1965f 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c

[ ... ]

> @@ -351,33 +354,130 @@ void scx_process_sync_ecaps(struct rq *rq)
>  	if (likely(llist_empty(&rq->scx.ecaps_to_sync)))
>  		return;
>  
> +	/*
> +	 * ecaps are zeroed while the cpu is inactive and must stay zero.
> +	 * Discard queued syncs instead of processing them - the
> +	 * scx_online_ecaps() reseed re-syncs every sched on activation.
> +	 * cpu_active() clears before the offline zeroing and sets before the
> +	 * reseed is queued, so this test can neither miss a racing sync nor
> +	 * eat the reseed.
> +	 */
> +	if (unlikely(!cpu_active(cpu))) {
> +		discard_queued_syncs(rq);
> +		return;
> +	}
> +
>  	batch = llist_del_all(&rq->scx.ecaps_to_sync);
>  	llist_for_each_safe(pos, tmp, batch) {
>  		struct scx_sched_pcpu *pcpu =
>  			container_of(pos, struct scx_sched_pcpu, ecaps_to_sync_node);
>  		struct scx_pshard *ps = pcpu->sch->pshard[shard];
> +		u64 ecaps;
>  
>  		init_llist_node(pos);
>  
>  		/* pairs with smp_mb() in queue_sync_ecaps(), see there */
>  		smp_mb();
>  
> -		WRITE_ONCE(pcpu->ecaps, calc_effective_caps(ps, cid));
> +		ecaps = calc_effective_caps(ps, cid);
> +		WRITE_ONCE(pcpu->ecaps, ecaps);
> +
> +		/* tell the sched its effective caps on this cid changed */
> +		if (ecaps != pcpu->reported_ecaps &&
> +		    SCX_HAS_OP(pcpu->sch, sub_ecaps_updated) &&
> +		    !scx_bypassing(pcpu->sch, cpu)) {
> +			struct scx_dsp_ctx *dspc = &pcpu->dsp_ctx;
> +
> +			dspc->rq = rq;
> +			/* stash @prev so nested dispatches can access it */
> +			rq->scx.sub_dispatch_prev = prev;
> +			SCX_CALL_OP(pcpu->sch, sub_ecaps_updated, rq, scx_cpu_arg(cpu),
> +				    pcpu->reported_ecaps, ecaps);
> +			rq->scx.sub_dispatch_prev = NULL;
> +			scx_flush_dispatch_buf(pcpu->sch, rq);
> +			pcpu->reported_ecaps = ecaps;

[Severity: High]
Does this sequence create a use-after-free of pcpu?

In scx_process_sync_ecaps(), init_llist_node(pos) marks the pcpu node as
off-list before the callback runs. Later, scx_flush_dispatch_buf() might drop
the runqueue lock if it dispatches to a remote CPU's local DSQ (e.g., via
dispatch_to_local_dsq()).

If the lock is dropped, scx_discard_ecaps_to_sync() executing in the RCU worker
can acquire the lock, observe !llist_on_list(), and return, allowing the
pcpu structure to be freed.

When scx_flush_dispatch_buf() finishes and reacquires the lock,
scx_process_sync_ecaps() then writes to the freed pcpu:

    pcpu->reported_ecaps = ecaps;

> +		}
> +	}
> +}

[ ... ]

>  void scx_discard_ecaps_to_sync(s32 cpu, struct scx_sched_pcpu *pcpu)
>  {
> -	scoped_guard (rq_lock_irqsave, cpu_rq(cpu))
> -		scx_process_sync_ecaps(cpu_rq(cpu));
> +	struct rq *rq = cpu_rq(cpu);
>  
> -	WARN_ON_ONCE(llist_on_list(&pcpu->ecaps_to_sync_node));
> +	while (true) {
> +		scoped_guard (rq_lock_irqsave, rq) {
> +			/*
> +			 * scx_process_sync_ecaps() takes the node off the list
> +			 * before it is done accessing @pcpu but does all of it
> +			 * under the rq lock. Off-list observed under the rq
> +			 * lock guarantees that the sync is complete.
> +			 */
> +			if (!llist_on_list(&pcpu->ecaps_to_sync_node))
> +				return;
> +			if (!scx_enabled() || !scx_rq_online(rq)) {

[Severity: High]
Could this destructively drop valid queued syncs for active sub-schedulers?

Consider a sub-scheduler being freed via scx_sched_free_rcu_work()
concurrently with CPU hotplug (scx_online_ecaps()). scx_rq_online() checks
SCX_RQ_ONLINE, which is set by sched_set_rq_online() after the CPU comes
online.

During the window where !scx_rq_online(rq) is true, discard_queued_syncs()
deletes all queued syncs for all sub-schedulers via llist_del_all().

Checking !cpu_active(cpu) instead (as in scx_process_sync_ecaps) might avoid
this race.

> +				discard_queued_syncs(rq);
> +				return;
> +			}
> +		}
> +		resched_cpu(cpu);
> +		msleep(1);

[Severity: High]
Can this loop hang indefinitely if the target CPU is saturated with higher
priority tasks?

The loop polls until the target CPU processes the pending ecaps sync.
The node is removed when scx_process_sync_ecaps() runs on the target CPU,
which is called from balance_one().

Because ext_sched_class has a lower priority than the fair and RT classes,
balance_one() is never called while the target CPU is continuously executing
higher-priority tasks. The RCU worker thread would then loop forever
waiting for !llist_on_list().

> +	}
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=19


* Re: [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches Tejun Heo
@ 2026-07-03 12:57   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 12:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] Infinite loop of deferred requeues for exiting tasks or tasks on offline CPUs when the sub-scheduler lacks capabilities.
- [High] Tasks undergoing migration (with p->migration_pending set) are incorrectly rejected to reject_dsq and then skipped during reenqueuing, leaving them stranded indefinitely.
--

commit a1844a55d6fd4ac62f0bf384be1b2b5048f269d7
Author: Tejun Heo <tj@kernel.org>

sched_ext: Add reject DSQ for cap-rejected dispatches

When a sub-scheduler dispatches a task to a CPU it lacks the required
capability on, the task must be rejected rather than allowed to run.
This patch adds a reject DSQ to divert these tasks and hand them back
to the BPF scheduler to re-decide.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 55437f1d1965f..aea63484edc5a 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c

[ ... ]

> @@ -204,6 +204,116 @@ void scx_init_root_caps(struct scx_sched *sch)
>  	}
>  }
>  
> +struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq *rq,
> +					       struct task_struct *p, u64 *enq_flags)
> +{
> +	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
> +	u64 missing = scx_missing_caps(sch, cpu_of(rq), scx_caps_for_enq(*enq_flags));
> +
> +	/* requirements met */
> +	if (likely(!missing))
> +		return &rq->scx.local_dsq;
> +
> +	/*
> +	 * A migration-disabled task must run on this CPU. Let it run and count
> +	 * the violation.
> +	 */
> +	if (unlikely(is_migration_disabled(p))) {
> +		__scx_add_event(sch, SCX_EV_SUB_FORCED_ADMIT, 1);
> +		return &rq->scx.local_dsq;
> +	}

[Severity: High]
Could this create an infinite loop of deferred requeues for exiting tasks
or tasks on offline CPUs?

If a task is exiting (PF_EXITING) or enqueued on an offline CPU while
the sub-scheduler lacks capabilities on that CPU, scx_do_enqueue_task()
routes it directly to the local DSQ:

scx_do_enqueue_task() {
    if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
        unlikely(p->flags & PF_EXITING)) {
        __scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
        goto local;
    }
}

Since these tasks aren't forcefully admitted here in scx_local_or_reject_dsq()
like migration-disabled tasks are, they are diverted to the reject_dsq.

Then scx_reenq_reject() attempts to re-enqueue them, which triggers the
same goto local path in scx_do_enqueue_task(), resulting in an endless cycle.
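
To make the cycle concrete, the decision can be modelled as a small pure
function. The sketch below is a userspace model with hypothetical names
(task_view, local_or_reject()); it treats exiting tasks and offline-CPU
enqueues like the migration-disabled case. Whether forced admission is the
right policy for those tasks is an assumption of the sketch, not something the
patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum dsq_choice { DSQ_LOCAL, DSQ_REJECT };

/* Hypothetical, reduced view of the task state the decision depends on. */
struct task_view {
	bool migration_disabled;
	bool exiting;		/* models PF_EXITING */
	bool cpu_offline;	/* models an enqueue on an offline CPU */
};

/*
 * Pick the local DSQ when no caps are missing, or when re-enqueueing would
 * only send the task straight back to the same local DSQ. Otherwise divert
 * it to the reject DSQ so the scheduler can re-decide.
 */
static enum dsq_choice local_or_reject(const struct task_view *t,
				       uint64_t missing_caps)
{
	if (!missing_caps)
		return DSQ_LOCAL;

	/* forced admit: rejecting these tasks would loop */
	if (t->migration_disabled || t->exiting || t->cpu_offline)
		return DSQ_LOCAL;

	return DSQ_REJECT;
}
```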

> +
> +	p->scx.reenq_reason_caps = missing;
> +	p->scx.reenq_reason_cid = cid;

[ ... ]

> +void scx_reenq_reject(struct rq *rq)
> +{
> +	LIST_HEAD(tasks);
> +	struct task_struct *p, *n;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (list_empty(&rq->scx.reject_dsq.list))
> +		return;
> +
> +	/*
> +	 * Move to a private list so a task re-rejected by the
> +	 * scx_do_enqueue_task() below isn't revisited this round.
> +	 */
> +	list_for_each_entry_safe(p, n, &rq->scx.reject_dsq.list, scx.dsq_list.node) {
> +		/* migration_pending tasks should have bypassed to local DSQ */
> +		if (WARN_ON_ONCE(p->migration_pending))
> +			continue;

[Severity: High]
Are we permanently stranding tasks that have p->migration_pending set?

During an active migration, a task with p->migration_pending might be
routed to a new CPU. If the sub-scheduler lacks capabilities on that CPU,
scx_local_or_reject_dsq() does not forcefully admit it, and instead
diverts the task to the reject_dsq.

When this list is processed here, the task triggers this warning and is
skipped via the continue statement, meaning it never gets re-enqueued
and is left stranded indefinitely.

> +
> +		scx_dispatch_dequeue(rq, p);
> +
> +		if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=21


* Re: [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice Tejun Heo
@ 2026-07-03 13:34   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 13:34 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] scx_bpf_task_set_slice() assumes holding task_rq(p) implies exclusive access, but blocked tasks can migrate locklessly, allowing capability bypass and data races.
--

commit d3cb4ee9fc08ad3b3488c4dd100374ef3a61801c
Author: Tejun Heo <tj@kernel.org>

sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice

This commit ties CPU occupancy to the SCX_CAP_BASE capability, ensuring
that a task's slice can only be extended if its scheduler holds baseline
CPU access. It updates scx_bpf_task_set_slice() to apply the slice
synchronously when the rq lock is held, or out-of-band otherwise.

> diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
> --- a/kernel/sched/ext/ext.c
> +++ b/kernel/sched/ext/ext.c

[ ... ]

> @@ -8626,20 +8761,45 @@ __bpf_kfunc_start_defs();
>  __bpf_kfunc bool scx_bpf_task_set_slice(struct task_struct *p, u64 slice,
>  					const struct bpf_prog_aux *aux)
>  {
>  	struct scx_sched *sch;
> +	struct rq *rq = task_rq(p);
>  
>  	guard(rcu)();
>  	sch = scx_prog_sched(aux);
>  	if (unlikely(!sch || !scx_task_on_sched(sch, p)))
>  		return false;
>  
> -	set_task_slice(p, slice);
> +	/*
> +	 * Out of band: stash and apply under the rq lock at the next drain,
> +	 * where it is re-validated against @p's current owner.
> +	 */
> +	if (scx_locked_rq() != rq) {
> +		set_task_slice_oob(sch, p, slice);
> +		return true;
> +	}
> +
> +	/* under the rq lock: apply now, extensions gated on baseline access */
> +	if (slice > p->scx.slice &&
> +	    unlikely(scx_missing_caps(sch, cpu_of(rq), SCX_CAP_BASE)))

[Severity: High]
Does this direct write path allow a capability bypass and a data race for
blocked tasks?

If a BPF program on CPU A holds the rq_A lock and calls
scx_bpf_task_set_slice() on a blocked task 'p' that last ran on CPU A,
task_rq(p) will return rq_A. This matches scx_locked_rq(), allowing the code
to skip the out-of-band path.

Could try_to_wake_up() concurrently migrate 'p' to CPU B in this scenario?
Since try_to_wake_up() only needs to hold pi_lock (not the rq_A lock) while
changing p->cpu, the BPF program might directly write to p->scx.slice at the
same time update_curr_scx() is running on CPU B.

Additionally, since the capability check above uses the old CPU (rq_A) rather
than the new CPU the task is migrating to, would this allow a sub-scheduler
to bypass the SCX_CAP_BASE revocation check for the new CPU?

> +		__scx_add_event(sch, SCX_EV_SLICE_DENIED, 1);
> +	else
> +		set_task_slice(p, slice);
> +
>  	return true;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=25

* Re: [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT Tejun Heo
@ 2026-07-03 14:01   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 14:01 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Inconsistent capability enforcement when a sub-scheduler dispatches a descendant's task to a local DSQ using SCX_ENQ_PREEMPT.
--

commit 35cbaaff89094ffe7ebdd01799138d2095bbc3dd
Author: Tejun Heo <tj@kernel.org>

sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT

A kick forces a scheduling event on the target cpu, and a preemption also
evicts the running task. Gate both on caps. Any kick requires baseline
access on the cid, and preempting a task the sub-sched does not own -
whether by a SCX_ENQ_PREEMPT insert or a SCX_KICK_PREEMPT kick - requires
the new SCX_CAP_PREEMPT.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 2f1e19db8e72c..67ba352828e05 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> @@ -222,7 +225,12 @@ struct scx_dispatch_q *scx_local_or_reject_dsq(struct scx_sched *sch, struct rq
>  					       struct task_struct *p, u64 *enq_flags)
>  {
>  	s32 cid = __scx_cpu_to_cid(cpu_of(rq));
> -	u64 missing = scx_missing_caps(sch, cpu_of(rq), scx_caps_for_enq(*enq_flags));
> +	u64 needed = scx_caps_for_enq(*enq_flags);
> +	u64 missing;
> +
> +	if (*enq_flags & SCX_ENQ_PREEMPT)
> +		needed |= scx_caps_for_preempt(sch, rq);
> +	missing = scx_missing_caps(sch, cpu_of(rq), needed);

[Severity: High]
This isn't a bug introduced by this patch, but the newly added SCX_CAP_PREEMPT
check seems to expose a pre-existing context loss issue.

Can this lead to spurious preemption failures when a sub-scheduler dispatches a
descendant's task remotely?

When a sub-scheduler (e.g., A) dispatches a descendant's task (e.g., owned by
B) to a remote CPU using SCX_ENQ_PREEMPT, dispatch_to_local_dsq() delegates to
move_remote_task_to_local_dsq(), which doesn't pass the active scheduler
context (A).

Later, when the task is activated on the remote CPU:

enqueue_task_scx() {
    ...
    struct scx_sched *sch = scx_task_sched(p);
    ...
}

This reconstructs the scheduler context from the task itself, yielding B
instead of A.

When execution reaches scx_local_or_reject_dsq(), the needed capabilities are
checked against B instead of A. Could this incorrectly reject the task, moving
it to the reject_dsq and creating an infinite scheduling loop that triggers the
sched_ext watchdog?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=27

* Re: [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds Tejun Heo
@ 2026-07-03 14:14   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 14:14 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Double delivery of `ops.update_idle(..., true)` to BPF schedulers during an idle transition if a forced re-notify is pending.
--

commit f8054dfa3b3f832d674e303615ccbb3864f6e4c8
Author: Tejun Heo <tj@kernel.org>

sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds

This commit routes ops.update_idle() to all sub-schedulers that hold base
capabilities on a CPU, rather than just the root scheduler. It also adds a
re-notify mechanism to flush suppressed idle transitions when a scheduler
leaves bypass mode or gains baseline access.

> diff --git a/kernel/sched/ext/idle.c b/kernel/sched/ext/idle.c
> index 8e8c6201b7dfd..04b320f89b6f2 100644
> --- a/kernel/sched/ext/idle.c
> +++ b/kernel/sched/ext/idle.c

[ ... ]

> @@ -730,6 +731,46 @@ static void update_builtin_idle(int cpu, bool idle)
>  	}
>  }
>  
> +/*
> + * Notify schedulers of an idle transition on @cpu's cid, delivering to every
> + * sched that holds %SCX_CAP_BASE on the cid (the root holds every cap). A real
> + * transition (@do_notify) reaches all holders. A forced one (@root_renotify for
> + * the root, a sub-sched's idle_renotify marker for a sub) reaches only the owed
> + * scheds.
> + */
> +static void scx_idle_notify(struct rq *rq, bool idle, bool do_notify, bool root_renotify)
> +{
> +	s32 cpu = cpu_of(rq);
> +	s32 cid = scx_cpu_arg(cpu);
> +	struct scx_sched *pos;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	pos = scx_next_descendant_pre(NULL, scx_root);
> +	while (pos) {
> +		bool forced = false;
> +
> +		if (unlikely(scx_missing_caps(pos, cpu, SCX_CAP_BASE))) {
> +			pos = scx_skip_subtree_pre(pos, scx_root);
> +			continue;
> +		}
> +
> +		if (pos == scx_root) {
> +			forced = root_renotify;
> +		}
> +#ifdef CONFIG_EXT_SUB_SCHED
> +		else if (per_cpu_ptr(pos->pcpu, cpu)->idle_renotify) {
> +			per_cpu_ptr(pos->pcpu, cpu)->idle_renotify = false;
> +			forced = true;
> +		}
> +#endif
> +		if ((do_notify || forced) && SCX_HAS_OP(pos, update_idle) &&
> +		    !scx_bypassing(pos, cpu))
> +			SCX_CALL_OP(pos, update_idle, rq, cid, idle);
> +		pos = scx_next_descendant_pre(pos, scx_root);
> +	}
> +}

[ ... ]

> @@ -772,20 +812,26 @@ void __scx_update_idle(struct rq *rq, bool idle, bool do_notify)
>  			update_builtin_idle(cpu, idle);
>  
>  	/*
> -	 * Trigger ops.update_idle() only when transitioning from a task to
> -	 * the idle thread and vice versa.
> -	 *
> -	 * Idle transitions are indicated by do_notify being set to true,
> -	 * managed by put_prev_task_idle()/set_next_task_idle().
> +	 * ops.update_idle() fires on real idle transitions, indicated by
> +	 * @do_notify and managed by put_prev_task_idle()/set_next_task_idle().
> +	 * An idle pick also fires it to flush a forced notify owed to a sched
> +	 * that missed transitions while bypassed or on a cid it just gained.
> +	 * unbypass_renotify_idle() and scx_process_sync_ecaps() arm the per-rq
> +	 * gates, and scx_idle_notify() targets the owed scheds.
>  	 *
> -	 * This must come after builtin idle update so that BPF schedulers can
> -	 * create interlocking between ops.update_idle() and ops.enqueue() -
> +	 * This must come after the builtin idle update so that BPF schedulers
> +	 * can create interlocking between ops.update_idle() and ops.enqueue() -
>  	 * either enqueue() sees the idle bit or update_idle() sees the task
>  	 * that enqueue() queued.
>  	 */
> -	if (SCX_HAS_OP(sch, update_idle) && do_notify &&
> -	    !scx_bypassing(sch, cpu_of(rq)))
> -		SCX_CALL_OP(sch, update_idle, rq, scx_cpu_arg(cpu_of(rq)), idle);
> +	if (do_notify ||
> +	    (idle && (rq->scx.flags &
> +		      (SCX_RQ_SUB_IDLE_RENOTIFY | SCX_RQ_ROOT_IDLE_RENOTIFY)))) {
> +		bool root_renotify = rq->scx.flags & SCX_RQ_ROOT_IDLE_RENOTIFY;
> +
> +		rq->scx.flags &= ~(SCX_RQ_SUB_IDLE_RENOTIFY | SCX_RQ_ROOT_IDLE_RENOTIFY);
> +		scx_idle_notify(rq, idle, do_notify, root_renotify);
> +	}
>  }

[Severity: High]
Does this code introduce a regression where ops.update_idle(..., true) is
delivered twice consecutively to BPF schedulers?

If idle_renotify is armed for a scheduler (e.g. via scx_bypass() or
scx_process_sync_ecaps()) while a CPU is active, and the CPU subsequently
goes idle:

1. __pick_next_task() calls pick_task_idle()
2. pick_task_idle() calls __scx_update_idle() with idle=true and do_notify=false
3. __scx_update_idle() sees the renotify flags, clears them, and calls scx_idle_notify()
4. scx_idle_notify() delivers update_idle(..., true) to the owed scheduler because forced is true.

Immediately after, __pick_next_task() calls put_prev_set_next_task(), which
calls set_next_task_idle():

5. set_next_task_idle() calls __scx_update_idle() with idle=true and do_notify=true
6. __scx_update_idle() sees do_notify=true and calls scx_idle_notify()
7. scx_idle_notify() delivers update_idle(..., true) to all schedulers, including the owed scheduler again.

Could this double delivery violate edge-trigger semantics and cause BPF
schedulers maintaining idle CPU lists to push the CPU multiple times?
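
The sequence above can be replayed in a small userspace model. This is only
a sketch: struct rq_model, update_idle_model() and run_idle_transition() are
hypothetical names, the two renotify flags are collapsed into one bool, and
only a single owed scheduler is counted.

```c
#include <stdbool.h>

/* Hypothetical per-rq state; only the renotify gate is modelled. */
struct rq_model {
	bool renotify;		/* models the SCX_RQ_*_IDLE_RENOTIFY flags */
	int idle_deliveries;	/* update_idle(..., true) calls seen by the owed sched */
};

/* Models the posted __scx_update_idle() condition for one owed scheduler. */
static void update_idle_model(struct rq_model *rq, bool idle, bool do_notify)
{
	if (do_notify || (idle && rq->renotify)) {
		rq->renotify = false;
		if (idle)
			rq->idle_deliveries++;
	}
}

/* Replays the running -> idle sequence from steps 1-7. */
static int run_idle_transition(bool renotify_pending)
{
	struct rq_model rq = { .renotify = renotify_pending };

	update_idle_model(&rq, true, false);	/* pick_task_idle() */
	update_idle_model(&rq, true, true);	/* set_next_task_idle() */
	return rq.idle_deliveries;
}
```

In this model run_idle_transition(true) counts two idle deliveries and
run_idle_transition(false) counts one, which is the asymmetry being asked
about.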

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=28

* Re: [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass Tejun Heo
@ 2026-07-03 14:28   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 14:28 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:

Pre-existing issues:
- [High] Concurrent CPU hotplug, bypass toggle, or scx_bpf_sub_grant() crashes due to a NULL pshard dereference when a cpu-form sub-scheduler is linked but not yet rejected.
--

commit 8e75c7343c899785bda5b0be0886c20991519a76
Author: Tejun Heo <tj@kernel.org>

sched_ext: Replay ecaps notifications suppressed by bypass

scx_process_sync_ecaps() consumes ecaps syncs while the sched is bypassing
without delivering ops.sub_ecaps_updated(), leaving reported_ecaps stale.
This patch re-queues a sync for every (sched, cpu) with an undelivered delta
at the per-cpu bypass exit in scx_bypass(), next to the idle renotify
catch-up.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 90caf76db8bf6..15edcf4f81ee2 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> @@ -550,6 +550,41 @@ void scx_process_sync_ecaps(struct rq *rq, struct task_struct *prev)
>  		scx_schedule_reenq_local(rq, SCX_REENQ_CAP_REVOKE);
>  }
>  
> +/**
> + * scx_unbypass_replay_ecaps - Replay a bypass-suppressed ecaps notification
> + * @rq: rq of the cpu leaving bypass
> + * @sch: scheduler that just left bypass on @rq's cpu
> + *
[ ... ]
> + */
> +void scx_unbypass_replay_ecaps(struct rq *rq, struct scx_sched *sch)
> +{
> +	s32 cpu = cpu_of(rq);
> +	struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
> +	struct scx_pshard *ps;
> +	s32 cid;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/* root holds every cap and never uses ecaps */
> +	if (!sch->level)
> +		return;
> +
> +	if (READ_ONCE(pcpu->ecaps) == pcpu->reported_ecaps)
> +		return;
> +
> +	cid = __scx_cpu_to_cid(cpu);
> +	ps = sch->pshard[scx_cid_to_shard[cid]];
> +
> +	guard(raw_spinlock)(&ps->lock);

[Severity: High]
This is a pre-existing issue and was not introduced by the patch under review,
but this new function also appears to be affected by it. Can this cause a NULL
pointer dereference if sch->pshard is NULL?

If a user attaches a cpu-form scheduler (where !sch->is_cid_type) as a
sub-scheduler, scx_alloc_pshards() returns 0 without allocating sch->pshard.
The scheduler is then added to the descendant tree via scx_link_sched().

Before it is rejected by scx_validate_ops() and unlinked, a concurrent
parent lifting bypass can trigger this new scx_unbypass_replay_ecaps()
path, unconditionally accessing sch->pshard.

This pre-existing issue also seems to affect scx_online_ecaps(),
scx_bpf_sub_grant(), and scx_bpf_sub_revoke() where pshard is accessed
unconditionally. Could this be triggered by a concurrent CPU hotplug or BPF call
during this window?

> +	queue_sync_ecaps(sch, cid);
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=29

* Re: [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler Tejun Heo
@ 2026-07-03 14:45   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 14:45 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Out-of-bounds read due to missing `__str` suffix on the `fmt` argument in `scx_bpf_sub_kill_bstr`.
--

commit dae92dbfa20b120640a7210997136f2e4d4af380
Author: Tejun Heo <tj@kernel.org>

sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler

This commit adds scx_bpf_sub_kill() to allow a cid-form scheduler to evict
a direct child sub-scheduler. It exposes BPF string formatting infrastructure
and introduces a new exit kind, SCX_EXIT_PARENT_KILL, to handle the teardown.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> index 15edcf4f81ee2..4baa6f7be71bd 100644
> --- a/kernel/sched/ext/sub.c
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> @@ -1636,6 +1636,63 @@ __bpf_kfunc s32 scx_bpf_sub_caps(u64 cgroup_id, u64 caps, struct scx_cmask *out_
>  	return 0;
>  }
>  
> +/**
> + * scx_bpf_sub_kill_bstr - Kill a direct child sub-scheduler
> + * @cgroup_id: cgroup id of the direct child to kill
> + * @fmt: reason message format string
> + * @data: format string parameters packaged using ___bpf_fill() macro
> + * @data__sz: @data len, must end in '__sz' for the verifier
> + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
[ ... ]
> + */
> +__printf(2, 0)
> +__bpf_kfunc s32 scx_bpf_sub_kill_bstr(u64 cgroup_id, char *fmt,
> +				      unsigned long long *data, u32 data__sz,
> +				      const struct bpf_prog_aux *aux)

[Severity: High]
Does this kfunc definition miss the __str suffix on the fmt parameter?

Because the argument is named fmt instead of fmt__str, the BPF verifier
treats it as a standard memory pointer (KF_ARG_PTR_TO_MEM) with a size of
1 byte (the size of char), rather than as a null-terminated string.

If a BPF program passes a pointer to a 1-byte map value without a null
terminator, the verifier will allow it. Later, this pointer is passed
through scx_bstr_format() to bpf_bprintf_prepare(), which scans for a null
terminator up to UINT_MAX:

kernel/sched/ext/ext.c:__bstr_format() {
    ...
    ret = bpf_bprintf_prepare(fmt, UINT_MAX, data_buf, data__sz / 8,
                              &bprintf_data);
    ...
}

Could this lead to an out-of-bounds read in the kernel if
bpf_bprintf_prepare() scans beyond the 1-byte verified bounds?
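
The gap between "verified bytes" and "bytes scanned" can be illustrated in
userspace. This is a sketch under assumptions: bounded_fmt_len() is a
hypothetical helper, not a kernel or verifier function, and @verified stands
in for whatever size the verifier has vouched for.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the length of the string in @buf if a NUL terminator lies within the
 * first @verified bytes, or -1 if the scan would have to leave that range.
 */
static long bounded_fmt_len(const char *buf, size_t verified)
{
	const char *nul = memchr(buf, '\0', verified);

	return nul ? (long)(nul - buf) : -1;
}
```

A 1-byte buffer holding a non-NUL character yields -1 here, whereas a scan
bounded only by UINT_MAX would keep reading past it.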

> +{
> +	struct scx_sched *parent, *child;
> +	s32 ret;
> +
> +	guard(rcu)();
> +
> +	parent = scx_prog_sched(aux);
> +	if (unlikely(!parent))
> +		return -ENODEV;
> +
[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=30

* Re: [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling
  2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling Tejun Heo
@ 2026-07-03 14:57   ` sashiko-bot
  2026-07-04  0:54     ` Tejun Heo
  0 siblings, 1 reply; 61+ messages in thread
From: sashiko-bot @ 2026-07-03 14:57 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Medium] Missing atomic increment for sub-scheduler dispatch counter (`nr_dsps`).
--

commit ce8113fd0fe8bd38281cc676e6e4f5bc5e262e96
Author: Tejun Heo <tj@kernel.org>

tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling

This commit expands scx_qmap to demonstrate cap-based cid delegation for
hierarchical sub-scheduling. It allows a parent to delegate exclusive cids
to child cgroup sub-schedulers based on cpu.weight. Remaining cids form a
shared pool that is round-robined among participants.

> diff --git a/tools/sched_ext/scx_qmap.bpf.c b/tools/sched_ext/scx_qmap.bpf.c
> index f6cfe63425d36..938a32514b2f6 100644
> --- a/tools/sched_ext/scx_qmap.bpf.c
> +++ b/tools/sched_ext/scx_qmap.bpf.c
[ ... ]
> @@ -567,11 +627,34 @@ void BPF_STRUCT_OPS(qmap_dispatch, s32 cid, struct task_struct *prev)
[ ... ]
> +	} else if (owner >= 0 && owner < MAX_SUB_SCHEDS &&
> +		   qa.sub_sched_ctxs[owner].cgroup_id) {
> +		if (scx_bpf_sub_dispatch(qa.sub_sched_ctxs[owner].cgroup_id))
> +			qa.sub_sched_ctxs[owner].nr_dsps++;

[Severity: Medium]
Since qmap_dispatch() can run concurrently on multiple CPUs delegated to the
same sub-scheduler owner, will this non-atomic increment lead to lost
updates?

When multiple CPUs update this counter simultaneously, it looks like data
races can cause lost increments, resulting in incorrect statistics and an
under-reporting of disp/s in the userspace tool. Would it be better to use
an atomic add here, similar to how other counters are handled?

> +		return;
> +	}
>  
>  	if (!qa.nr_highpri_queued && scx_bpf_dsq_move_to_local(SHARED_DSQ, 0))
>  		return;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260703080159.2314350-1-tj@kernel.org?part=31

* Re: [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
  2026-07-03  8:33   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [Medium] Missing READ_ONCE() in cmask_word_op1 leaves
> scx_cmask_empty() and the boundary checks of scx_cmask_subset()
> vulnerable to data races.

v2 will apply READ_ONCE() to cmask_word_op1() as well - uniform marking
across the cmask word ops is the right call regardless of the current
callers.

> [Critical] The kernel trusts inline metadata (nr_cids, base) from BPF
> arena cmasks without bounds checking, allowing userspace or BPF to
> trigger massive out-of-bounds zeroing of kernel memory.

On scx_cmask_clear() trusting the inline header: it's pre-existing and
worth hardening. v2 will add a hardening patch that sources the memset
bound from kernel state rather than the arena-resident fields.

Thanks.

* Re: [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched
  2026-07-03  9:02   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] List corruption in `scx_kick_cpu()` and `kick_cpus_irq_workfn()`
> due to NMI context concurrency.

scx_bpf_kick_cpu() isn't expected to be called from NMI context, but v2
will add an in_nmi() guard that triggers scx_error() rather than touch
the list.

Thanks.

* Re: [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override()
  2026-07-03  9:51   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [Critical] Concurrent modification of userspace/BPF-accessible arrays
> `cpu_to_cid` and `shard_start` bypasses validation and leads to
> out-of-bounds writes.

On the TOCTOU: v2 will snapshot cpu_to_cid[] and shard_start[] and
validate/use the copies, so a concurrent mutation can't drive an
out-of-bounds write.
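
A minimal sketch of the snapshot-then-validate pattern, in userspace C.
snapshot_cpu_to_cid(), MAX_CPUS and the parameter names are hypothetical and
only model the idea; the actual v2 code may be structured differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CPUS 8	/* hypothetical bound for the sketch */

/*
 * Copy the caller-writable @shared table once, then validate and consume only
 * the private copy so later writes to @shared cannot invalidate the checks.
 */
static bool snapshot_cpu_to_cid(const volatile int *shared, size_t nr_cpus,
				int nr_cids, int *out)
{
	int snap[MAX_CPUS];
	size_t i;

	if (nr_cpus > MAX_CPUS)
		return false;
	for (i = 0; i < nr_cpus; i++)
		snap[i] = shared[i];		/* single read per entry */
	for (i = 0; i < nr_cpus; i++)
		if (snap[i] < 0 || snap[i] >= nr_cids)
			return false;		/* reject before any use */
	memcpy(out, snap, nr_cpus * sizeof(*snap));
	return true;
}
```

Because every later index comes from the private copy, a writer racing on
the shared table can at worst make the call fail, not make it index out of
range.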

> [High] Changing the signature of `scx_bpf_cid_override` without
> renaming the kfunc breaks backward compatibility and `compat.bpf.h`
> versioning, unconditionally causing load failures on older kernels.

On the signature change: scx_bpf_cid_override() ships in v7.2 and its
signature changes in v7.3. That's fine here - the interface isn't
expected to be in use yet, and the feature is still experimental and
under active development.

Thanks.

* Re: [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists
  2026-07-03 10:49   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] RCU-001: Remove Before Reclaim Ordering violation during
> scheduler teardown

On unlink vs. grace period: the synchronize_rcu() before
scx_unlink_sched() is there to drain in-progress ops, not to protect the
free. The sched is freed by scx_sched_free_rcu_work() via
queue_rcu_work() after unlink, which waits for its own grace period, so
the order is unlink -> grace period -> free.

> [Medium] RCU publication order violation in scx_link_sched()

On bypass_depth: v2 will write it before list_add_tail_rcu() so the
payload is set before the node is published.
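
The ordering can be sketched with C11 atomics standing in for the RCU
publish and dereference primitives. struct node_model, publish_node() and
read_depth() are hypothetical names for this illustration only.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node; bypass_depth stands in for the payload being published. */
struct node_model {
	int bypass_depth;
};

static _Atomic(struct node_model *) published;

/* Writer: initialize the payload, then publish with release ordering. */
static void publish_node(struct node_model *n, int depth)
{
	n->bypass_depth = depth;
	atomic_store_explicit(&published, n, memory_order_release);
}

/* Reader: acquire the pointer; a non-NULL result has its payload visible. */
static int read_depth(void)
{
	struct node_model *n = atomic_load_explicit(&published, memory_order_acquire);

	return n ? n->bypass_depth : -1;
}
```

Swapping the two statements in publish_node() would let a reader observe
the node before its payload is written, which is the publication-order
problem raised above.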

Thanks.

* Re: [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers
  2026-07-03 11:17   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Partial allocation failure in `scx_alloc_pshards()` leaves the
> `pshard` array exposed, causing a NULL pointer dereference if the BPF
> `ops.exit` hook reads its own caps.

v2 will free and NULL sch->pshard on a partial scx_alloc_pshards()
failure, so the !pshard guard in scx_bpf_sub_caps() holds.
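
A userspace sketch of the free-and-NULL unwinding, assuming a simple array
of per-shard pointers. struct shard_model, struct sched_model and
alloc_pshards_model() are hypothetical, and @fail_at exists only to inject a
failure for illustration.

```c
#include <stdlib.h>

/* Hypothetical shard object and owner; not the kernel's scx_pshard/scx_sched. */
struct shard_model { int id; };

struct sched_model {
	struct shard_model **pshard;
	int nr_shards;
};

/*
 * Allocate @nr shards. @fail_at injects a failure at that index (-1 for none).
 * On failure everything is freed and ->pshard is left NULL, so a later
 * "!pshard" guard sees a clean state. Returns 0 on success, -1 on failure.
 */
static int alloc_pshards_model(struct sched_model *sch, int nr, int fail_at)
{
	int i;

	sch->nr_shards = 0;
	sch->pshard = calloc(nr, sizeof(*sch->pshard));
	if (!sch->pshard)
		return -1;

	for (i = 0; i < nr; i++) {
		sch->pshard[i] = (i == fail_at) ? NULL : malloc(sizeof(**sch->pshard));
		if (!sch->pshard[i])
			goto err;
		sch->pshard[i]->id = i;
	}
	sch->nr_shards = nr;
	return 0;
err:
	/* Unwind the entries allocated so far, then drop the array itself. */
	while (--i >= 0)
		free(sch->pshard[i]);
	free(sch->pshard);
	sch->pshard = NULL;
	return -1;
}
```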

Thanks.

* Re: [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks
  2026-07-03 12:05   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Unconditional array indexing with a potentially negative CID in
> `scx_process_sync_ecaps()` causes out-of-bounds reads for CPUs without
> a valid CID.

The index isn't reachable in practice - the free path uses
discard_queued_syncs() rather than scx_process_sync_ecaps(), and
balance_one() only runs with a valid cid - but v2 will move the
cid/shard computation below the guards.

Thanks.

* Re: [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier
  2026-07-03 12:25   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Early clearing of `ecaps_to_sync_node` in
> `scx_process_sync_ecaps()` causes a Use-After-Free of `pcpu` and `sch`
> during sub-scheduler teardown.

pcpu is freed only by the post-unlink queue_rcu_work(), and this CPU
stays non-quiescent through balance_one() (the rq-lock handoff keeps
IRQs disabled), so the grace period can't complete before the write.

> [High] Concurrent freeing of a sub-scheduler can permanently discard
> ecaps syncs for all active sub-schedulers on a CPU that is coming
> online.

v2 will change the discard predicate to !cpu_active(cpu). The online
reseed is queued before SCX_RQ_ONLINE is set, so gating on the flag can
discard a live sched's reseed; cpu_active() brackets the reseed
correctly and matches scx_process_sync_ecaps().

> [High] scx_discard_ecaps_to_sync() can hang indefinitely in a worker
> thread if the target CPU is continuously executing higher-priority
> tasks, locking up the system RCU workqueue.

The ext deadline server guarantees ext_sched_class bandwidth, so
balance_one() runs and drains the sync. Disabling the DL server and
saturating the CPU with higher classes is accepted starvation.

Thanks.

* Re: [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches
  2026-07-03 12:57   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Infinite loop of deferred requeues for exiting tasks or tasks
> on offline CPUs when the sub-scheduler lacks capabilities.

This is intended - re-rejection is unbounded by design and a capless
sub is ejected by the stall watchdog, as the function comment notes.

> [High] Tasks undergoing migration (with p->migration_pending set) are
> incorrectly rejected to reject_dsq and then skipped during
> reenqueuing, leaving them stranded indefinitely.

The WARN is an assertion. The migration stopper clears migration_pending
before the re-enqueue, and queued tasks are skipped by
scx_reenq_local(), so such a task shouldn't reach here. If it turns out
one can, force-admitting it like a migration-disabled task is a
straightforward follow-up.

Thanks.

* Re: [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice
  2026-07-03 13:34   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] scx_bpf_task_set_slice() assumes holding task_rq(p) implies
> exclusive access, but blocked tasks can migrate locklessly, allowing
> capability bypass and data races.

v2 will gate the direct write on task_on_rq_queued(p) in addition to
scx_locked_rq() == rq, so a blocked or migrating task takes the
deferred path that re-validates under its real rq lock.
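
The gating can be sketched as a pure decision function in userspace C.
struct task_model, slice_write_path() and @locked_cpu are hypothetical
stand-ins for task_rq(), task_on_rq_queued() and scx_locked_rq(); the real
v2 change may look different.

```c
#include <stdbool.h>

/* Hypothetical model of a task; not the kernel's task_struct. */
struct task_model {
	int cpu;	/* CPU whose rq the task is (or was last) on */
	bool queued;	/* models task_on_rq_queued(p) */
};

enum slice_path { SLICE_DIRECT, SLICE_OOB };

/*
 * Pick the write path for a slice update. @locked_cpu is the CPU whose rq
 * lock the caller holds, or -1 if none (models scx_locked_rq()).
 */
static enum slice_path slice_write_path(const struct task_model *p, int locked_cpu)
{
	/* Direct write only if the task is queued on the rq we hold locked. */
	if (p->queued && locked_cpu == p->cpu)
		return SLICE_DIRECT;
	/* Blocked or foreign-rq tasks take the deferred, re-validated path. */
	return SLICE_OOB;
}
```

A blocked task whose last CPU matches the locked rq now falls through to
the out-of-band path, which is the case the review flagged.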

Thanks.

* Re: [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
  2026-07-03 14:01   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Inconsistent capability enforcement when a sub-scheduler
> dispatches a descendant's task to a local DSQ using SCX_ENQ_PREEMPT.

The consume path already authorizes against the scheduling sched, but
you're right that the remote re-activation
(move_remote_task_to_local_dsq() -> enqueue_task_scx()) reconstructs the
task owner and loses the placer. v2 will carry the scheduling sched
across the activate_task() boundary (a per-rq stash alongside
extra_enq_flags) and authorize the insert against it.

Thanks.

* Re: [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds
  2026-07-03 14:14   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Double delivery of `ops.update_idle(..., true)` to BPF
> schedulers during an idle transition if a forced re-notify is pending.

This only happens when a forced renotify coincides with a real
running->idle transition. v2 will suppress the forced delivery when the
CPU is transitioning (the real notify follows), delivering only in the
already-idle case.

Thanks.

* Re: [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass
  2026-07-03 14:28   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Concurrent CPU hotplug, bypass toggle, or scx_bpf_sub_grant()
> crashes due to a NULL pshard dereference when a cpu-form sub-scheduler
> is linked but not yet rejected.

v2 will move validation (which rejects cpu-form subs) before
scx_link_sched(), so an invalid sub never enters the RCU tree with a
NULL pshard.

Thanks.

* Re: [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler
  2026-07-03 14:45   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [High] Out-of-bounds read due to missing `__str` suffix on the `fmt`
> argument in `scx_bpf_sub_kill_bstr`.

This matches the existing bstr kfuncs (scx_bpf_exit_bstr(),
scx_bpf_error_bstr(), scx_bpf_dump_bstr()). It's worth doing, but as a
change across the whole family - I'll follow up on it separately rather
than fold it into this series.

Thanks.

* Re: [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling
  2026-07-03 14:57   ` sashiko-bot
@ 2026-07-04  0:54     ` Tejun Heo
  0 siblings, 0 replies; 61+ messages in thread
From: Tejun Heo @ 2026-07-04  0:54 UTC (permalink / raw)
  To: sashiko-bot; +Cc: sched-ext

> [Medium] Missing atomic increment for sub-scheduler dispatch counter
> (`nr_dsps`).

v2 will use __sync_fetch_and_add() for nr_dsps, matching the other
counters.
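
For reference, a minimal sketch of the atomic form using the GCC/Clang
__sync_fetch_and_add() builtin. struct sub_ctx_model and count_dispatch()
are hypothetical stand-ins for the scx_qmap per-sub-sched context, not the
actual v2 code.

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-sub-sched context in scx_qmap. */
struct sub_ctx_model {
	uint64_t nr_dsps;
};

/* Atomically bump the dispatch counter; safe against concurrent callers. */
static void count_dispatch(struct sub_ctx_model *ctx)
{
	__sync_fetch_and_add(&ctx->nr_dsps, 1);
}
```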

Thanks.

Thread overview: 61+ messages
2026-07-03  8:01 [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 01/32] sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 02/32] tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 03/32] sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants Tejun Heo
2026-07-03  8:33   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 04/32] tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 05/32] sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 06/32] sched_ext: Make scx_bpf_kick_cid() return void Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 07/32] sched_ext: Make the kick machinery per-sched Tejun Heo
2026-07-03  9:02   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 08/32] sched_ext: Add ops.init_cids() to finalize the cid layout before init Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 09/32] sched_ext: Add CID sharding Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 10/32] sched_ext: Add shard boundaries to scx_bpf_cid_override() Tejun Heo
2026-07-03  9:51   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 11/32] sched_ext: Defer scx_sched kobj sysfs add into the enable workfns Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 12/32] sched_ext: Add per-shard scx_sched storage scaffolding Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 13/32] sched_ext: Add scx_cmask_ref for validated arena cmask access Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 14/32] sched_ext: RCU-protect the sub-sched tree's children/sibling lists Tejun Heo
2026-07-03 10:49   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 15/32] sched_ext: Add scx_skip_subtree_pre() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 16/32] sched_ext: Add per-shard cap delegation for sub-schedulers Tejun Heo
2026-07-03 11:17   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 17/32] sched_ext: Add coalescing sub_caps_updated() notifier for sub-schedulers Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 18/32] sched_ext: Maintain per-cpu effective cap copies for single-read checks Tejun Heo
2026-07-03 12:05   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 19/32] sched_ext: Add sub_ecaps_updated() effective-cap change notifier Tejun Heo
2026-07-03 12:25   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 20/32] sched_ext: Generalize local-DSQ handling to rq-owned DSQs Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 21/32] sched_ext: Add reject DSQ for cap-rejected dispatches Tejun Heo
2026-07-03 12:57   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 22/32] sched_ext: Add the SCX_CAP_ENQ_IMMED cap Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 23/32] sched_ext: Assign a unique id to each scheduler instance Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 24/32] sched_ext: Route task slice writes through set_task_slice() Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 25/32] sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice Tejun Heo
2026-07-03 13:34   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 26/32] sched_ext: Add the SCX_CAP_ENQ cap Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 27/32] sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT Tejun Heo
2026-07-03 14:01   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 28/32] sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds Tejun Heo
2026-07-03 14:14   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 29/32] sched_ext: Replay ecaps notifications suppressed by bypass Tejun Heo
2026-07-03 14:28   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 30/32] sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler Tejun Heo
2026-07-03 14:45   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 31/32] tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling Tejun Heo
2026-07-03 14:57   ` sashiko-bot
2026-07-04  0:54     ` Tejun Heo
2026-07-03  8:01 ` [PATCH sched_ext/for-7.3 32/32] tools/sched_ext: scx_qmap - Add sub-sched cap fault injection Tejun Heo
