Sched_ext development
* [PATCHSET sched_ext/for-7.3] sched_ext: Capability-based CPU delegation for sub-schedulers
From: Tejun Heo @ 2026-07-03  8:01 UTC
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Hello,

The existing sub-scheduler support only covers a part of scheduling. An
scx_sched can attach to a cgroup subtree under another scx_sched and detach,
and the dispatch path has programmatic delegation: dispatch runs as top-down
recursion via scx_bpf_sub_dispatch(), so a parent decides when a child gets
to pick tasks for the local cpu by calling into it.

Delegation on the enqueue path is not implemented yet. Every scheduler in
the hierarchy can insert into every local DSQ, so a parent has no way to
partition CPUs among its children or protect its own share from them.

This patchset adds that control. Most importantly it enables sub-sched
support on the enqueue path, and it extends the same control to the other
paths that touch cpu access: occupancy, kicks and preemption, and idle
reporting.

The enqueue path cannot use the dispatch path's programmatic model.
Dispatch is top-down and the delegation is implicit in the call chain.
Enqueue starts at the other end: a leaf picks a cid for a task, and which
cids it may use is the cumulative result of its ancestors' delegation
decisions. Resolving that programmatically on every enqueue would mean a
cross-sched round-trip call chain, possibly retrying when a request
cannot be granted as-is. Delegation is therefore state, not calls:

- Ownership is tracked per (scheduler, cid) as a set of capabilities:

  - SCX_CAP_ENQ_IMMED: insert an IMMED task onto the cid's local DSQ.
    This is the baseline cap (SCX_CAP_BASE) required to make any use of
    a cpu.

  - SCX_CAP_ENQ: insert any task onto the cid's local DSQ.

  - SCX_CAP_PREEMPT: preempt a task outside the scheduler's own subtree.

  Higher caps imply the lower ones for the holder's own use.

- The root owns every cap on every cid. A parent delegates caps to a
  child with scx_bpf_sub_grant() and takes them back with
  scx_bpf_sub_revoke(). A child's cap set is always a subset of its
  parent's.

- The cap state is sharded. The cid space is split into topology-aligned
  shards, and each scheduler tracks its caps in per-shard cmasks with a
  per-shard lock. Operations are broken up on shard boundaries and different
  shards never contend. Shards are expected to serve as the locality unit
  when cids are handed out to schedulers, so lock granularity scales
  naturally with the allocation pattern.

- Hot paths check a per-cpu effective-caps copy with a single read.
  Grant/revoke rings a per-cpu doorbell and the owning cpu folds the new
  config in at the top of balance(). Cross-sched communication happens only
  when the delegation set changes.
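
As a rough illustration of the last bullet, the following userspace model
shows a config value, a doorbell and a per-cpu effective copy that hot paths
read once. It is only a sketch: struct cpu_caps, set_config(), fold_ecaps()
and has_ecap() are hypothetical names, the bit values are invented, and the
assumption that the effective copy already contains the implied lower caps
is mine, not something the patches are known to do.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values are invented for this model; only the cap names come from the text. */
enum {
	CAP_ENQ_IMMED	= 1u << 0,	/* baseline cap */
	CAP_ENQ		= 1u << 1,
	CAP_PREEMPT	= 1u << 2,
	CAP_ALL		= CAP_ENQ_IMMED | CAP_ENQ | CAP_PREEMPT,
};

/* Hypothetical per-cpu state of one scheduler (single-threaded model). */
struct cpu_caps {
	uint8_t		config;		/* changed by grant/revoke */
	atomic_bool	doorbell;	/* set when config changed */
	_Atomic uint8_t	ecaps;		/* effective copy read by hot paths */
};

/* Assumed ordering: PREEMPT implies ENQ, ENQ implies ENQ_IMMED. */
static uint8_t expand_implied(uint8_t held)
{
	if (held & CAP_PREEMPT)
		held |= CAP_ENQ;
	if (held & CAP_ENQ)
		held |= CAP_ENQ_IMMED;
	return held;
}

/* Config side: record the new delegation and ring the doorbell. */
static void set_config(struct cpu_caps *cc, uint8_t caps)
{
	assert(!(caps & ~CAP_ALL));
	cc->config = caps;
	atomic_store(&cc->doorbell, true);
}

/* Owning cpu, at the top of its balance pass: fold in a pending change. */
static bool fold_ecaps(struct cpu_caps *cc)
{
	if (!atomic_exchange(&cc->doorbell, false))
		return false;
	atomic_store(&cc->ecaps, expand_implied(cc->config));
	return true;
}

/* Hot path: a single read of the effective copy. */
static bool has_ecap(struct cpu_caps *cc, uint8_t cap)
{
	return (atomic_load(&cc->ecaps) & cap) != 0;
}
```

In this model a set_config() call has no visible effect on has_ecap() until
the owning cpu runs fold_ecaps(), which mirrors the idea that a change takes
effect only when the cpu folds it in.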

Enforcement:

- An insert that lacks the required cap is diverted to a kernel-internal
  per-rq reject DSQ and handed back to the BPF scheduler to re-decide,
  tagged with SCX_TASK_REENQ_CAP and the missing caps.

- A revoke evicts already-queued and running tasks through the same
  reenqueue path.

- CPU occupancy is tied to the task slice. Extending a slice requires
  baseline access on the cpu.

- Kicks are enforced at delivery. Any kick needs baseline access on the
  target cid, and SCX_KICK_PREEMPT degrades to a plain reschedule if the
  victim is outside the kicker's subtree and the kicker lacks
  SCX_CAP_PREEMPT.
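
The insert and kick rules above can be written down as two small decision
functions. This is a userspace sketch, not kernel code: check_insert() and
resolve_kick() are hypothetical helpers, the bit values are invented, ecaps
is assumed to already include implied lower caps, and dropping a kick that
lacks baseline access is my assumption about what "needs baseline access"
means in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values are invented for this model; only the cap names come from the text. */
enum {
	CAP_ENQ_IMMED	= 1u << 0,	/* baseline cap */
	CAP_ENQ		= 1u << 1,
	CAP_PREEMPT	= 1u << 2,
};

enum insert_result { INSERT_LOCAL, INSERT_REJECTED };
enum kick_action { KICK_DROPPED, KICK_RESCHED, KICK_PREEMPT };

/*
 * Decide whether an insert onto a cid's local DSQ is allowed. @ecaps is the
 * inserter's effective caps on the target cid. On rejection, *@missing holds
 * the caps that were lacking, which is what would accompany the task when it
 * is handed back for a new decision.
 */
static enum insert_result check_insert(uint8_t ecaps, bool immed,
				       uint8_t *missing)
{
	uint8_t need = immed ? CAP_ENQ_IMMED : CAP_ENQ;

	*missing = need & ~ecaps;
	return *missing ? INSERT_REJECTED : INSERT_LOCAL;
}

/* Decide what a kick turns into at delivery time. */
static enum kick_action resolve_kick(uint8_t ecaps, bool want_preempt,
				     bool victim_in_subtree)
{
	assert(!(want_preempt && !ecaps) || true);	/* any combination is legal */

	if (!(ecaps & CAP_ENQ_IMMED))
		return KICK_DROPPED;	/* assumption: no baseline, no kick */
	if (!want_preempt)
		return KICK_RESCHED;
	if (victim_in_subtree || (ecaps & CAP_PREEMPT))
		return KICK_PREEMPT;
	return KICK_RESCHED;		/* degraded to a plain reschedule */
}
```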

Notifications:

- ops.sub_caps_updated() reports config-level changes per shard with
  coalescing.

- ops.sub_ecaps_updated() reports per-cpu effective changes when they
  take effect.

- ops.update_idle() is routed to the cid's owner, with a re-notification
  when ownership changes while the cpu sits idle.
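
One way to picture per-shard coalescing is a dirty mask: several changes to
the same shard that land before the notifier runs collapse into a single
report. The sketch below is a single-threaded userspace model under that
assumption; struct caps_notifier, mark_shard_changed() and take_pending()
are hypothetical names and the patches may coalesce differently.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SHARDS	8	/* arbitrary for this model */

/* Hypothetical per-scheduler notifier state. */
struct caps_notifier {
	uint32_t pending;	/* one bit per shard with unreported changes */
};

/* Called for every config change that touches @shard. */
static void mark_shard_changed(struct caps_notifier *n, unsigned int shard)
{
	assert(shard < NR_SHARDS);
	n->pending |= 1u << shard;
}

/*
 * Called when the notifier runs. Returns the shards to report, one report
 * per set bit, and clears the pending state.
 */
static uint32_t take_pending(struct caps_notifier *n)
{
	uint32_t pending = n->pending;

	n->pending = 0;
	return pending;
}
```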

A parent can evict a misbehaving child with scx_bpf_sub_kill().

To make it concrete, consider a root scheduler R with a child A, which
in turn has a child A1:

- On enable, R owns every cap on every cid. A child starts with no caps
  and cannot use any cpu on its own, so a parent normally makes an
  initial grant when a child attaches.

- R grants cids 8-15 to A by calling scx_bpf_sub_grant() with A's cgroup
  id and a cmask. A grant is all-or-nothing per cid and can only hand
  down caps R holds literally, not caps held only through implication.

- A learns of the grant through ops.sub_caps_updated(). As each cpu
  folds the change into its effective caps, A gets
  ops.sub_ecaps_updated() for that cid, and if the cpu is sitting idle,
  an ops.update_idle() re-notification so A can use it right away.

- Nesting chains through the same notifier: from its
  ops.sub_caps_updated(), A calls scx_bpf_sub_grant() to pass a subset, say
  cids 12-15, down to A1, whose own ops.sub_caps_updated() then fires. A1's
  cap set is a subset of A's, which is a subset of R's. Caps A holds only
  through implication cannot be re-delegated.
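
The R, A, A1 chain can be modelled in userspace with one cid bitmask per
cap. The sketch below is not the kernel interface: grant() and cid_range()
are hypothetical helpers, a single 32-bit mask stands in for the per-shard
cmasks, and "all-or-nothing per cid" is read here as "a cid is granted only
if the parent literally holds every requested cap on it".

```c
#include <assert.h>
#include <stdint.h>

/* Cap indices are invented for this model; the names follow the text. */
enum { CAP_IDX_ENQ_IMMED, CAP_IDX_ENQ, CAP_IDX_PREEMPT, NR_CAPS };
#define CAP(idx)	(1u << (idx))

struct sched {
	struct sched	*parent;		/* NULL for the root */
	uint32_t	held[NR_CAPS];		/* cids on which each cap is literally held */
};

/* Mask covering cids @lo through @hi inclusive. */
static uint32_t cid_range(unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi < 32);
	return (UINT32_MAX >> (31 - hi)) & (UINT32_MAX << lo);
}

/*
 * Grant @caps on the cids in @cmask from @child's parent to @child. Only
 * literally held caps count, so caps the parent has purely through
 * implication do not pass. Returns the cids actually granted.
 */
static uint32_t grant(struct sched *child, unsigned int caps, uint32_t cmask)
{
	struct sched *parent = child->parent;
	uint32_t ok = cmask;
	int i;

	for (i = 0; i < NR_CAPS; i++)
		if (caps & CAP(i))
			ok &= parent->held[i];
	for (i = 0; i < NR_CAPS; i++)
		if (caps & CAP(i))
			child->held[i] |= ok;
	return ok;
}
```

With R holding everything, granting the ENQ cap on cids 8-15 to A and then
on cids 12-15 to A1 succeeds, while A trying to pass on the baseline cap it
only holds through implication grants nothing in this model.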

From A's point of view, using and then losing a cpu looks like this:

- Holding SCX_CAP_ENQ on cid 8, A inserts its tasks into cid 8's local
  DSQ from its enqueue and dispatch paths, extends their slices, and
  kicks the cpu as usual.

- R revokes cid 8. The revoke clears it across A's whole subtree, so A1
  loses whatever it held on the cid too.

- Once the revoke reaches the cpu's effective caps, A's tasks queued on
  cid 8 are handed back to A's ops.enqueue() tagged with
  SCX_TASK_REENQ_CAP and the missing caps, a running task is evicted by
  zeroing its slice, and new inserts are rejected and handed back the
  same way.

- A places the bounced tasks somewhere it still holds, and the notifiers
  tell it that its holdings shrank.
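
The revoke sequence above can be sketched as two steps in a userspace
model: clear the cid across the child's subtree, then let the cpu evict
whatever its owners no longer hold. Everything here is illustrative:
revoke(), evict_unheld(), struct task and the reenq_cap flag are
hypothetical stand-ins, and a single mask per scheduler replaces the
per-cap, per-shard state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_SCHEDS	3
enum { SCHED_R, SCHED_A, SCHED_A1 };

/* Fixed hierarchy for the example: R is the root, A under R, A1 under A. */
static const int parent_of[NR_SCHEDS] = { -1, SCHED_R, SCHED_A };

enum task_state { TASK_QUEUED, TASK_RUNNING, TASK_BOUNCED };

struct task {
	int		owner;		/* scheduler the task belongs to */
	unsigned int	cid;		/* cid whose local DSQ it sits on */
	enum task_state	state;
	uint64_t	slice;
	bool		reenq_cap;	/* stands in for SCX_TASK_REENQ_CAP */
};

static bool in_subtree(int s, int top)
{
	for (; s >= 0; s = parent_of[s])
		if (s == top)
			return true;
	return false;
}

/* Clear @cmask from @child and from everything below it. */
static void revoke(uint32_t *held, int child, uint32_t cmask)
{
	for (int s = 0; s < NR_SCHEDS; s++)
		if (in_subtree(s, child))
			held[s] &= ~cmask;
}

/* What the cpu does once the revoke has reached its effective caps. */
static void evict_unheld(const uint32_t *held, struct task *tasks, int nr)
{
	for (int i = 0; i < nr; i++) {
		struct task *t = &tasks[i];

		assert(t->cid < 32);
		if (t->state == TASK_BOUNCED || (held[t->owner] & (1u << t->cid)))
			continue;
		if (t->state == TASK_RUNNING) {
			t->slice = 0;		/* evicted by zeroing the slice */
		} else {
			t->state = TASK_BOUNCED;	/* handed back to its scheduler */
			t->reenq_cap = true;
		}
	}
}
```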

The final two patches expand the scx_qmap hierarchical demo: a qmap
instance splits the cpus it fully owns among itself and child qmaps in
proportion to cpu.weight, time-shares the rounding leftovers through a
round-robin pool, and can fault-inject dispatches to unheld cids to
demonstrate the kernel-side enforcement.
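
For intuition on the weight-proportional split, here is a small arithmetic
sketch. It assumes plain floor division and a simple rotation for the
leftover pool; split_by_weight() and pool_owner() are hypothetical helpers
and scx_qmap may round or rotate differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Give each of @nr instances floor(nr_cpus * weight / total_weight) cpus.
 * Fills @shares and returns how many cpus are left over by rounding.
 */
static unsigned int split_by_weight(unsigned int nr_cpus,
				    const unsigned int *weights,
				    unsigned int nr, unsigned int *shares)
{
	uint64_t total = 0;
	unsigned int used = 0;

	for (unsigned int i = 0; i < nr; i++)
		total += weights[i];
	assert(total > 0);

	for (unsigned int i = 0; i < nr; i++) {
		shares[i] = (unsigned int)((uint64_t)nr_cpus * weights[i] / total);
		used += shares[i];
	}
	return nr_cpus - used;
}

/* Which instance gets a leftover cpu during round @round of the rotation. */
static unsigned int pool_owner(unsigned int round, unsigned int nr)
{
	assert(nr > 0);
	return round % nr;
}
```

With 16 cpus and weights 100, 200 and 300, the floor shares are 2, 5 and 8,
which leaves one cpu for the shared pool.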

Based on sched_ext/for-7.3 (daf8e166ba59).

This patchset contains the following 32 patches.

 0001 sched_ext: Fix premature ops->priv publication in scx_alloc_and_add_sched()
 0002 tools/sched_ext: scx - Fix cmask_subset(), cmask_equal() and cmask_weight()
 0003 sched_ext: Use READ_ONCE/WRITE_ONCE in cmask word ops and drop _RACY variants
 0004 tools/sched_ext: scx_qmap - Use bare u64/u32/s32 integer types
 0005 sched_ext: Reject direct slice and dsq_vtime writes for cid-form schedulers
 0006 sched_ext: Make scx_bpf_kick_cid() return void
 0007 sched_ext: Make the kick machinery per-sched
 0008 sched_ext: Add ops.init_cids() to finalize the cid layout before init
 0009 sched_ext: Add CID sharding
 0010 sched_ext: Add shard boundaries to scx_bpf_cid_override()
 0011 sched_ext: Defer scx_sched kobj sysfs add into the enable workfns
 0012 sched_ext: Add per-shard scx_sched storage scaffolding
 0013 sched_ext: Add scx_cmask_ref for validated arena cmask access
 0014 sched_ext: RCU-protect the sub-sched tree's children/sibling lists
 0015 sched_ext: Add scx_skip_subtree_pre()
 0016 sched_ext: Add per-shard cap delegation for sub-schedulers
 0017 sched_ext: Add coalescing sub_caps_updated() notifier for sub-schedulers
 0018 sched_ext: Maintain per-cpu effective cap copies for single-read checks
 0019 sched_ext: Add sub_ecaps_updated() effective-cap change notifier
 0020 sched_ext: Generalize local-DSQ handling to rq-owned DSQs
 0021 sched_ext: Add reject DSQ for cap-rejected dispatches
 0022 sched_ext: Add the SCX_CAP_ENQ_IMMED cap
 0023 sched_ext: Assign a unique id to each scheduler instance
 0024 sched_ext: Route task slice writes through set_task_slice()
 0025 sched_ext: Tie cpu occupancy to SCX_CAP_BASE through the task slice
 0026 sched_ext: Add the SCX_CAP_ENQ cap
 0027 sched_ext: Gate kicks on SCX_CAP_BASE and preemption on SCX_CAP_PREEMPT
 0028 sched_ext: Route ops.update_idle() to sub-schedulers and re-notify owed scheds
 0029 sched_ext: Replay ecaps notifications suppressed by bypass
 0030 sched_ext: Add scx_bpf_sub_kill() to evict a child sub-scheduler
 0031 tools/sched_ext: scx_qmap - Expand hierarchical sub-scheduling
 0032 tools/sched_ext: scx_qmap - Add sub-sched cap fault injection

The patches are organized as follows:

- 01-05: fixes and small prep.

- 06-15: plumbing. Per-sched kick machinery (06-07), ops.init_cids()
  (08), cid sharding (09-10), sysfs and per-shard storage prep (11-12),
  validated arena cmask access (13) and RCU-safe tree walking (14-15).

- 16-19: cap delegation. Per-shard caps with grant/revoke (16), the
  coalescing sub_caps_updated() notifier (17), per-cpu effective caps
  (18) and the sub_ecaps_updated() notifier (19).

- 20-27: enforcement. The reject DSQ (20-21), SCX_CAP_ENQ_IMMED (22),
  slice-based occupancy (23-25), SCX_CAP_ENQ (26) and preemption and
  kick gating (27).

- 28-30: idle routing and re-notification (28), bypass replay (29) and
  scx_bpf_sub_kill() (30).

- 31-32: scx_qmap hierarchical demo.

The patchset is also available in the following git branch:

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-caps

diffstat follows. Thanks.

 include/linux/sched/ext.h                |   32 +-
 kernel/sched/ext/cid.c                   |  474 ++++++++++---
 kernel/sched/ext/cid.h                   |   15 +-
 kernel/sched/ext/ext.c                   |  792 +++++++++++++++++-----
 kernel/sched/ext/idle.c                  |   68 +-
 kernel/sched/ext/internal.h              |  303 ++++++++-
 kernel/sched/ext/sub.c                   | 1058 +++++++++++++++++++++++++++++-
 kernel/sched/ext/sub.h                   |  117 +++-
 kernel/sched/ext/types.h                 |   72 +-
 kernel/sched/sched.h                     |   14 +-
 tools/sched_ext/include/scx/cid.bpf.h    |   88 ++-
 tools/sched_ext/include/scx/common.bpf.h |   26 +-
 tools/sched_ext/include/scx/compat.bpf.h |   11 +-
 tools/sched_ext/scx_qmap.bpf.c           |  821 +++++++++++++++++++++--
 tools/sched_ext/scx_qmap.c               |  375 ++++++++++-
 tools/sched_ext/scx_qmap.h               |  147 ++++-
 16 files changed, 3979 insertions(+), 434 deletions(-)

--
tejun

