* [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
From: Tejun Heo @ 2026-07-01 18:10 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
Hello,
v2: Fold the promotion of scx_dispatch_sched() into sub.h into the split
(patch 4) so that the split is self-contained. v1 left the promotion to a
later patch, so the split as posted had sub.c calling a file-local static in
ext.c (Andrea). __always_inline is kept. Patches 1-3 are unchanged.
v1: https://lore.kernel.org/all/20260701031429.1892218-1-tj@kernel.org
The sub-scheduler implementation has grown and will keep growing. Move it
out of ext.c into a new kernel/sched/ext/sub.c. The first three patches are
mechanical prep (prefix file-local helpers, expose shared internals, inline
a few trivial helpers) so the move itself stays pure code motion. No
functional change.
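As an illustration only (not part of the series), the third prep step, inlining a trivial helper so that more than one file can use it from a shared header, can be sketched in userspace C. Which helpers patch 3 actually inlines is not shown in this message; the helper below is merely modelled on scx_bypass_dsp_enabled() from patch 1, and struct demo_sched and demo_bypass_dsp_enabled() are hypothetical stand-ins that use C11 atomics instead of the kernel's atomic_t.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one struct scx_sched field used here. */
struct demo_sched {
	atomic_int bypass_dsp_enable_depth;
};

/*
 * A helper this small can be defined as static inline in a shared header.
 * Every file that includes the header gets its own inlinable copy, so no
 * out-of-line definition or separate extern declaration is needed.
 */
static inline bool demo_bypass_dsp_enabled(struct demo_sched *sch)
{
	/* Non-zero depth means the bypass dispatch path is enabled. */
	return atomic_load(&sch->bypass_dsp_enable_depth) != 0;
}
```

The trade-off in this sketch is that the helper's body becomes visible to every includer, which is only reasonable for helpers that are short and depend on nothing file-local.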
Based on sched_ext/for-7.3 (5df6a4506d06) with sched_ext/for-7.2-fixes
(b7d9c359e5cf) assumed merged.
Tejun Heo (4):
  sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
  sched_ext: Expose the ext.c internals used by the sub.c split
  sched_ext: Inline small ext.c helpers shared across the sub.c split
  sched_ext: Split sub-scheduler implementation into sub.c
 kernel/sched/build_policy.c |    2 +
 kernel/sched/ext/ext.c      | 1117 +++++--------------------------------------
 kernel/sched/ext/internal.h |  164 ++++++-
 kernel/sched/ext/sub.c      |  668 ++++++++++++++++++++++++++
 kernel/sched/ext/sub.h      |  161 +++++++
 5 files changed, 1101 insertions(+), 1011 deletions(-)
Thanks.
--
tejun
* [PATCH v2 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
From: Tejun Heo @ 2026-07-01 18:10 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
A later change moves the sub-scheduler implementation out of ext.c into its
own file, from where it calls a number of file-local ext.c helpers. Give
those helpers the scx_ prefix that cross-file sched_ext symbols carry, ahead
of the move so the mechanical rename stays out of the code-motion patch. No
functional change.
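For readers skimming the diff, the rename pattern can be sketched in userspace C (illustration only, not part of the patch). The demo_* types are hypothetical stand-ins for the kernel structures, and the fixed four-slot array replaces the kernel's per-CPU allocation. Only the helper name mirrors the patch.

```c
/* Hypothetical stand-ins, not the kernel definitions. */
struct demo_dsq {
	unsigned int nr;
};

struct demo_pcpu {
	struct demo_dsq bypass_dsq;
};

struct demo_sched {
	struct demo_pcpu pcpu[4];	/* assumed: 4 CPUs, plain array */
};

/*
 * Before the patch the helper carried a generic name, which is fine while
 * only one file can see it:
 *
 *	static struct demo_dsq *bypass_dsq(struct demo_sched *sch, int cpu)
 *
 * After the patch it carries the scx_ prefix. It is still static at this
 * point; only the name changes, so behaviour is identical.
 */
static struct demo_dsq *scx_bypass_dsq(struct demo_sched *sch, int cpu)
{
	return &sch->pcpu[cpu].bypass_dsq;
}
```

Doing the rename while the helper is still static means any missed call site fails to compile within the one file, which keeps the later code-motion patch reviewable as a pure move.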
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext/ext.c | 192 ++++++++++++++++++------------------
kernel/sched/ext/internal.h | 2 +-
2 files changed, 97 insertions(+), 97 deletions(-)
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4e0cd08a6a2e..56e6a13fd0f8 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -369,7 +369,7 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
return __setscheduler_class(p->policy, p->prio);
}
-static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu)
+static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
{
return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
}
@@ -392,11 +392,11 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
sch = scx_parent(sch);
#endif /* CONFIG_EXT_SUB_SCHED */
- return bypass_dsq(sch, cpu);
+ return scx_bypass_dsq(sch, cpu);
}
/**
- * bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
* @sch: scheduler to check
*
* When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
@@ -408,9 +408,9 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
*
* This function checks bypass_dsp_enable_depth which is managed separately from
* bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * disable_bypass_dsp().
+ * scx_disable_bypass_dsp().
*/
-static bool bypass_dsp_enabled(struct scx_sched *sch)
+static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
{
return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
}
@@ -1079,7 +1079,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
}
/**
- * ops_sanitize_err - Sanitize a -errno value
+ * scx_ops_sanitize_err - Sanitize a -errno value
* @sch: scx_sched to error out on error
* @ops_name: operation to blame on failure
* @err: -errno value to sanitize
@@ -1091,7 +1091,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
* value fails IS_ERR() test after being encoded with ERR_PTR() and then is
* handled as a pointer.
*/
-static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
{
if (err < 0 && err >= -MAX_ERRNO)
return err;
@@ -1251,7 +1251,7 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
schedule_deferred(rq);
}
-static void schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
{
struct scx_sched *root = rcu_dereference_sched(scx_root);
@@ -1347,8 +1347,8 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
* to the CPU or dequeued. In both cases, the only way @p can go back to
* the BPF sched is through enqueueing. If being inserted into a local
* DSQ with IMMED, persist the state until the next enqueueing event in
- * do_enqueue_task() so that we can maintain IMMED protection through
- * e.g. SAVE/RESTORE cycles and slice extensions.
+ * scx_do_enqueue_task() so that we can maintain IMMED protection
+ * through e.g. SAVE/RESTORE cycles and slice extensions.
*/
if (enq_flags & SCX_ENQ_IMMED) {
if (unlikely(dsq->id != SCX_DSQ_LOCAL)) {
@@ -1371,7 +1371,7 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
* done yet, @p can't go on the CPU immediately. Re-enqueue.
*/
if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags)))
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
}
}
@@ -1488,9 +1488,9 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
}
}
-static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, struct task_struct *p,
- u64 enq_flags)
+static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, struct task_struct *p,
+ u64 enq_flags)
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1638,7 +1638,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
}
}
-static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
+static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &rq->scx.local_dsq;
@@ -1692,8 +1692,8 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
}
/*
- * Abbreviated version of dispatch_dequeue() that can be used when both @p's rq
- * and dsq are locked.
+ * Abbreviated version of scx_dispatch_dequeue() that can be used when both
+ * @p's rq and dsq are locked.
*/
static void dispatch_dequeue_locked(struct task_struct *p,
struct scx_dispatch_q *dsq)
@@ -1774,10 +1774,10 @@ static void mark_direct_dispatch(struct scx_sched *sch,
* - direct_dispatch(): cleared on the synchronous enqueue path, deferred
* dispatch keeps the state until consumed
* - process_ddsp_deferred_locals(): cleared after consuming deferred state,
- * - do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
+ * - scx_do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
* verdict is ignored (local/global/bypass)
- * - dequeue_task_scx(): cleared after dispatch_dequeue(), covering deferred
- * cancellation and holding_cpu races
+ * - dequeue_task_scx(): cleared after scx_dispatch_dequeue(), covering
+ * deferred cancellation and holding_cpu races
* - scx_disable_task(): cleared for queued wakeup tasks, which are excluded by
* the scx_bypass() loop, so that stale state is not reused by a subsequent
* scheduler instance
@@ -1838,7 +1838,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
ddsp_enq_flags = p->scx.ddsp_enq_flags;
clear_direct_dispatch(p);
- dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
static bool scx_rq_online(struct rq *rq)
@@ -1853,8 +1853,8 @@ static bool scx_rq_online(struct rq *rq)
return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq)));
}
-static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
- int sticky_cpu)
+static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+ int sticky_cpu)
{
struct scx_sched *sch = scx_task_sched(p);
struct task_struct **ddsp_taskp;
@@ -1941,7 +1941,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
direct_dispatch(sch, p, enq_flags);
return;
local_norefill:
- dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
return;
local:
dsq = &rq->scx.local_dsq;
@@ -1962,7 +1962,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
touch_core_sched(rq, p);
refill_task_slice_dfl(sch, p);
clear_direct_dispatch(p);
- dispatch_enqueue(sch, rq, dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags);
}
static bool task_runnable(const struct task_struct *p)
@@ -2031,7 +2031,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
if (rq->scx.nr_running == 1)
dl_server_start(&rq->ext_server);
- do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+ scx_do_enqueue_task(rq, p, enq_flags, sticky_cpu);
if (sticky_cpu >= 0)
p->scx.sticky_cpu = -1;
@@ -2167,7 +2167,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
rq->scx.nr_running--;
sub_nr_running(rq, 1);
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
clear_direct_dispatch(p);
return true;
}
@@ -2215,7 +2215,7 @@ static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_fl
* - A higher-priority wakes up while SCX dispatch is in progress.
*/
if (rq->scx.nr_immed)
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
}
static void move_local_task_to_local_dsq(struct scx_sched *sch,
@@ -2380,7 +2380,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
* values afterwards, as this operation can't be preempted or recurse, the
* holding_cpu can never become this CPU again before we're done. Thus, we can
* tell whether we lost to dequeue by testing whether the holding_cpu still
- * points to this CPU. See dispatch_dequeue() for the counterpart.
+ * points to this CPU. See scx_dispatch_dequeue() for the counterpart.
*
* On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is
* still valid. %false if lost to dequeue.
@@ -2485,14 +2485,14 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
dispatch_dequeue_locked(p, src_dsq);
raw_spin_unlock(&src_dsq->lock);
- dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
}
return dst_rq;
}
-static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, u64 enq_flags)
+static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags)
{
struct task_struct *p;
retry:
@@ -2538,11 +2538,11 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
return false;
}
-static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
{
int node = cpu_to_node(cpu_of(rq));
- return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
+ return scx_consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
}
/**
@@ -2575,8 +2575,8 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
* If dispatching to @rq that @p is already on, no lock dancing needed.
*/
if (rq == src_rq && rq == dst_rq) {
- dispatch_enqueue(sch, rq, dst_dsq, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dst_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2614,13 +2614,13 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
*/
if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
- dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
- enq_flags);
+ scx_dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
+ enq_flags);
} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
p->scx.holding_cpu = -1;
fallback = true;
- dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
- p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
+ scx_dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
+ p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
} else {
move_remote_task_to_local_dsq(p, enq_flags,
src_rq, dst_rq);
@@ -2708,10 +2708,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
goto retry;
case SCX_OPSS_QUEUEING:
/*
- * do_enqueue_task() is in the process of transferring the task
- * to the BPF scheduler while holding @p's rq lock. As we aren't
- * holding any kernel or BPF resource that the enqueue path may
- * depend upon, it's safe to wait.
+ * scx_do_enqueue_task() is in the process of transferring the
+ * task to the BPF scheduler while holding @p's rq lock. As we
+ * aren't holding any kernel or BPF resource that the enqueue
+ * path may depend upon, it's safe to wait.
*/
wait_ops_state(p, opss);
goto retry;
@@ -2724,10 +2724,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
if (dsq->id == SCX_DSQ_LOCAL)
dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
else
- dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
-static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
+static void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
{
struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
u32 u;
@@ -2771,13 +2771,13 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
scx_task_on_sched(sch, prev);
- if (consume_global_dsq(sch, rq))
+ if (scx_consume_global_dsq(sch, rq))
return true;
- if (bypass_dsp_enabled(sch)) {
+ if (scx_bypass_dsp_enabled(sch)) {
/* if @sch is bypassing, only the bypass DSQs are active */
if (scx_bypassing(sch, cpu))
- return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
#ifdef CONFIG_EXT_SUB_SCHED
/*
@@ -2795,7 +2795,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
- consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) {
+ scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
__scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
return true;
}
@@ -2808,8 +2808,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
dspc->rq = rq;
/*
- * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
- * the local DSQ might still end up empty after a successful
+ * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+ * lock, the local DSQ might still end up empty after a successful
* ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
* produced some tasks, retry. The BPF scheduler may depend on this
* looping behavior to simplify its implementation.
@@ -2828,7 +2828,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
rq->scx.sub_dispatch_prev = NULL;
}
- flush_dispatch_buf(sch, rq);
+ scx_flush_dispatch_buf(sch, rq);
if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
rq->scx.flags |= SCX_RQ_BAL_KEEP;
@@ -2836,7 +2836,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
}
if (rq->scx.local_dsq.nr)
return true;
- if (consume_global_dsq(sch, rq))
+ if (scx_consume_global_dsq(sch, rq))
return true;
/*
@@ -2859,8 +2859,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
* queued. Without this fallback, bypassed tasks could stall if the host
* scheduler's ops.dispatch() doesn't yield any tasks.
*/
- if (bypass_dsp_enabled(sch))
- return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+ if (scx_bypass_dsp_enabled(sch))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
return false;
}
@@ -2939,7 +2939,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
* between the IMMED queueing and the subsequent scheduling event.
*/
if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed))
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
return true;
@@ -2955,7 +2955,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
* dispatched. Call ops_dequeue() to notify the BPF scheduler.
*/
ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
}
p->se.exec_start = rq_clock_task(rq);
@@ -3067,10 +3067,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) {
if (p->scx.flags & SCX_TASK_IMMED) {
p->scx.flags |= SCX_TASK_REENQ_PREEMPTED;
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
} else {
- dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
}
goto switch_class;
}
@@ -3088,9 +3088,9 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
WARN_ON_ONCE(sched_cpu_cookie_match(rq, p) &&
!(sch->ops.flags & SCX_OPS_ENQ_LAST));
- do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
} else {
- do_enqueue_task(rq, p, 0, -1);
+ scx_do_enqueue_task(rq, p, 0, -1);
}
}
@@ -3562,7 +3562,7 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args);
if (unlikely(ret)) {
- ret = ops_sanitize_err(sch, "init_task", ret);
+ ret = scx_ops_sanitize_err(sch, "init_task", ret);
return ret;
}
}
@@ -4107,7 +4107,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
if (!local_task_should_reenq(p, &reenq_flags, &reason))
continue;
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
@@ -4119,7 +4119,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
list_del_init(&p->scx.dsq_list.node);
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
nr_enqueued++;
@@ -4234,7 +4234,7 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
p->scx.flags |= reason;
- do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
@@ -4354,7 +4354,7 @@ int scx_tg_online(struct task_group *tg)
ret = SCX_CALL_OP_RET(sch, cgroup_init,
NULL, tg->css.cgroup, &args);
if (ret)
- ret = ops_sanitize_err(sch, "cgroup_init", ret);
+ ret = scx_ops_sanitize_err(sch, "cgroup_init", ret);
}
if (ret == 0)
tg->scx.flags |= SCX_TG_ONLINE | SCX_TG_INITED;
@@ -4422,7 +4422,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
p->scx.cgrp_moving_from = NULL;
}
- return ops_sanitize_err(sch, "cgroup_prep_move", ret);
+ return scx_ops_sanitize_err(sch, "cgroup_prep_move", ret);
}
void scx_cgroup_move_task(struct task_struct *p)
@@ -4700,7 +4700,7 @@ static void destroy_dsq(struct scx_sched *sch, u64 dsq_id)
goto out_unlock_dsq;
/*
- * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+ * Mark dead by invalidating ->id to prevent scx_dispatch_enqueue() from
* queueing more tasks. As this function can be called from anywhere,
* freeing is bounced through an irq work to avoid nesting RCU
* operations inside scheduler locks.
@@ -4928,7 +4928,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
*/
WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
- exit_dsq(bypass_dsq(sch, cpu));
+ exit_dsq(scx_bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
@@ -5239,7 +5239,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
u32 nr_donor_target, u32 nr_donee_target)
{
struct rq *donor_rq = cpu_rq(donor);
- struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor);
+ struct scx_dispatch_q *donor_dsq = scx_bypass_dsq(sch, donor);
struct task_struct *p, *n;
struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0);
s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
@@ -5287,7 +5287,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
if (donee >= nr_cpu_ids)
continue;
- donee_dsq = bypass_dsq(sch, donee);
+ donee_dsq = scx_bypass_dsq(sch, donee);
/*
* $p's rq is not locked but $p's DSQ lock protects its
@@ -5308,7 +5308,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
* between bypass DSQs.
*/
dispatch_dequeue_locked(p, donor_dsq);
- dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
+ scx_dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
/*
* $donee might have been idle and need to be woken up. No need
@@ -5351,7 +5351,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
/* count the target tasks and CPUs */
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+ u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
nr_tasks += nr;
nr_cpus++;
@@ -5373,7 +5373,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
cpumask_clear(donee_mask);
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target)
+ if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) < nr_target)
cpumask_set_cpu(cpu, donee_mask);
}
@@ -5384,7 +5384,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
break;
if (cpumask_test_cpu(cpu, donee_mask))
continue;
- if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
+ if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
continue;
nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask,
@@ -5395,7 +5395,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
resched_cpu(cpu);
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+ u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
after_min = min(nr, after_min);
after_max = max(nr, after_max);
@@ -5421,7 +5421,7 @@ static void scx_bypass_lb_timerfn(struct timer_list *timer)
int node;
u32 intv_us;
- if (!bypass_dsp_enabled(sch))
+ if (!scx_bypass_dsp_enabled(sch))
return;
for_each_node_with_cpus(node)
@@ -5487,9 +5487,9 @@ static void enable_bypass_dsp(struct scx_sched *sch)
* dispatch enabled while a descendant is bypassing, which is all that's
* required.
*
- * bypass_dsp_enabled() test is used to determine whether to enter the
- * bypass dispatch handling path from both bypassing and hosting scheds.
- * Bump enable depth on both @sch and bypass dispatch host.
+ * scx_bypass_dsp_enabled() test is used to determine whether to enter
+ * the bypass dispatch handling path from both bypassing and hosting
+ * scheds. Bump enable depth on both @sch and bypass dispatch host.
*/
ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
WARN_ON_ONCE(ret <= 0);
@@ -5509,7 +5509,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
}
/* may be called without holding scx_bypass_lock */
-static void disable_bypass_dsp(struct scx_sched *sch)
+static void scx_disable_bypass_dsp(struct scx_sched *sch)
{
s32 ret;
@@ -5654,7 +5654,7 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
/* disarming must come after moving all tasks out of the bypass DSQs */
if (!bypass)
- disable_bypass_dsp(sch);
+ scx_disable_bypass_dsp(sch);
unlock:
raw_spin_unlock_irqrestore(&scx_bypass_lock, flags);
}
@@ -6003,7 +6003,7 @@ static void scx_sub_disable(struct scx_sched *sch)
* DSQs for us.
*/
synchronize_rcu_expedited();
- disable_bypass_dsp(sch);
+ scx_disable_bypass_dsp(sch);
scx_unlink_sched(sch);
@@ -6810,7 +6810,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
}
for_each_possible_cpu(cpu) {
- ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ ret = init_dsq(scx_bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
if (ret) {
bypass_fail_cpu = cpu;
goto err_free_pcpu;
@@ -6963,7 +6963,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
for_each_possible_cpu(cpu) {
if (cpu == bypass_fail_cpu)
break;
- exit_dsq(bypass_dsq(sch, cpu));
+ exit_dsq(scx_bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
err_free_pnode:
@@ -7007,7 +7007,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
return 0;
}
-static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
{
/*
* It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
@@ -7170,7 +7170,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
if (sch->ops.init) {
ret = SCX_CALL_OP_RET(sch, init, NULL);
if (ret) {
- ret = ops_sanitize_err(sch, "init", ret);
+ ret = scx_ops_sanitize_err(sch, "init", ret);
cpus_read_unlock();
scx_error(sch, "ops.init() failed (%d)", ret);
goto err_disable;
@@ -7203,7 +7203,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
cpus_read_unlock();
- ret = validate_ops(sch, ops);
+ ret = scx_validate_ops(sch, ops);
if (ret)
goto err_disable;
@@ -7545,7 +7545,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
if (sch->ops.init) {
ret = SCX_CALL_OP_RET(sch, init, NULL);
if (ret) {
- ret = ops_sanitize_err(sch, "init", ret);
+ ret = scx_ops_sanitize_err(sch, "init", ret);
scx_error(sch, "ops.init() failed (%d)", ret);
goto err_disable;
}
@@ -7560,7 +7560,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
if (ret)
goto err_disable;
- if (validate_ops(sch, ops))
+ if (scx_validate_ops(sch, ops))
goto err_disable;
struct scx_sub_attach_args sub_attach_args = {
@@ -7571,7 +7571,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
&sub_attach_args);
if (ret) {
- ret = ops_sanitize_err(sch, "sub_attach", ret);
+ ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
scx_error(sch, "parent rejected (%d)", ret);
goto err_disable;
}
@@ -8830,7 +8830,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
/*
* If the BPF scheduler keeps calling this function repeatedly, it can
- * cause similar live-lock conditions as consume_dispatch_q().
+ * cause similar live-lock conditions as scx_consume_dispatch_q().
*/
if (unlikely(READ_ONCE(sch->aborting)))
return false;
@@ -8991,7 +8991,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
- flush_dispatch_buf(sch, dspc->rq);
+ scx_flush_dispatch_buf(sch, dspc->rq);
dsq = find_user_dsq(sch, dsq_id);
if (unlikely(!dsq)) {
@@ -8999,7 +8999,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
return false;
}
- if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
+ if (scx_consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
/*
* A successfully consumed task can be dequeued before it starts
* running while the CPU is trying to migrate other dispatched
@@ -10683,7 +10683,7 @@ static int __init scx_init(void)
/* @priv tail must align since both share the same data block */
CID_OFFSET_MATCH(priv, priv);
/*
- * cid-form must end exactly at @priv - validate_ops() skips
+ * cid-form must end exactly at @priv - scx_validate_ops() skips
* cpu_acquire/cpu_release for cid-form because reading those fields
* past the BPF allocation would be UB.
*/
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 0256931a379a..743980dc60b0 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1172,7 +1172,7 @@ struct scx_sched {
u64 bypass_timestamp;
s32 bypass_depth;
- /* bypass dispatch path enable state, see bypass_dsp_enabled() */
+ /* bypass dispatch path enable state, see scx_bypass_dsp_enabled() */
unsigned long bypass_dsp_claim;
atomic_t bypass_dsp_enable_depth;
--
2.54.0
* [PATCH v2 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used by the sub.c split
From: Tejun Heo @ 2026-07-01 18:10 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The sub-scheduler implementation is about to move into its own sub.c, from
where it calls a set of ext.c helpers and shares a few ext.c globals. Make
those reachable across the new file boundary ahead of the move.
No functional change.
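As a rough userspace sketch of what "reachable across the file boundary" means here (illustration only, not part of the patch): the definition loses its static qualifier and a matching declaration goes into a header that both files include. The demo_* names and the mask value are hypothetical. The accessors are loosely modelled on scx_get_task_state() and scx_set_task_state() and omit the state-transition checks that the real setter performs.

```c
/* ---- would live in the shared header (internal.h in the patch) ---- */

#define DEMO_TASK_STATE_MASK	0x3u	/* hypothetical mask value */

struct demo_task {
	unsigned int flags;
};

/* Declarations make the functions callable from any includer. */
unsigned int demo_get_task_state(const struct demo_task *p);
void demo_set_task_state(struct demo_task *p, unsigned int state);

/* ---- would live in the defining file (ext.c in the patch) ---- */

/* Previously "static unsigned int ..."; now has external linkage. */
unsigned int demo_get_task_state(const struct demo_task *p)
{
	return p->flags & DEMO_TASK_STATE_MASK;
}

/* Previously "static void ..."; replaces only the state bits. */
void demo_set_task_state(struct demo_task *p, unsigned int state)
{
	p->flags &= ~DEMO_TASK_STATE_MASK;
	p->flags |= state;
}
```

The same idea applies to data: a lock or table defined without static needs an extern declaration in the header, and a struct whose layout the other file must see has to move into the header as a full definition.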
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext/ext.c | 109 ++++++++++++------------------------
kernel/sched/ext/internal.h | 77 +++++++++++++++++++++++++
2 files changed, 113 insertions(+), 73 deletions(-)
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 56e6a13fd0f8..58856a429821 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -20,7 +20,7 @@
#include "arena.h"
#include "idle.h"
-static DEFINE_RAW_SPINLOCK(scx_sched_lock);
+DEFINE_RAW_SPINLOCK(scx_sched_lock);
/*
* NOTE: sched_ext is in the process of growing multiple scheduler support and
@@ -39,14 +39,14 @@ struct scx_sched __rcu *scx_root;
static LIST_HEAD(scx_sched_all);
#ifdef CONFIG_EXT_SUB_SCHED
-static const struct rhashtable_params scx_sched_hash_params = {
+const struct rhashtable_params scx_sched_hash_params = {
.key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id),
.key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id),
.head_offset = offsetof(struct scx_sched, hash_node),
.insecure_elasticity = true, /* inserted under scx_sched_lock */
};
-static struct rhashtable scx_sched_hash;
+struct rhashtable scx_sched_hash;
#endif
/* see SCX_OPS_TID_TO_TASK */
@@ -68,9 +68,9 @@ static DEFINE_RAW_SPINLOCK(scx_tasks_lock);
static LIST_HEAD(scx_tasks);
/* ops enable/disable */
-static DEFINE_MUTEX(scx_enable_mutex);
+DEFINE_MUTEX(scx_enable_mutex);
DEFINE_STATIC_KEY_FALSE(__scx_enabled);
-DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+DEFINE_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
static DEFINE_RAW_SPINLOCK(scx_bypass_lock);
static bool scx_init_task_enabled;
@@ -101,7 +101,7 @@ static atomic64_t scx_tid_cursor = ATOMIC64_INIT(1);
* tasks for the sub-sched being enabled. Use a global variable instead of a
* per-task field as all enables are serialized.
*/
-static struct scx_sched *scx_enabling_sub_sched;
+struct scx_sched *scx_enabling_sub_sched;
#else
#define scx_enabling_sub_sched (struct scx_sched *)NULL
#endif /* CONFIG_EXT_SUB_SCHED */
@@ -676,12 +676,12 @@ struct bpf_iter_scx_dsq {
} __attribute__((aligned(8)));
-static u32 scx_get_task_state(const struct task_struct *p)
+u32 scx_get_task_state(const struct task_struct *p)
{
return p->scx.flags & SCX_TASK_STATE_MASK;
}
-static void scx_set_task_state(struct task_struct *p, u32 state)
+void scx_set_task_state(struct task_struct *p, u32 state)
{
u32 prev_state = scx_get_task_state(p);
bool warn = false;
@@ -721,23 +721,6 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
p->scx.flags |= state;
}
-/*
- * SCX task iterator.
- */
-struct scx_task_iter {
- struct sched_ext_entity cursor;
- struct task_struct *locked_task;
- struct rq *rq;
- struct rq_flags rf;
- u32 cnt;
- bool list_locked;
-#ifdef CONFIG_EXT_SUB_SCHED
- struct cgroup *cgrp;
- struct cgroup_subsys_state *css_pos;
- struct css_task_iter css_iter;
-#endif
-};
-
/**
* scx_task_iter_start - Lock scx_tasks_lock and start a task iteration
* @iter: iterator to init
@@ -766,7 +749,7 @@ struct scx_task_iter {
* All tasks which existed when the iteration started are guaranteed to be
* visited as long as they are not dead.
*/
-static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
{
memset(iter, 0, sizeof(*iter));
@@ -805,7 +788,7 @@ static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
* This function can be safely called anytime during an iteration. The next
* iterator operation will automatically restore the necessary locking.
*/
-static void scx_task_iter_unlock(struct scx_task_iter *iter)
+void scx_task_iter_unlock(struct scx_task_iter *iter)
{
__scx_task_iter_rq_unlock(iter);
if (iter->list_locked) {
@@ -848,7 +831,7 @@ static void scx_task_iter_relock(struct scx_task_iter *iter,
* which is released on return. If the iterator holds a task's rq lock, that rq
* lock is also released. See scx_task_iter_start() for details.
*/
-static void scx_task_iter_stop(struct scx_task_iter *iter)
+void scx_task_iter_stop(struct scx_task_iter *iter)
{
#ifdef CONFIG_EXT_SUB_SCHED
if (iter->cgrp) {
@@ -923,7 +906,7 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
* whether they would like to filter out dead tasks. See scx_task_iter_start()
* for details.
*/
-static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
{
struct task_struct *p;
@@ -1186,8 +1169,8 @@ static void schedule_deferred_locked(struct rq *rq)
schedule_deferred(rq);
}
-static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
- u64 reenq_flags, struct rq *locked_rq)
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+ u64 reenq_flags, struct rq *locked_rq)
{
struct rq *rq;
@@ -2491,8 +2474,8 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
return dst_rq;
}
-static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, u64 enq_flags)
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags)
{
struct task_struct *p;
retry:
@@ -2538,7 +2521,7 @@ static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
return false;
}
-static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
{
int node = cpu_to_node(cpu_of(rq));
@@ -3548,7 +3531,7 @@ static struct cgroup *tg_cgrp(struct task_group *tg)
#endif /* CONFIG_EXT_GROUP_SCHED */
-static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
{
int ret;
@@ -3631,7 +3614,7 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight);
}
-static void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
{
__scx_enable_task(sch, p);
scx_set_task_state(p, SCX_TASK_ENABLED);
@@ -3665,8 +3648,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
}
-static void __scx_disable_and_exit_task(struct scx_sched *sch,
- struct task_struct *p)
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
{
struct scx_exit_task_args args = {
.cancelled = false,
@@ -3700,7 +3682,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch,
* ran. The task state has not been transitioned, so this mirrors the
* SCX_TASK_INIT branch in __scx_disable_and_exit_task().
*/
-static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
{
struct scx_exit_task_args args = { .cancelled = true };
@@ -3711,8 +3693,7 @@ static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *
SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args);
}
-static void scx_disable_and_exit_task(struct scx_sched *sch,
- struct task_struct *p)
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
{
__scx_disable_and_exit_task(sch, p);
@@ -4525,7 +4506,7 @@ static struct cgroup *root_cgroup(void)
return &cgrp_dfl_root.cgrp;
}
-static void scx_cgroup_lock(void)
+void scx_cgroup_lock(void)
{
#ifdef CONFIG_EXT_GROUP_SCHED
percpu_down_write(&scx_cgroup_ops_rwsem);
@@ -4533,7 +4514,7 @@ static void scx_cgroup_lock(void)
cgroup_lock();
}
-static void scx_cgroup_unlock(void)
+void scx_cgroup_unlock(void)
{
cgroup_unlock();
#ifdef CONFIG_EXT_GROUP_SCHED
@@ -4851,7 +4832,7 @@ static void free_exit_info(struct scx_exit_info *ei);
static const char *scx_exit_reason(enum scx_exit_kind kind);
static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind);
-static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
{
size_t size = struct_size_t(struct scx_cmask, bits,
SCX_CMASK_NR_WORDS(num_possible_cpus()));
@@ -5509,7 +5490,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
}
/* may be called without holding scx_bypass_lock */
-static void scx_disable_bypass_dsp(struct scx_sched *sch)
+void scx_disable_bypass_dsp(struct scx_sched *sch)
{
s32 ret;
@@ -5557,7 +5538,7 @@ static void scx_disable_bypass_dsp(struct scx_sched *sch)
*
* - scx_prio_less() reverts to the default core_sched_at order.
*/
-static void scx_bypass(struct scx_sched *sch, bool bypass)
+void scx_bypass(struct scx_sched *sch, bool bypass)
{
struct scx_sched *pos;
unsigned long flags;
@@ -5746,7 +5727,7 @@ static void refresh_watchdog(void)
cancel_delayed_work_sync(&scx_watchdog_work);
}
-static s32 scx_link_sched(struct scx_sched *sch)
+s32 scx_link_sched(struct scx_sched *sch)
{
const char *err_msg = "";
s32 ret = 0;
@@ -5795,7 +5776,7 @@ static s32 scx_link_sched(struct scx_sched *sch)
return 0;
}
-static void scx_unlink_sched(struct scx_sched *sch)
+void scx_unlink_sched(struct scx_sched *sch)
{
scoped_guard(raw_spinlock_irq, &scx_sched_lock) {
#ifdef CONFIG_EXT_SUB_SCHED
@@ -5816,13 +5797,13 @@ static void scx_unlink_sched(struct scx_sched *sch)
* @sch. Once @sch becomes empty during disable, there's no point in dumping it.
* This prevents calling dump ops on a dead sch.
*/
-static void scx_disable_dump(struct scx_sched *sch)
+void scx_disable_dump(struct scx_sched *sch)
{
guard(raw_spinlock_irqsave)(&scx_dump_lock);
sch->dump_disabled = true;
}
-static void scx_log_sched_disable(struct scx_sched *sch)
+void scx_log_sched_disable(struct scx_sched *sch)
{
struct scx_exit_info *ei = sch->exit_info;
const char *type = scx_parent(sch) ? "sub-scheduler" : "scheduler";
@@ -6285,7 +6266,7 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
* as a noop. Syncing the irq_work first is required to guarantee the
* kthread work has been queued before waiting for it.
*/
-static void scx_flush_disable_work(struct scx_sched *sch)
+void scx_flush_disable_work(struct scx_sched *sch)
{
int kind;
@@ -6739,31 +6720,13 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
return pnode;
}
-/*
- * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
- * starvation. During the READY -> ENABLED task switching loop, the calling
- * thread's sched_class gets switched from fair to ext. As fair has higher
- * priority than ext, the calling thread can be indefinitely starved under
- * fair-class saturation, leading to a system hang.
- */
-struct scx_enable_cmd {
- struct kthread_work work;
- union {
- struct sched_ext_ops *ops;
- struct sched_ext_ops_cid *ops_cid;
- };
- bool is_cid_type;
- struct bpf_map *arena_map; /* arena ref to transfer to sch */
- int ret;
-};
-
/*
* Allocate and initialize a new scx_sched. @cgrp's reference is always
* consumed whether the function succeeds or fails.
*/
-static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
- struct cgroup *cgrp,
- struct scx_sched *parent)
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+ struct cgroup *cgrp,
+ struct scx_sched *parent)
{
struct sched_ext_ops *ops = cmd->ops;
struct scx_sched *sch;
@@ -7007,7 +6970,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
return 0;
}
-static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
{
/*
* It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 743980dc60b0..53a8aec8652e 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1535,6 +1535,41 @@ enum scx_ops_state {
#define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1)
#define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK)
+/*
+ * SCX task iterator.
+ */
+struct scx_task_iter {
+ struct sched_ext_entity cursor;
+ struct task_struct *locked_task;
+ struct rq *rq;
+ struct rq_flags rf;
+ u32 cnt;
+ bool list_locked;
+#ifdef CONFIG_EXT_SUB_SCHED
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css_pos;
+ struct css_task_iter css_iter;
+#endif
+};
+
+/*
+ * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
+ * starvation. During the READY -> ENABLED task switching loop, the calling
+ * thread's sched_class gets switched from fair to ext. As fair has higher
+ * priority than ext, the calling thread can be indefinitely starved under
+ * fair-class saturation, leading to a system hang.
+ */
+struct scx_enable_cmd {
+ struct kthread_work work;
+ union {
+ struct sched_ext_ops *ops;
+ struct sched_ext_ops_cid *ops_cid;
+ };
+ bool is_cid_type;
+ struct bpf_map *arena_map; /* arena ref to transfer to sch */
+ int ret;
+};
+
extern struct scx_sched __rcu *scx_root;
DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
@@ -1555,6 +1590,48 @@ __printf(5, 0) bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
__printf(5, 6) bool __scx_exit(struct scx_sched *sch, enum scx_exit_kind kind,
s64 exit_code, s32 exit_cpu, const char *fmt, ...);
+u32 scx_get_task_state(const struct task_struct *p);
+void scx_set_task_state(struct task_struct *p, u32 state);
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp);
+void scx_task_iter_unlock(struct scx_task_iter *iter);
+void scx_task_iter_stop(struct scx_task_iter *iter);
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter);
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags);
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq);
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+ u64 reenq_flags, struct rq *locked_rq);
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork);
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p);
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p);
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
+void scx_cgroup_lock(void);
+void scx_cgroup_unlock(void);
+#endif
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch);
+void scx_disable_bypass_dsp(struct scx_sched *sch);
+void scx_bypass(struct scx_sched *sch, bool bypass);
+s32 scx_link_sched(struct scx_sched *sch);
+void scx_unlink_sched(struct scx_sched *sch);
+void scx_disable_dump(struct scx_sched *sch);
+void scx_log_sched_disable(struct scx_sched *sch);
+void scx_flush_disable_work(struct scx_sched *sch);
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+ struct cgroup *cgrp,
+ struct scx_sched *parent);
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
+
+extern raw_spinlock_t scx_sched_lock;
+extern struct mutex scx_enable_mutex;
+extern struct percpu_rw_semaphore scx_fork_rwsem;
+#ifdef CONFIG_EXT_SUB_SCHED
+extern const struct rhashtable_params scx_sched_hash_params;
+extern struct rhashtable scx_sched_hash;
+extern struct scx_sched *scx_enabling_sub_sched;
+#endif
+
#define scx_exit(sch, kind, exit_code, fmt, args...) \
__scx_exit(sch, kind, exit_code, raw_smp_processor_id(), fmt, ##args)
#define scx_error(sch, fmt, args...) \
--
2.54.0
* [PATCH v2 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across the sub.c split
2026-07-01 18:10 [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
@ 2026-07-01 18:10 ` Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 19:43 ` [PATCHSET v2 sched_ext/for-7.3] " Andrea Righi
4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 18:10 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The following trivial helpers in ext.c are called from both ext.c and the
sub-scheduler code. Define them as static inline in internal.h.
- scx_bypass_dsq()
- scx_bypass_dsp_enabled()
- scx_ops_sanitize_err()
- scx_schedule_reenq_local()
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext/ext.c | 57 -------------------------------------
kernel/sched/ext/internal.h | 57 +++++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+), 57 deletions(-)
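Reader note (not part of the commit): of the four helpers, only
scx_ops_sanitize_err() carries real logic, and its range check is easiest to
see in isolation. Below is a minimal userspace sketch, not kernel code.
model_is_err(), model_err_ptr(), sanitize_err() and self_check() are
hypothetical stand-ins for IS_ERR(), ERR_PTR() and the helper itself, and the
scx_error() call is omitted. MAX_ERRNO is 4095 as in include/linux/err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO 4095	/* same value as include/linux/err.h */

/* userspace model of IS_ERR(): only the top MAX_ERRNO addresses are errors */
static bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* userspace model of ERR_PTR(): encode a -errno value as a pointer */
static void *model_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* mirrors the range check in scx_ops_sanitize_err(), minus scx_error() */
static int sanitize_err(int err)
{
	if (err < 0 && err >= -MAX_ERRNO)
		return err;
	return -EPROTO;
}

/* exercise both the pass-through and the rewrite path */
static void self_check(void)
{
	/* a valid -errno survives and is still recognized after encoding */
	assert(sanitize_err(-ENOMEM) == -ENOMEM);
	assert(model_is_err(model_err_ptr(-ENOMEM)));

	/* a rogue value would look like a normal pointer, so it is replaced */
	assert(!model_is_err(model_err_ptr(-100000)));
	assert(sanitize_err(-100000) == -EPROTO);
}
```

Calling self_check() runs both cases: -ENOMEM passes through unchanged, while
-100000 would slip past model_is_err() once encoded and is therefore rewritten
to -EPROTO before it can travel up the call chain.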
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 58856a429821..f48d15ecd736 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -369,11 +369,6 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
return __setscheduler_class(p->policy, p->prio);
}
-static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
-{
- return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
-}
-
static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu)
{
#ifdef CONFIG_EXT_SUB_SCHED
@@ -395,26 +390,6 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
return scx_bypass_dsq(sch, cpu);
}
-/**
- * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
- * @sch: scheduler to check
- *
- * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
- * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
- * are bypassing. In the former case, the ancestor is not itself bypassing but
- * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
- * the ancestor's bypass dispatch path must be active even though its own
- * bypass_depth remains zero.
- *
- * This function checks bypass_dsp_enable_depth which is managed separately from
- * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * scx_disable_bypass_dsp().
- */
-static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
-{
- return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
-}
-
/**
* rq_is_open - Is the rq available for immediate execution of an SCX task?
* @rq: rq to test
@@ -1061,28 +1036,6 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
}
}
-/**
- * scx_ops_sanitize_err - Sanitize a -errno value
- * @sch: scx_sched to error out on error
- * @ops_name: operation to blame on failure
- * @err: -errno value to sanitize
- *
- * Verify @err is a valid -errno. If not, trigger scx_error() and return
- * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
- * cause misbehaviors. For an example, a large negative return from
- * ops.init_task() triggers an oops when passed up the call chain because the
- * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
- * handled as a pointer.
- */
-static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
-{
- if (err < 0 && err >= -MAX_ERRNO)
- return err;
-
- scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
- return -EPROTO;
-}
-
static void deferred_bal_cb_workfn(struct rq *rq)
{
run_deferred(rq);
@@ -1234,16 +1187,6 @@ void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
schedule_deferred(rq);
}
-static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
-{
- struct scx_sched *root = rcu_dereference_sched(scx_root);
-
- if (WARN_ON_ONCE(!root))
- return;
-
- schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
-}
-
/**
* touch_core_sched - Update timestamp used for core-sched task ordering
* @rq: rq to read clock from, must be locked
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 53a8aec8652e..5d861cb0727d 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1637,6 +1637,63 @@ extern struct scx_sched *scx_enabling_sub_sched;
#define scx_error(sch, fmt, args...) \
scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
+static inline struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
+{
+ return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
+}
+
+/**
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * @sch: scheduler to check
+ *
+ * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
+ * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
+ * are bypassing. In the former case, the ancestor is not itself bypassing but
+ * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
+ * the ancestor's bypass dispatch path must be active even though its own
+ * bypass_depth remains zero.
+ *
+ * This function checks bypass_dsp_enable_depth which is managed separately from
+ * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
+ * scx_disable_bypass_dsp().
+ */
+static inline bool scx_bypass_dsp_enabled(struct scx_sched *sch)
+{
+ return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
+}
+
+/**
+ * scx_ops_sanitize_err - Sanitize a -errno value
+ * @sch: scx_sched to error out on error
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For an example, a large negative return from
+ * ops.init_task() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static inline int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+{
+ if (err < 0 && err >= -MAX_ERRNO)
+ return err;
+
+ scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
+ return -EPROTO;
+}
+
+static inline void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+{
+ struct scx_sched *root = rcu_dereference_sched(scx_root);
+
+ if (WARN_ON_ONCE(!root))
+ return;
+
+ schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
+}
+
/*
* Return the rq currently locked from an scx callback, or NULL if no rq is
* locked.
--
2.54.0
* [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 18:10 [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
` (2 preceding siblings ...)
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
@ 2026-07-01 18:10 ` Tejun Heo
2026-07-01 18:34 ` sashiko-bot
2026-07-01 19:43 ` [PATCHSET v2 sched_ext/for-7.3] " Andrea Righi
4 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 18:10 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The sub-scheduler implementation has grown and will continue to expand. Move
the sub-scheduler functions from ext.c into a new kernel/sched/ext/sub.c.
sub.h holds the prototypes and the !CONFIG_EXT_SUB_SCHED no-op stubs.
scx_dispatch_sched() is shared: balance_one() in ext.c and the
scx_bpf_sub_dispatch() kfunc in sub.c both call it, and the latter re-enters
it as sub-scheduler dispatch nests. It moves into sub.h as a static
__always_inline so both callers keep it inlined and per-level stack stays
bounded across the recursion. The event macros it uses move to internal.h.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
v2: Fold the scx_dispatch_sched() sub.h promotion into this patch (was a
separate later patch in v1) so the split is self-contained (Andrea).
kernel/sched/build_policy.c | 2 +
kernel/sched/ext/ext.c | 811 +-----------------------------------
kernel/sched/ext/internal.h | 28 ++
kernel/sched/ext/sub.c | 668 +++++++++++++++++++++++++++++
kernel/sched/ext/sub.h | 161 +++++++
5 files changed, 860 insertions(+), 810 deletions(-)
create mode 100644 kernel/sched/ext/sub.c
create mode 100644 kernel/sched/ext/sub.h
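Reader note (not part of the commit): one of the helpers moved by this patch,
scx_next_descendant_pre(), implements a stackless pre-order walk that never
leaves the subtree of the starting sched. The following is a rough userspace
sketch of that walk, under simplifying assumptions: struct node, add_child(),
next_descendant_pre(), for_each_descendant_pre() and walk_ids() are
hypothetical names, children are kept on a singly linked list instead of the
kernel's list_head, and the locking asserted by the real helper is left out.

```c
#include <assert.h>
#include <stddef.h>

/* hypothetical stand-in for struct scx_sched: only the tree linkage */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
	int id;
};

/* append @child as the last child of @parent */
static void add_child(struct node *parent, struct node *child)
{
	struct node **link = &parent->first_child;

	while (*link)
		link = &(*link)->next_sibling;
	*link = child;
	child->parent = parent;
}

/* return the node to visit after @pos, or NULL when the walk is done */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	/* first iteration, visit @root */
	if (!pos)
		return root;

	/* visit the first child if it exists */
	if (pos->first_child)
		return pos->first_child;

	/* no child: climb until a next sibling exists, never past @root */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}

	return NULL;
}

#define for_each_descendant_pre(pos, root)				\
	for ((pos) = next_descendant_pre(NULL, (root)); (pos);		\
	     (pos) = next_descendant_pre((pos), (root)))

/* record the visit order of ids into @out, return the number recorded */
static size_t walk_ids(struct node *root, int *out, size_t max)
{
	struct node *pos;
	size_t n = 0;

	for_each_descendant_pre(pos, root) {
		if (n < max)
			out[n++] = pos->id;
	}
	return n;
}
```

With a tree where node 1 has children 2 and 3, and node 2 has children 4 and
5, walking from node 1 visits 1, 2, 4, 5, 3, while walking from node 2 visits
only 2, 4, 5. The `pos != root` bound is what keeps the second walk from
escaping to node 3.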
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d74b54f81992..01dc7bf89af8 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -66,10 +66,12 @@
# include "ext/cid.h"
# include "ext/arena.h"
# include "ext/idle.h"
+# include "ext/sub.h"
# include "ext/ext.c"
# include "ext/cid.c"
# include "ext/arena.c"
# include "ext/idle.c"
+# include "ext/sub.c"
#endif
#include "syscalls.c"
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index f48d15ecd736..cd18ca6c8a59 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -19,6 +19,7 @@
#include "cid.h"
#include "arena.h"
#include "idle.h"
+#include "sub.h"
DEFINE_RAW_SPINLOCK(scx_sched_lock);
@@ -272,58 +273,6 @@ static bool u32_before(u32 a, u32 b)
return (s32)(a - b) < 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_next_descendant_pre - find the next descendant for pre-order walk
- * @pos: the current position (%NULL to initiate traversal)
- * @root: sched whose descendants to walk
- *
- * To be used by scx_for_each_descendant_pre(). Find the next descendant to
- * visit for pre-order traversal of @root's descendants. @root is included in
- * the iteration and the first node to be visited.
- */
-static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos,
- struct scx_sched *root)
-{
- struct scx_sched *next;
-
- lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
- lockdep_is_held(&scx_sched_lock));
-
- /* if first iteration, visit @root */
- if (!pos)
- return root;
-
- /* visit the first child if exists */
- next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
- if (next)
- return next;
-
- /* no child, visit my or the closest ancestor's next sibling */
- while (pos != root) {
- if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
- return list_next_entry(pos, sibling);
- pos = scx_parent(pos);
- }
-
- return NULL;
-}
-
-static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
-{
- return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
- scx_sched_hash_params);
-}
-
-static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
-{
- rcu_assign_pointer(p->scx.sched, sch);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
-static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
/**
* scx_is_descendant - Test whether sched is a descendant
* @sch: sched to test
@@ -338,19 +287,6 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
return sch->ancestors[ancestor->level] == ancestor;
}
-/**
- * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
- * @pos: iteration cursor
- * @root: sched to walk the descendants of
- *
- * Walk @root's descendants. @root is included in the iteration and the first
- * node to be visited. Must be called with either scx_enable_mutex or
- * scx_sched_lock held.
- */
-#define scx_for_each_descendant_pre(pos, root) \
- for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
- (pos) = scx_next_descendant_pre((pos), (root)))
-
static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu)
{
return &sch->pnode[cpu_to_node(cpu)]->global_dsq;
@@ -936,32 +872,6 @@ struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
return NULL;
}
-/**
- * scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This can be used when preemption is not disabled.
- */
-#define scx_add_event(sch, name, cnt) do { \
- this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
- trace_sched_ext_event(#name, (cnt)); \
-} while(0)
-
-/**
- * __scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This should be used only when preemption is disabled.
- */
-#define __scx_add_event(sch, name, cnt) do { \
- __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
- trace_sched_ext_event(#name, cnt); \
-} while(0)
-
/**
* scx_dump_event - Dump an event 'kind' in 'events' to 's'
* @s: output seq_buf
@@ -2682,115 +2592,6 @@ static inline void maybe_queue_balance_callback(struct rq *rq)
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
}
-/*
- * One user of this function is scx_bpf_dispatch() which can be called
- * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
- * from the call frame.
- */
-static __always_inline bool
-scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
- struct task_struct *prev, bool nested)
-{
- struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
- int nr_loops = SCX_DSP_MAX_LOOPS;
- s32 cpu = cpu_of(rq);
- bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
- scx_task_on_sched(sch, prev);
-
- if (scx_consume_global_dsq(sch, rq))
- return true;
-
- if (scx_bypass_dsp_enabled(sch)) {
- /* if @sch is bypassing, only the bypass DSQs are active */
- if (scx_bypassing(sch, cpu))
- return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
-#ifdef CONFIG_EXT_SUB_SCHED
- /*
- * If @sch isn't bypassing but its children are, @sch is
- * responsible for making forward progress for both its own
- * tasks that aren't bypassing and the bypassing descendants'
- * tasks. The following implements a simple built-in behavior -
- * let each CPU try to run the bypass DSQ every Nth time.
- *
- * Later, if necessary, we can add an ops flag to suppress the
- * auto-consumption and a kfunc to consume the bypass DSQ and,
- * so that the BPF scheduler can fully control scheduling of
- * bypassed tasks.
- */
- struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
-
- if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
- scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
- __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
- return true;
- }
-#endif /* CONFIG_EXT_SUB_SCHED */
- }
-
- if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
- return false;
-
- dspc->rq = rq;
-
- /*
- * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
- * lock, the local DSQ might still end up empty after a successful
- * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
- * produced some tasks, retry. The BPF scheduler may depend on this
- * looping behavior to simplify its implementation.
- */
- do {
- dspc->nr_tasks = 0;
-
- if (nested) {
- SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
- prev_on_sch ? prev : NULL);
- } else {
- /* stash @prev so that nested invocations can access it */
- rq->scx.sub_dispatch_prev = prev;
- SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
- prev_on_sch ? prev : NULL);
- rq->scx.sub_dispatch_prev = NULL;
- }
-
- scx_flush_dispatch_buf(sch, rq);
-
- if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
- rq->scx.flags |= SCX_RQ_BAL_KEEP;
- return true;
- }
- if (rq->scx.local_dsq.nr)
- return true;
- if (scx_consume_global_dsq(sch, rq))
- return true;
-
- /*
- * ops.dispatch() can trap us in this loop by repeatedly
- * dispatching ineligible tasks. Break out once in a while to
- * allow the watchdog to run. As IRQ can't be enabled in
- * balance(), we want to complete this scheduling cycle and then
- * start a new one. IOW, we want to call resched_curr() on the
- * next, most likely idle, task, not the current one. Use
- * __scx_bpf_kick_cpu() for deferred kicking.
- */
- if (unlikely(!--nr_loops)) {
- scx_kick_cpu(sch, cpu, 0);
- break;
- }
- } while (dspc->nr_tasks);
-
- /*
- * Prevent the CPU from going idle while bypassed descendants have tasks
- * queued. Without this fallback, bypassed tasks could stall if the host
- * scheduler's ops.dispatch() doesn't yield any tasks.
- */
- if (scx_bypass_dsp_enabled(sch))
- return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
- return false;
-}
-
static int balance_one(struct rq *rq, struct task_struct *prev)
{
struct scx_sched *sch = scx_root;
@@ -4470,26 +4271,6 @@ static inline void scx_cgroup_lock(void) {}
static inline void scx_cgroup_unlock(void) {}
#endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */
-#ifdef CONFIG_EXT_SUB_SCHED
-static struct cgroup *sch_cgroup(struct scx_sched *sch)
-{
- return sch->cgrp;
-}
-
-/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
-static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
-{
- struct cgroup *pos;
- struct cgroup_subsys_state *css;
-
- cgroup_for_each_live_descendant_pre(pos, css, cgrp)
- rcu_assign_pointer(pos->scx_sched, sch);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
-static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
/*
* Omitted operations:
*
@@ -5766,202 +5547,6 @@ void scx_log_sched_disable(struct scx_sched *sch)
}
}
-#ifdef CONFIG_EXT_SUB_SCHED
-static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
-
-static void drain_descendants(struct scx_sched *sch)
-{
- /*
- * Child scheds that finished the critical part of disabling will take
- * themselves off @sch->children. Wait for it to drain. As propagation
- * is recursive, empty @sch->children means that all proper descendant
- * scheds reached unlinking stage.
- */
- wait_event(scx_unlink_waitq, list_empty(&sch->children));
-}
-
-static void scx_fail_parent(struct scx_sched *sch,
- struct task_struct *failed, s32 fail_code)
-{
- struct scx_sched *parent = scx_parent(sch);
- struct scx_task_iter sti;
- struct task_struct *p;
-
- scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
- fail_code, failed->comm, failed->pid);
-
- /*
- * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
- * it. This may cause downstream failures on the BPF side but $parent is
- * dying anyway.
- */
- scx_bypass(parent, true);
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- if (scx_task_on_sched(parent, p))
- continue;
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- scx_disable_and_exit_task(sch, p);
- scx_set_task_sched(p, parent);
- }
- }
- scx_task_iter_stop(&sti);
-}
-
-static void scx_sub_disable(struct scx_sched *sch)
-{
- struct scx_sched *parent = scx_parent(sch);
- struct scx_task_iter sti;
- struct task_struct *p;
- int ret;
-
- /*
- * Guarantee forward progress and wait for descendants to be disabled.
- * To limit disruptions, $parent is not bypassed. Tasks are fully
- * prepped and then inserted back into $parent.
- */
- scx_bypass(sch, true);
- drain_descendants(sch);
-
- /*
- * Here, every runnable task is guaranteed to make forward progress and
- * we can safely use blocking synchronization constructs. Actually
- * disable ops.
- */
- mutex_lock(&scx_enable_mutex);
- percpu_down_write(&scx_fork_rwsem);
- scx_cgroup_lock();
-
- set_cgroup_sched(sch_cgroup(sch), parent);
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- struct rq *rq;
- struct rq_flags rf;
-
- /* filter out duplicate visits */
- if (scx_task_on_sched(parent, p))
- continue;
-
- /*
- * By the time control reaches here, all descendant schedulers
- * should already have been disabled.
- */
- WARN_ON_ONCE(!scx_task_on_sched(sch, p));
-
- /*
- * @p is pinned by the iter: css_task_iter_next() takes a
- * reference and holds it until the next iter_next() call, so
- * @p->usage is guaranteed > 0.
- */
- get_task_struct(p);
-
- scx_task_iter_unlock(&sti);
-
- /*
- * $p is READY or ENABLED on @sch. Initialize for $parent,
- * disable and exit from @sch, and then switch over to $parent.
- *
- * If a task fails to initialize for $parent, the only available
- * action is disabling $parent too. While this allows disabling
- * of a child sched to cause the parent scheduler to fail, the
- * failure can only originate from ops.init_task() of the
- * parent. A child can't directly affect the parent through its
- * own failures.
- */
- ret = __scx_init_task(parent, p, false);
- if (ret) {
- scx_fail_parent(sch, p, ret);
- put_task_struct(p);
- break;
- }
-
- rq = task_rq_lock(p, &rf);
-
- if (scx_get_task_state(p) == SCX_TASK_DEAD) {
- /*
- * sched_ext_dead() raced us between __scx_init_task()
- * and this rq lock and ran exit_task() on @sch (the
- * sched @p was on at that point), not on $parent.
- * $parent's just-completed init is owed an exit_task()
- * and we issue it here.
- */
- scx_sub_init_cancel_task(parent, p);
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- continue;
- }
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- /*
- * $p is initialized for $parent and still attached to
- * @sch. Disable and exit for @sch, switch over to
- * $parent, override the state to READY to account for
- * $p having already been initialized, and then enable.
- */
- scx_disable_and_exit_task(sch, p);
- scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
- scx_set_task_state(p, SCX_TASK_INIT);
- scx_set_task_sched(p, parent);
- scx_set_task_state(p, SCX_TASK_READY);
- scx_enable_task(parent, p);
- }
-
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- }
- scx_task_iter_stop(&sti);
-
- scx_disable_dump(sch);
-
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-
- /*
- * All tasks are moved off of @sch but there may still be on-going
- * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
- * the expedited version as ancestors may be waiting in bypass mode.
- * Also, tell the parent that there is no need to keep running bypass
- * DSQs for us.
- */
- synchronize_rcu_expedited();
- scx_disable_bypass_dsp(sch);
-
- scx_unlink_sched(sch);
-
- mutex_unlock(&scx_enable_mutex);
-
- /*
- * @sch is now unlinked from the parent's children list. Notify and call
- * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
- * after unlinking and releasing all locks. See scx_claim_exit().
- */
- wake_up_all(&scx_unlink_waitq);
-
- if (parent->ops.sub_detach && sch->sub_attached) {
- struct scx_sub_detach_args sub_detach_args = {
- .ops = &sch->ops,
- .cgroup_path = sch->cgrp_path,
- };
- SCX_CALL_OP(parent, sub_detach, NULL,
- &sub_detach_args);
- }
-
- scx_log_sched_disable(sch);
-
- if (sch->ops.exit)
- SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
- if (sch->sub_kset)
- kobject_del(&sch->sub_kset->kobj);
- kobject_del(&sch->kobj);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline void drain_descendants(struct scx_sched *sch) { }
-static inline void scx_sub_disable(struct scx_sched *sch) { }
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static void scx_root_disable(struct scx_sched *sch)
{
struct scx_task_iter sti;
@@ -7351,347 +6936,6 @@ static void scx_root_enable_workfn(struct kthread_work *work)
cmd->ret = 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/* verify that a scheduler can be attached to @cgrp and return the parent */
-static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
-{
- struct scx_sched *parent = cgrp->scx_sched;
- struct scx_sched *pos;
-
- lockdep_assert_held(&scx_sched_lock);
-
- /* can't attach twice to the same cgroup */
- if (parent->cgrp == cgrp)
- return ERR_PTR(-EBUSY);
-
- /* does $parent allow sub-scheds? */
- if (!parent->ops.sub_attach)
- return ERR_PTR(-EOPNOTSUPP);
-
- /* can't insert between $parent and its exiting children */
- list_for_each_entry(pos, &parent->children, sibling)
- if (cgroup_is_descendant(pos->cgrp, cgrp))
- return ERR_PTR(-EBUSY);
-
- return parent;
-}
-
-static bool assert_task_ready_or_enabled(struct task_struct *p)
-{
- u32 state = scx_get_task_state(p);
-
- switch (state) {
- case SCX_TASK_READY:
- case SCX_TASK_ENABLED:
- return true;
- default:
- WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
- state, p->comm, p->pid);
- return false;
- }
-}
-
-static void scx_sub_enable_workfn(struct kthread_work *work)
-{
- struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
- struct sched_ext_ops *ops = cmd->ops;
- struct cgroup *cgrp;
- struct scx_sched *parent, *sch;
- struct scx_task_iter sti;
- struct task_struct *p;
- s32 i, ret;
-
- mutex_lock(&scx_enable_mutex);
-
- if (!scx_enabled()) {
- ret = -ENODEV;
- goto out_unlock;
- }
-
- /* See scx_root_enable_workfn() for the @ops->priv check. */
- if (rcu_access_pointer(ops->priv)) {
- ret = -EBUSY;
- goto out_unlock;
- }
-
- cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
- if (IS_ERR(cgrp)) {
- ret = PTR_ERR(cgrp);
- goto out_unlock;
- }
-
- raw_spin_lock_irq(&scx_sched_lock);
- parent = find_parent_sched(cgrp);
- if (IS_ERR(parent)) {
- raw_spin_unlock_irq(&scx_sched_lock);
- ret = PTR_ERR(parent);
- goto out_put_cgrp;
- }
- kobject_get(&parent->kobj);
- raw_spin_unlock_irq(&scx_sched_lock);
-
- /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
- sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
- kobject_put(&parent->kobj);
- if (IS_ERR(sch)) {
- ret = PTR_ERR(sch);
- goto out_unlock;
- }
-
- ret = scx_link_sched(sch);
- if (ret)
- goto err_disable;
-
- if (sch->level >= SCX_SUB_MAX_DEPTH) {
- scx_error(sch, "max nesting depth %d violated",
- SCX_SUB_MAX_DEPTH);
- goto err_disable;
- }
-
- if (sch->ops.init) {
- ret = SCX_CALL_OP_RET(sch, init, NULL);
- if (ret) {
- ret = scx_ops_sanitize_err(sch, "init", ret);
- scx_error(sch, "ops.init() failed (%d)", ret);
- goto err_disable;
- }
- sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
- }
-
- ret = scx_arena_pool_init(sch);
- if (ret)
- goto err_disable;
-
- ret = scx_set_cmask_scratch_alloc(sch);
- if (ret)
- goto err_disable;
-
- if (scx_validate_ops(sch, ops))
- goto err_disable;
-
- struct scx_sub_attach_args sub_attach_args = {
- .ops = &sch->ops,
- .cgroup_path = sch->cgrp_path,
- };
-
- ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
- &sub_attach_args);
- if (ret) {
- ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
- scx_error(sch, "parent rejected (%d)", ret);
- goto err_disable;
- }
- sch->sub_attached = true;
-
- scx_bypass(sch, true);
-
- for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
- if (((void (**)(void))ops)[i])
- set_bit(i, sch->has_op);
-
- percpu_down_write(&scx_fork_rwsem);
- scx_cgroup_lock();
-
- /*
- * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
- * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
- */
- set_cgroup_sched(sch_cgroup(sch), sch);
- if (!(cgrp->self.flags & CSS_ONLINE)) {
- scx_error(sch, "cgroup is not online");
- goto err_unlock_and_disable;
- }
-
- /*
- * Initialize tasks for the new child $sch without exiting them for
- * $parent so that the tasks can always be reverted back to $parent
- * sched on child init failure.
- */
- WARN_ON_ONCE(scx_enabling_sub_sched);
- scx_enabling_sub_sched = sch;
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- struct rq *rq;
- struct rq_flags rf;
-
- /*
- * Task iteration may visit the same task twice when racing
- * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
- * finished __scx_init_task() and skip if set.
- *
- * A task may exit and get freed between __scx_init_task()
- * completion and scx_enable_task(). In such cases,
- * scx_disable_and_exit_task() must exit the task for both the
- * parent and child scheds.
- */
- if (p->scx.flags & SCX_TASK_SUB_INIT)
- continue;
-
- /* @p is pinned by the iter; see scx_sub_disable() */
- get_task_struct(p);
-
- if (!assert_task_ready_or_enabled(p)) {
- ret = -EINVAL;
- goto abort;
- }
-
- scx_task_iter_unlock(&sti);
-
- /*
- * As $p is still on $parent, it can't be transitioned to INIT.
- * Let's worry about task state later. Use __scx_init_task().
- */
- ret = __scx_init_task(sch, p, false);
- if (ret)
- goto abort;
-
- rq = task_rq_lock(p, &rf);
-
- if (scx_get_task_state(p) == SCX_TASK_DEAD) {
- /*
- * sched_ext_dead() raced us between __scx_init_task()
- * and this rq lock and ran exit_task() on $parent (the
- * sched @p was on at that point), not on @sch. @sch's
- * just-completed init is owed an exit_task() and we
- * issue it here.
- */
- scx_sub_init_cancel_task(sch, p);
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- continue;
- }
-
- p->scx.flags |= SCX_TASK_SUB_INIT;
- task_rq_unlock(rq, p, &rf);
-
- put_task_struct(p);
- }
- scx_task_iter_stop(&sti);
-
- /*
- * All tasks are prepped. Disable/exit tasks for $parent and enable for
- * the new @sch.
- */
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- /*
- * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
- * duplicate iterations.
- */
- if (!(p->scx.flags & SCX_TASK_SUB_INIT))
- continue;
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- /*
- * $p must be either READY or ENABLED. If ENABLED,
- * __scx_disabled_and_exit_task() first disables and
- * makes it READY. However, after exiting $p, it will
- * leave $p as READY.
- */
- assert_task_ready_or_enabled(p);
- __scx_disable_and_exit_task(parent, p);
-
- /*
- * $p is now only initialized for @sch and READY, which
- * is what we want. Assign it to @sch and enable.
- */
- scx_set_task_sched(p, sch);
- scx_enable_task(sch, p);
-
- p->scx.flags &= ~SCX_TASK_SUB_INIT;
- }
- }
- scx_task_iter_stop(&sti);
-
- scx_enabling_sub_sched = NULL;
-
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-
- scx_bypass(sch, false);
-
- pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
- kobject_uevent(&sch->kobj, KOBJ_ADD);
- ret = 0;
- goto out_unlock;
-
-out_put_cgrp:
- cgroup_put(cgrp);
-out_unlock:
- mutex_unlock(&scx_enable_mutex);
- cmd->ret = ret;
- return;
-
-abort:
- put_task_struct(p);
- scx_task_iter_stop(&sti);
-
- /*
- * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
- * ran for @sch on them, so calling scx_disable_task() here would invoke
- * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
- * must stay set until SUB_INIT is cleared from every marked task -
- * scx_disable_and_exit_task() reads it when a task exits concurrently.
- */
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- if (p->scx.flags & SCX_TASK_SUB_INIT) {
- scx_sub_init_cancel_task(sch, p);
- p->scx.flags &= ~SCX_TASK_SUB_INIT;
- }
- }
- scx_task_iter_stop(&sti);
- scx_enabling_sub_sched = NULL;
-err_unlock_and_disable:
- /* we'll soon enter disable path, keep bypass on */
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-err_disable:
- mutex_unlock(&scx_enable_mutex);
- scx_flush_disable_work(sch);
- cmd->ret = 0;
-}
-
-static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
- unsigned long action, void *data)
-{
- struct cgroup *cgrp = data;
- struct cgroup *parent = cgroup_parent(cgrp);
-
- if (!cgroup_on_dfl(cgrp))
- return NOTIFY_OK;
-
- switch (action) {
- case CGROUP_LIFETIME_ONLINE:
- /* inherit ->scx_sched from $parent */
- if (parent)
- rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
- break;
- case CGROUP_LIFETIME_OFFLINE:
- /* if there is a sched attached, shoot it down */
- if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
- scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
- SCX_ECODE_RSN_CGROUP_OFFLINE,
- "cgroup %llu going offline", cgroup_id(cgrp));
- break;
- }
-
- return NOTIFY_OK;
-}
-
-static struct notifier_block scx_cgroup_lifetime_nb = {
- .notifier_call = scx_cgroup_lifetime_notify,
-};
-
-static s32 __init scx_cgroup_lifetime_notifier_init(void)
-{
- return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
- &scx_cgroup_lifetime_nb);
-}
-core_initcall(scx_cgroup_lifetime_notifier_init);
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
{
static struct kthread_worker *helper;
@@ -7838,20 +7082,6 @@ static int bpf_scx_init_member(const struct btf_type *t,
return 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
-{
- struct scx_sched *sch;
-
- guard(rcu)();
- sch = scx_prog_sched(prog->aux);
- if (unlikely(!sch))
- return;
-
- scx_error(sch, "dispatch recursion detected");
-}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static int bpf_scx_check_member(const struct btf_type *t,
const struct btf_member *member,
const struct bpf_prog *prog)
@@ -9022,45 +8252,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter,
p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
- * @cgroup_id: cgroup ID of the child scheduler to dispatch
- * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
- *
- * Allows a parent scheduler to trigger dispatching on one of its direct
- * child schedulers. The child scheduler runs its dispatch operation to
- * move tasks from dispatch queues to the local runqueue.
- *
- * Returns: true on success, false if cgroup_id is invalid, not a direct
- * child, or caller lacks dispatch permission.
- */
-__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
-{
- struct rq *this_rq = this_rq();
- struct scx_sched *parent, *child;
-
- guard(rcu)();
- parent = scx_prog_sched(aux);
- if (unlikely(!parent))
- return false;
-
- child = scx_find_sub_sched(cgroup_id);
-
- if (unlikely(!child))
- return false;
-
- if (unlikely(scx_parent(child) != parent)) {
- scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
- cgroup_id);
- return false;
- }
-
- return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
- true);
-}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 5d861cb0727d..01c8d6eac8dd 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -11,6 +11,34 @@
#include "../sched.h"
#include "types.h"
+#include <trace/events/sched_ext.h>
+
+/**
+ * scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This can be used when preemption is not disabled.
+ */
+#define scx_add_event(sch, name, cnt) do { \
+ this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
+ trace_sched_ext_event(#name, (cnt)); \
+} while(0)
+
+/**
+ * __scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This should be used only when preemption is disabled.
+ */
+#define __scx_add_event(sch, name, cnt) do { \
+ __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
+ trace_sched_ext_event(#name, cnt); \
+} while(0)
+
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void)))
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
new file mode 100644
index 000000000000..050420427273
--- /dev/null
+++ b/kernel/sched/ext/sub.c
@@ -0,0 +1,668 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * A sub-scheduler is an scx_sched attached to a cgroup subtree under another
+ * scx_sched. This file holds the sub-scheduler implementation: the scheduler
+ * tree walk, capability delegation, per-shard cap state and its sync, and the
+ * sub-scheduler enable/disable paths. The core dispatch/enqueue machinery it
+ * builds on lives in ext.c.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/rhashtable.h>
+#include "internal.h"
+#include "cid.h"
+#include "arena.h"
+#include "sub.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+/**
+ * scx_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @root: sched whose descendants to walk
+ *
+ * To be used by scx_for_each_descendant_pre(). Find the next descendant to
+ * visit for pre-order traversal of @root's descendants. @root is included in
+ * the iteration and the first node to be visited.
+ */
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root)
+{
+ struct scx_sched *next;
+
+ lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
+ lockdep_is_held(&scx_sched_lock));
+
+ /* if first iteration, visit @root */
+ if (!pos)
+ return root;
+
+ /* visit the first child if exists */
+ next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
+ if (next)
+ return next;
+
+ /* no child, visit my or the closest ancestor's next sibling */
+ while (pos != root) {
+ if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
+ return list_next_entry(pos, sibling);
+ pos = scx_parent(pos);
+ }
+
+ return NULL;
+}
+
+static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
+{
+ return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
+ scx_sched_hash_params);
+}
+
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
+{
+ rcu_assign_pointer(p->scx.sched, sch);
+}
+
+struct cgroup *sch_cgroup(struct scx_sched *sch)
+{
+ return sch->cgrp;
+}
+
+/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
+{
+ struct cgroup *pos;
+ struct cgroup_subsys_state *css;
+
+ cgroup_for_each_live_descendant_pre(pos, css, cgrp)
+ rcu_assign_pointer(pos->scx_sched, sch);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
+
+void drain_descendants(struct scx_sched *sch)
+{
+ /*
+ * Child scheds that finished the critical part of disabling will take
+ * themselves off @sch->children. Wait for it to drain. As propagation
+ * is recursive, empty @sch->children means that all proper descendant
+ * scheds reached unlinking stage.
+ */
+ wait_event(scx_unlink_waitq, list_empty(&sch->children));
+}
+
+static void scx_fail_parent(struct scx_sched *sch,
+ struct task_struct *failed, s32 fail_code)
+{
+ struct scx_sched *parent = scx_parent(sch);
+ struct scx_task_iter sti;
+ struct task_struct *p;
+
+ scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
+ fail_code, failed->comm, failed->pid);
+
+ /*
+ * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
+ * it. This may cause downstream failures on the BPF side but $parent is
+ * dying anyway.
+ */
+ scx_bypass(parent, true);
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ if (scx_task_on_sched(parent, p))
+ continue;
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ scx_disable_and_exit_task(sch, p);
+ scx_set_task_sched(p, parent);
+ }
+ }
+ scx_task_iter_stop(&sti);
+}
+
+void scx_sub_disable(struct scx_sched *sch)
+{
+ struct scx_sched *parent = scx_parent(sch);
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ int ret;
+
+ /*
+ * Guarantee forward progress and wait for descendants to be disabled.
+ * To limit disruptions, $parent is not bypassed. Tasks are fully
+ * prepped and then inserted back into $parent.
+ */
+ scx_bypass(sch, true);
+ drain_descendants(sch);
+
+ /*
+ * Here, every runnable task is guaranteed to make forward progress and
+ * we can safely use blocking synchronization constructs. Actually
+ * disable ops.
+ */
+ mutex_lock(&scx_enable_mutex);
+ percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
+
+ set_cgroup_sched(sch_cgroup(sch), parent);
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ /* filter out duplicate visits */
+ if (scx_task_on_sched(parent, p))
+ continue;
+
+ /*
+ * By the time control reaches here, all descendant schedulers
+ * should already have been disabled.
+ */
+ WARN_ON_ONCE(!scx_task_on_sched(sch, p));
+
+ /*
+ * @p is pinned by the iter: css_task_iter_next() takes a
+ * reference and holds it until the next iter_next() call, so
+ * @p->usage is guaranteed > 0.
+ */
+ get_task_struct(p);
+
+ scx_task_iter_unlock(&sti);
+
+ /*
+ * $p is READY or ENABLED on @sch. Initialize for $parent,
+ * disable and exit from @sch, and then switch over to $parent.
+ *
+ * If a task fails to initialize for $parent, the only available
+ * action is disabling $parent too. While this allows disabling
+ * of a child sched to cause the parent scheduler to fail, the
+ * failure can only originate from ops.init_task() of the
+ * parent. A child can't directly affect the parent through its
+ * own failures.
+ */
+ ret = __scx_init_task(parent, p, false);
+ if (ret) {
+ scx_fail_parent(sch, p, ret);
+ put_task_struct(p);
+ break;
+ }
+
+ rq = task_rq_lock(p, &rf);
+
+ if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+ /*
+ * sched_ext_dead() raced us between __scx_init_task()
+ * and this rq lock and ran exit_task() on @sch (the
+ * sched @p was on at that point), not on $parent.
+ * $parent's just-completed init is owed an exit_task()
+ * and we issue it here.
+ */
+ scx_sub_init_cancel_task(parent, p);
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ continue;
+ }
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /*
+ * $p is initialized for $parent and still attached to
+ * @sch. Disable and exit for @sch, switch over to
+ * $parent, override the state to READY to account for
+ * $p having already been initialized, and then enable.
+ */
+ scx_disable_and_exit_task(sch, p);
+ scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
+ scx_set_task_state(p, SCX_TASK_INIT);
+ scx_set_task_sched(p, parent);
+ scx_set_task_state(p, SCX_TASK_READY);
+ scx_enable_task(parent, p);
+ }
+
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ }
+ scx_task_iter_stop(&sti);
+
+ scx_disable_dump(sch);
+
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ /*
+ * All tasks are moved off of @sch but there may still be on-going
+ * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
+ * the expedited version as ancestors may be waiting in bypass mode.
+ * Also, tell the parent that there is no need to keep running bypass
+ * DSQs for us.
+ */
+ synchronize_rcu_expedited();
+ scx_disable_bypass_dsp(sch);
+
+ scx_unlink_sched(sch);
+
+ mutex_unlock(&scx_enable_mutex);
+
+ /*
+ * @sch is now unlinked from the parent's children list. Notify and call
+ * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
+ * after unlinking and releasing all locks. See scx_claim_exit().
+ */
+ wake_up_all(&scx_unlink_waitq);
+
+ if (parent->ops.sub_detach && sch->sub_attached) {
+ struct scx_sub_detach_args sub_detach_args = {
+ .ops = &sch->ops,
+ .cgroup_path = sch->cgrp_path,
+ };
+ SCX_CALL_OP(parent, sub_detach, NULL,
+ &sub_detach_args);
+ }
+
+ scx_log_sched_disable(sch);
+
+ if (sch->ops.exit)
+ SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
+ if (sch->sub_kset)
+ kobject_del(&sch->sub_kset->kobj);
+ kobject_del(&sch->kobj);
+}
+
+/* verify that a scheduler can be attached to @cgrp and return the parent */
+static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
+{
+ struct scx_sched *parent = cgrp->scx_sched;
+ struct scx_sched *pos;
+
+ lockdep_assert_held(&scx_sched_lock);
+
+ /* can't attach twice to the same cgroup */
+ if (parent->cgrp == cgrp)
+ return ERR_PTR(-EBUSY);
+
+ /* does $parent allow sub-scheds? */
+ if (!parent->ops.sub_attach)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ /* can't insert between $parent and its exiting children */
+ list_for_each_entry(pos, &parent->children, sibling)
+ if (cgroup_is_descendant(pos->cgrp, cgrp))
+ return ERR_PTR(-EBUSY);
+
+ return parent;
+}
+
+static bool assert_task_ready_or_enabled(struct task_struct *p)
+{
+ u32 state = scx_get_task_state(p);
+
+ switch (state) {
+ case SCX_TASK_READY:
+ case SCX_TASK_ENABLED:
+ return true;
+ default:
+ WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
+ state, p->comm, p->pid);
+ return false;
+ }
+}
+
+void scx_sub_enable_workfn(struct kthread_work *work)
+{
+ struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
+ struct sched_ext_ops *ops = cmd->ops;
+ struct cgroup *cgrp;
+ struct scx_sched *parent, *sch;
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ s32 i, ret;
+
+ mutex_lock(&scx_enable_mutex);
+
+ if (!scx_enabled()) {
+ ret = -ENODEV;
+ goto out_unlock;
+ }
+
+ /* See scx_root_enable_workfn() for the @ops->priv check. */
+ if (rcu_access_pointer(ops->priv)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
+ if (IS_ERR(cgrp)) {
+ ret = PTR_ERR(cgrp);
+ goto out_unlock;
+ }
+
+ raw_spin_lock_irq(&scx_sched_lock);
+ parent = find_parent_sched(cgrp);
+ if (IS_ERR(parent)) {
+ raw_spin_unlock_irq(&scx_sched_lock);
+ ret = PTR_ERR(parent);
+ goto out_put_cgrp;
+ }
+ kobject_get(&parent->kobj);
+ raw_spin_unlock_irq(&scx_sched_lock);
+
+ /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
+ sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
+ kobject_put(&parent->kobj);
+ if (IS_ERR(sch)) {
+ ret = PTR_ERR(sch);
+ goto out_unlock;
+ }
+
+ ret = scx_link_sched(sch);
+ if (ret)
+ goto err_disable;
+
+ if (sch->level >= SCX_SUB_MAX_DEPTH) {
+ scx_error(sch, "max nesting depth %d violated",
+ SCX_SUB_MAX_DEPTH);
+ goto err_disable;
+ }
+
+ if (sch->ops.init) {
+ ret = SCX_CALL_OP_RET(sch, init, NULL);
+ if (ret) {
+ ret = scx_ops_sanitize_err(sch, "init", ret);
+ scx_error(sch, "ops.init() failed (%d)", ret);
+ goto err_disable;
+ }
+ sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
+ }
+
+ ret = scx_arena_pool_init(sch);
+ if (ret)
+ goto err_disable;
+
+ ret = scx_set_cmask_scratch_alloc(sch);
+ if (ret)
+ goto err_disable;
+
+ if (scx_validate_ops(sch, ops))
+ goto err_disable;
+
+ struct scx_sub_attach_args sub_attach_args = {
+ .ops = &sch->ops,
+ .cgroup_path = sch->cgrp_path,
+ };
+
+ ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
+ &sub_attach_args);
+ if (ret) {
+ ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
+ scx_error(sch, "parent rejected (%d)", ret);
+ goto err_disable;
+ }
+ sch->sub_attached = true;
+
+ scx_bypass(sch, true);
+
+ for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
+ if (((void (**)(void))ops)[i])
+ set_bit(i, sch->has_op);
+
+ percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
+
+ /*
+ * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
+ * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
+ */
+ set_cgroup_sched(sch_cgroup(sch), sch);
+ if (!(cgrp->self.flags & CSS_ONLINE)) {
+ scx_error(sch, "cgroup is not online");
+ goto err_unlock_and_disable;
+ }
+
+ /*
+ * Initialize tasks for the new child $sch without exiting them for
+ * $parent so that the tasks can always be reverted back to $parent
+ * sched on child init failure.
+ */
+ WARN_ON_ONCE(scx_enabling_sub_sched);
+ scx_enabling_sub_sched = sch;
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ /*
+ * Task iteration may visit the same task twice when racing
+ * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
+ * finished __scx_init_task() and skip if set.
+ *
+ * A task may exit and get freed between __scx_init_task()
+ * completion and scx_enable_task(). In such cases,
+ * scx_disable_and_exit_task() must exit the task for both the
+ * parent and child scheds.
+ */
+ if (p->scx.flags & SCX_TASK_SUB_INIT)
+ continue;
+
+ /* @p is pinned by the iter; see scx_sub_disable() */
+ get_task_struct(p);
+
+ if (!assert_task_ready_or_enabled(p)) {
+ ret = -EINVAL;
+ goto abort;
+ }
+
+ scx_task_iter_unlock(&sti);
+
+ /*
+ * As $p is still on $parent, it can't be transitioned to INIT.
+ * Let's worry about task state later. Use __scx_init_task().
+ */
+ ret = __scx_init_task(sch, p, false);
+ if (ret)
+ goto abort;
+
+ rq = task_rq_lock(p, &rf);
+
+ if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+ /*
+ * sched_ext_dead() raced us between __scx_init_task()
+ * and this rq lock and ran exit_task() on $parent (the
+ * sched @p was on at that point), not on @sch. @sch's
+ * just-completed init is owed an exit_task() and we
+ * issue it here.
+ */
+ scx_sub_init_cancel_task(sch, p);
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ continue;
+ }
+
+ p->scx.flags |= SCX_TASK_SUB_INIT;
+ task_rq_unlock(rq, p, &rf);
+
+ put_task_struct(p);
+ }
+ scx_task_iter_stop(&sti);
+
+ /*
+ * All tasks are prepped. Disable/exit tasks for $parent and enable for
+ * the new @sch.
+ */
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ /*
+ * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
+ * duplicate iterations.
+ */
+ if (!(p->scx.flags & SCX_TASK_SUB_INIT))
+ continue;
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /*
+ * $p must be either READY or ENABLED. If ENABLED,
+ * __scx_disable_and_exit_task() first disables and
+ * makes it READY. However, after exiting $p, it will
+ * leave $p as READY.
+ */
+ assert_task_ready_or_enabled(p);
+ __scx_disable_and_exit_task(parent, p);
+
+ /*
+ * $p is now only initialized for @sch and READY, which
+ * is what we want. Assign it to @sch and enable.
+ */
+ scx_set_task_sched(p, sch);
+ scx_enable_task(sch, p);
+
+ p->scx.flags &= ~SCX_TASK_SUB_INIT;
+ }
+ }
+ scx_task_iter_stop(&sti);
+
+ scx_enabling_sub_sched = NULL;
+
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ scx_bypass(sch, false);
+
+ pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
+ kobject_uevent(&sch->kobj, KOBJ_ADD);
+ ret = 0;
+ goto out_unlock;
+
+out_put_cgrp:
+ cgroup_put(cgrp);
+out_unlock:
+ mutex_unlock(&scx_enable_mutex);
+ cmd->ret = ret;
+ return;
+
+abort:
+ put_task_struct(p);
+ scx_task_iter_stop(&sti);
+
+ /*
+ * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
+ * ran for @sch on them, so calling scx_disable_task() here would invoke
+ * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
+ * must stay set until SUB_INIT is cleared from every marked task -
+ * scx_disable_and_exit_task() reads it when a task exits concurrently.
+ */
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ if (p->scx.flags & SCX_TASK_SUB_INIT) {
+ scx_sub_init_cancel_task(sch, p);
+ p->scx.flags &= ~SCX_TASK_SUB_INIT;
+ }
+ }
+ scx_task_iter_stop(&sti);
+ scx_enabling_sub_sched = NULL;
+err_unlock_and_disable:
+ /* we'll soon enter disable path, keep bypass on */
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+err_disable:
+ mutex_unlock(&scx_enable_mutex);
+ scx_flush_disable_work(sch);
+ cmd->ret = 0;
+}
+
+static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct cgroup *cgrp = data;
+ struct cgroup *parent = cgroup_parent(cgrp);
+
+ if (!cgroup_on_dfl(cgrp))
+ return NOTIFY_OK;
+
+ switch (action) {
+ case CGROUP_LIFETIME_ONLINE:
+ /* inherit ->scx_sched from $parent */
+ if (parent)
+ rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
+ break;
+ case CGROUP_LIFETIME_OFFLINE:
+ /* if there is a sched attached, shoot it down */
+ if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
+ scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
+ SCX_ECODE_RSN_CGROUP_OFFLINE,
+ "cgroup %llu going offline", cgroup_id(cgrp));
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block scx_cgroup_lifetime_nb = {
+ .notifier_call = scx_cgroup_lifetime_notify,
+};
+
+static s32 __init scx_cgroup_lifetime_notifier_init(void)
+{
+ return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
+ &scx_cgroup_lifetime_nb);
+}
+core_initcall(scx_cgroup_lifetime_notifier_init);
+
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
+{
+ struct scx_sched *sch;
+
+ guard(rcu)();
+ sch = scx_prog_sched(prog->aux);
+ if (unlikely(!sch))
+ return;
+
+ scx_error(sch, "dispatch recursion detected");
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
+ * @cgroup_id: cgroup ID of the child scheduler to dispatch
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Allows a parent scheduler to trigger dispatching on one of its direct
+ * child schedulers. The child scheduler runs its dispatch operation to
+ * move tasks from dispatch queues to the local runqueue.
+ *
+ * Returns: true on success, false if cgroup_id is invalid, not a direct
+ * child, or caller lacks dispatch permission.
+ */
+__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
+{
+ struct rq *this_rq = this_rq();
+ struct scx_sched *parent, *child;
+
+ guard(rcu)();
+ parent = scx_prog_sched(aux);
+ if (unlikely(!parent))
+ return false;
+
+ child = scx_find_sub_sched(cgroup_id);
+
+ if (unlikely(!child))
+ return false;
+
+ if (unlikely(scx_parent(child) != parent)) {
+ scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
+ cgroup_id);
+ return false;
+ }
+
+ return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
+ true);
+}
+
+__bpf_kfunc_end_defs();
+
+#endif /* CONFIG_EXT_SUB_SCHED */
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
new file mode 100644
index 000000000000..460a9fd196dc
--- /dev/null
+++ b/kernel/sched/ext/sub.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_SUB_H
+#define _KERNEL_SCHED_EXT_SUB_H
+
+#include "internal.h"
+#include "cid.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root);
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch);
+struct cgroup *sch_cgroup(struct scx_sched *sch);
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch);
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog);
+void drain_descendants(struct scx_sched *sch);
+void scx_sub_disable(struct scx_sched *sch);
+void scx_sub_enable_workfn(struct kthread_work *work);
+bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
+
+#else /* CONFIG_EXT_SUB_SCHED */
+
+static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
+static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
+static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
+static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
+static inline void drain_descendants(struct scx_sched *sch) { }
+static inline void scx_sub_disable(struct scx_sched *sch) { }
+
+#endif /* CONFIG_EXT_SUB_SCHED */
+
+/**
+ * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
+ * @pos: iteration cursor
+ * @root: sched to walk the descendants of
+ *
+ * Walk @root's descendants. @root is included in the iteration and the first
+ * node to be visited. Must be called with either scx_enable_mutex or
+ * scx_sched_lock held.
+ */
+#define scx_for_each_descendant_pre(pos, root) \
+ for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
+ (pos) = scx_next_descendant_pre((pos), (root)))
+
+/*
+ * One user of this function is scx_bpf_sub_dispatch() which can be called
+ * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
+ * from the call frame.
+ */
+static __always_inline bool
+scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
+ struct task_struct *prev, bool nested)
+{
+ struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
+ int nr_loops = SCX_DSP_MAX_LOOPS;
+ s32 cpu = cpu_of(rq);
+ bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
+ scx_task_on_sched(sch, prev);
+
+ if (scx_consume_global_dsq(sch, rq))
+ return true;
+
+ if (scx_bypass_dsp_enabled(sch)) {
+ /* if @sch is bypassing, only the bypass DSQs are active */
+ if (scx_bypassing(sch, cpu))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+#ifdef CONFIG_EXT_SUB_SCHED
+ /*
+ * If @sch isn't bypassing but its children are, @sch is
+ * responsible for making forward progress for both its own
+ * tasks that aren't bypassing and the bypassing descendants'
+ * tasks. The following implements a simple built-in behavior -
+ * let each CPU try to run the bypass DSQ every Nth time.
+ *
+ * Later, if necessary, we can add an ops flag to suppress the
+ * auto-consumption and a kfunc to consume the bypass DSQ so
+ * that the BPF scheduler can fully control scheduling of
+ * bypassed tasks.
+ */
+ struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+ if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
+ scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
+ __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
+ return true;
+ }
+#endif /* CONFIG_EXT_SUB_SCHED */
+ }
+
+ if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
+ return false;
+
+ dspc->rq = rq;
+
+ /*
+ * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+ * lock, the local DSQ might still end up empty after a successful
+ * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
+ * produced some tasks, retry. The BPF scheduler may depend on this
+ * looping behavior to simplify its implementation.
+ */
+ do {
+ dspc->nr_tasks = 0;
+
+ if (nested) {
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
+ } else {
+ /* stash @prev so that nested invocations can access it */
+ rq->scx.sub_dispatch_prev = prev;
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
+ rq->scx.sub_dispatch_prev = NULL;
+ }
+
+ scx_flush_dispatch_buf(sch, rq);
+
+ if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
+ rq->scx.flags |= SCX_RQ_BAL_KEEP;
+ return true;
+ }
+ if (rq->scx.local_dsq.nr)
+ return true;
+ if (scx_consume_global_dsq(sch, rq))
+ return true;
+
+ /*
+ * ops.dispatch() can trap us in this loop by repeatedly
+ * dispatching ineligible tasks. Break out once in a while to
+ * allow the watchdog to run. As IRQ can't be enabled in
+ * balance(), we want to complete this scheduling cycle and then
+ * start a new one. IOW, we want to call resched_curr() on the
+ * next, most likely idle, task, not the current one. Use
+ * scx_kick_cpu() for deferred kicking.
+ */
+ if (unlikely(!--nr_loops)) {
+ scx_kick_cpu(sch, cpu, 0);
+ break;
+ }
+ } while (dspc->nr_tasks);
+
+ /*
+ * Prevent the CPU from going idle while bypassed descendants have tasks
+ * queued. Without this fallback, bypassed tasks could stall if the host
+ * scheduler's ops.dispatch() doesn't yield any tasks.
+ */
+ if (scx_bypass_dsp_enabled(sch))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+ return false;
+}
+
+#endif /* _KERNEL_SCHED_EXT_SUB_H */
--
2.54.0
* Re: [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 18:34 ` sashiko-bot
2026-07-01 19:47 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-07-01 18:34 UTC (permalink / raw)
To: Tejun Heo; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
New issues:
- [Low] Compilation failure due to implicit function declarations in `sub.h`.
Pre-existing issues:
- [High] The `scx_bpf_sub_dispatch` kfunc breaks runqueue lock atomicity for the parent BPF program.
- [High] Missing `scx_error()` calls in `scx_sub_enable_workfn()` lead to silent sub-scheduler initialization failures and zombie states.
--
commit c73a80aa8b4b758cfebb3ce05e858b8b1302f361
Author: Tejun Heo <tj@kernel.org>
Subject: sched_ext: Split sub-scheduler implementation into sub.c
This commit moves the sub-scheduler implementation from ext.c into a new
sub.c file. It also moves shared prototypes to sub.h and makes
scx_dispatch_sched() an inline function in sub.h.
> diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
> new file mode 100644
> index 0000000000000..460a9fd196dc6
> --- /dev/null
> +++ b/kernel/sched/ext/sub.h
[ ... ]
> +static __always_inline bool
> +scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
> + struct task_struct *prev, bool nested)
> +{
> + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
> + int nr_loops = SCX_DSP_MAX_LOOPS;
> + s32 cpu = cpu_of(rq);
> + bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
> + scx_task_on_sched(sch, prev);
> +
> + if (scx_consume_global_dsq(sch, rq))
> + return true;
[ ... ]
> + if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
> + return false;
[Severity: Low]
Does this cause a compilation failure due to implicit function declarations?
In kernel/sched/build_policy.c, ext/sub.h is included before ext/ext.c. When
the compiler parses this scx_dispatch_sched() inline function in sub.h,
functions like scx_rq_online(), scx_flush_dispatch_buf(), and scx_kick_cpu()
have not been declared yet because they are defined later as static functions
in ext.c. This will result in an implicit function declaration error.
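The ordering problem described above can be reproduced outside the kernel. The following is a minimal sketch, not code from the patch: `toy_rq`, `toy_rq_online()`, `toy_dispatch()` and `toy_selftest()` are hypothetical stand-ins for a header inline that calls a static helper defined later in the same translation unit. It assumes a C99 or later compiler, where calling an undeclared function is no longer permitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a runqueue; not the kernel's struct rq. */
struct toy_rq {
	bool online;
	int nr_queued;
};

/*
 * Forward declaration. Without this line, the call in toy_dispatch()
 * below would refer to a function the compiler has not seen yet, which
 * is the situation the review describes for the inline in sub.h.
 */
static bool toy_rq_online(const struct toy_rq *rq);

/* Plays the role of the inline helper that lives in the header. */
static inline bool toy_dispatch(const struct toy_rq *rq)
{
	if (!toy_rq_online(rq))
		return false;
	return rq->nr_queued > 0;
}

/* Plays the role of the static helper defined later in the .c file. */
static bool toy_rq_online(const struct toy_rq *rq)
{
	return rq->online;
}

/* Small self-check exercising both the online and offline paths. */
static void toy_selftest(void)
{
	struct toy_rq on = { .online = true, .nr_queued = 1 };
	struct toy_rq off = { .online = false, .nr_queued = 1 };

	assert(toy_dispatch(&on));
	assert(!toy_dispatch(&off));
}
```

Deleting the forward declaration makes the sketch fail to compile on current compilers, which is why a declaration has to be visible before the inline function is parsed.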
> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> new file mode 100644
> index 0000000000000..0504204272732
> --- /dev/null
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> +void scx_sub_enable_workfn(struct kthread_work *work)
> +{
> + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
> + struct sched_ext_ops *ops = cmd->ops;
> + struct cgroup *cgrp;
[ ... ]
> +err_disable:
> + mutex_unlock(&scx_enable_mutex);
> + scx_flush_disable_work(sch);
> + cmd->ret = 0;
> +}
[Severity: High]
This is a pre-existing issue, but do missing scx_error() calls lead to silent
sub-scheduler initialization failures?
If initialization fails (for example, inside scx_arena_pool_init()), the code
jumps to the err_disable label. Unlike scx_root_enable_workfn() which
explicitly calls scx_error() to transition the scheduler to a disabled state
and queue cleanup work, scx_sub_enable_workfn() omits this call.
Consequently, scx_flush_disable_work() does nothing because the exit state
remains SCX_EXIT_NONE, and the function returns success to userspace by
setting cmd->ret = 0, leaving the sub-scheduler permanently half-initialized.
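A simplified model of the failure mode described above, under the review's own assumptions, might look like the sketch below. It is not the kernel implementation: `toy_sched`, `toy_error()`, `toy_flush_disable_work()` and `toy_enable_fail()` are hypothetical names, and the only behavior modeled is that the flush step does nothing unless an exit was recorded first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical exit states; TOY_EXIT_NONE models "no exit recorded". */
enum toy_exit_kind { TOY_EXIT_NONE, TOY_EXIT_ERROR };

struct toy_sched {
	enum toy_exit_kind exit_kind;
	bool torn_down;
};

/* Record an error once; models the step that arms the disable path. */
static void toy_error(struct toy_sched *sch)
{
	if (sch->exit_kind == TOY_EXIT_NONE)
		sch->exit_kind = TOY_EXIT_ERROR;
}

/* Tear down only if an exit was recorded; otherwise there is no work. */
static void toy_flush_disable_work(struct toy_sched *sch)
{
	if (sch->exit_kind == TOY_EXIT_NONE)
		return;
	sch->torn_down = true;
}

/*
 * Error path of an enable attempt. @report selects whether the failure
 * is recorded before flushing. The return value mirrors cmd->ret = 0.
 */
static int toy_enable_fail(struct toy_sched *sch, bool report)
{
	assert(sch);
	if (report)
		toy_error(sch);
	toy_flush_disable_work(sch);
	return 0;
}
```

With `report` false the model returns 0 and leaves `torn_down` unset, which corresponds to the half-initialized state the review is concerned about. With `report` true the same path tears the scheduler down.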
[ ... ]
> +__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
> +{
> + struct rq *this_rq = this_rq();
> + struct scx_sched *parent, *child;
> +
> + guard(rcu)();
> + parent = scx_prog_sched(aux);
> + if (unlikely(!parent))
> + return false;
> +
> + child = scx_find_sub_sched(cgroup_id);
> +
> + if (unlikely(!child))
> + return false;
> +
> + if (unlikely(scx_parent(child) != parent)) {
> + scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
> + cgroup_id);
> + return false;
> + }
> +
> + return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
> + true);
> +}
[Severity: High]
This isn't a bug introduced by this patch, but does the scx_bpf_sub_dispatch()
kfunc break runqueue lock atomicity for the parent BPF program?
The BPF verifier and scheduler core assume that a non-sleepable ops.dispatch
BPF program executes atomically while continuously holding the rq lock.
However, when the parent BPF scheduler calls this kfunc, it evaluates the
child scheduler's dispatch logic which calls scx_flush_dispatch_buf().
Flushing the buffer can invoke dispatch_to_local_dsq(), which drops and
re-acquires the rq lock. Dropping the lock underneath the still-executing
parent BPF program violates the atomicity assumption, which can allow
concurrent CPUs to modify the runqueue.
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701181046.2490390-1-tj@kernel.org?part=4
* Re: [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 18:10 [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
` (3 preceding siblings ...)
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 19:43 ` Andrea Righi
4 siblings, 0 replies; 8+ messages in thread
From: Andrea Righi @ 2026-07-01 19:43 UTC (permalink / raw)
To: Tejun Heo
Cc: David Vernet, Changwoo Min, sched-ext, Emil Tsalapatis,
linux-kernel
Hi Tejun,
On Wed, Jul 01, 2026 at 08:10:42AM -1000, Tejun Heo wrote:
> Hello,
>
> v2: Fold the scx_dispatch_sched() sub.h promotion into the split (patch 4) so
> it is self-contained. v1 left it to a later patch, so the posted split had
> sub.c call an ext.c file-local static (Andrea). __always_inline kept.
> Patches 1-3 unchanged.
>
> v1: https://lore.kernel.org/all/20260701031429.1892218-1-tj@kernel.org
>
> The sub-scheduler implementation has grown and will keep growing. Move it
> out of ext.c into a new kernel/sched/ext/sub.c. The first three patches are
> mechanical prep (prefix file-local helpers, expose shared internals, inline
> a few trivial helpers) so the move itself stays pure code motion. No
> functional change.
Looks good to me.
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Thanks,
-Andrea
>
> Based on sched_ext/for-7.3 (5df6a4506d06) with sched_ext/for-7.2-fixes
> (b7d9c359e5cf) assumed merged.
>
> Tejun Heo (4):
> sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
> sched_ext: Expose the ext.c internals used by the sub.c split
> sched_ext: Inline small ext.c helpers shared across the sub.c split
> sched_ext: Split sub-scheduler implementation into sub.c
>
> kernel/sched/build_policy.c | 2 +
> kernel/sched/ext/ext.c | 1117 +++++--------------------------------------
> kernel/sched/ext/internal.h | 164 ++++++-
> kernel/sched/ext/sub.c | 668 ++++++++++++++++++++++++++
> kernel/sched/ext/sub.h | 161 +++++++
> 5 files changed, 1101 insertions(+), 1011 deletions(-)
>
> Thanks.
>
> --
> tejun
* Re: [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 18:34 ` sashiko-bot
@ 2026-07-01 19:47 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 19:47 UTC (permalink / raw)
To: sched-ext; +Cc: sashiko-reviews
Hello,
> - [Low] Compilation failure due to implicit function declarations in `sub.h`.
Yeah, scx_dispatch_sched() moved into sub.h calls three ext.c statics whose
declarations only land in a later patch, so 4/4 doesn't build on its own.
Another series-slicing mistake on my end. Will fix and post v3.
> - [High] The `scx_bpf_sub_dispatch` kfunc breaks runqueue lock atomicity for the parent BPF program.
ops.dispatch has never held the rq lock across its whole execution, and
scx_bpf_dsq_move_to_local() already drops and reacquires it when consuming a
remote task, so there's no continuous-lock guarantee to break.
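To illustrate the point, here is a userspace sketch of a callback whose helper may drop and retake a lock. It is a toy model and not kernel code: `toy_rq`, `lock_seq`, `toy_move_remote_to_local()` and `toy_dispatch()` are all hypothetical, and the lock is a plain flag rather than a real spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runqueue model; the "lock" is only a flag plus a counter. */
struct toy_rq {
	bool locked;
	unsigned int lock_seq;	/* bumped on every acquisition */
	int nr_local;		/* tasks on the local queue */
};

static void toy_lock(struct toy_rq *rq)
{
	assert(!rq->locked);
	rq->locked = true;
	rq->lock_seq++;
}

static void toy_unlock(struct toy_rq *rq)
{
	assert(rq->locked);
	rq->locked = false;
}

/*
 * Models a helper that needs another queue's lock to pull a task over:
 * it drops the caller's lock and takes it back before returning.
 */
static void toy_move_remote_to_local(struct toy_rq *rq)
{
	toy_unlock(rq);
	/* In a real system another CPU could modify @rq in this window. */
	toy_lock(rq);
	rq->nr_local++;
}

/*
 * Models a dispatch callback entered with the lock held. Returns true
 * if the lock was held continuously for the whole call.
 */
static bool toy_dispatch(struct toy_rq *rq, bool consume_remote)
{
	unsigned int seq = rq->lock_seq;

	if (consume_remote)
		toy_move_remote_to_local(rq);
	else
		rq->nr_local++;

	return seq == rq->lock_seq;
}
```

In this model the callback always returns with the lock held, but only the local-only path holds it continuously. A caller that already tolerates the remote path therefore cannot be relying on continuous locking.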
> - [High] Missing `scx_error()` calls in `scx_sub_enable_workfn()` lead to silent sub-scheduler initialization failures and zombie states.
This will be addressed in a future patch.
Thanks.
--
tejun
end of thread, other threads:[~2026-07-01 19:47 UTC | newest]
Thread overview: 8+ messages
2026-07-01 18:10 [PATCHSET v2 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
2026-07-01 18:10 ` [PATCH v2 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 18:34 ` sashiko-bot
2026-07-01 19:47 ` Tejun Heo
2026-07-01 19:43 ` [PATCHSET v2 sched_ext/for-7.3] " Andrea Righi