* [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
@ 2026-07-01 20:34 Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
` (4 more replies)
0 siblings, 5 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
Hello,
v3: Patch 2 also exposes scx_rq_online(), scx_flush_dispatch_buf() and
scx_kick_cpu() which scx_dispatch_sched() in sub.h calls, so that patch
4 builds on its own (sashiko AI). Added Andrea's Reviewed-by. Patches
1, 3 and 4 are otherwise unchanged.
v2: https://lore.kernel.org/all/20260701181046.2490390-1-tj@kernel.org
v1: https://lore.kernel.org/all/20260701031429.1892218-1-tj@kernel.org
The sub-scheduler implementation has grown and will keep growing. Move it
out of ext.c into a new kernel/sched/ext/sub.c. The first three patches are
mechanical prep (prefix file-local helpers, expose shared internals, inline
a few trivial helpers) so the move itself stays pure code motion. No
functional change.
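For readers unfamiliar with the inlining step, here is a minimal sketch of the general pattern, not code from this series. The demo_sched type and demo_bypassing() are hypothetical; which helpers patch 3 actually inlines is not shown here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scheduler object; not the real struct scx_sched. */
struct demo_sched { int bypass_depth; };

/*
 * A trivial predicate defined as static inline in a shared header: every
 * .c file that includes the header gets its own copy, so no exported
 * symbol or separate prototype is needed.
 */
static inline bool demo_bypassing(const struct demo_sched *sch)
{
	assert(sch);
	return sch->bypass_depth > 0;
}
```

A helper this small is a reasonable candidate for a header because its body is no larger than the call it replaces.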
Based on sched_ext/for-7.3 (5df6a4506d06) with sched_ext/for-7.2-fixes
(b7d9c359e5cf) assumed merged.
Tejun Heo (4):
sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
sched_ext: Expose the ext.c internals used by the sub.c split
sched_ext: Inline small ext.c helpers shared across the sub.c split
sched_ext: Split sub-scheduler implementation into sub.c
kernel/sched/build_policy.c | 2 +
kernel/sched/ext/ext.c | 1122 ++++-------------------------------
kernel/sched/ext/internal.h | 167 +++++-
kernel/sched/ext/sub.c | 668 +++++++++++++++++++++
kernel/sched/ext/sub.h | 161 +++++
5 files changed, 1106 insertions(+), 1014 deletions(-)
Thanks.
--
tejun
* [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
` (3 subsequent siblings)
4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
A later change moves the sub-scheduler implementation out of ext.c into its
own file, from where it calls a number of file-local ext.c helpers. Give
those helpers the scx_ prefix that cross-file sched_ext symbols carry, ahead
of the move so the mechanical rename stays out of the code-motion patch. No
functional change.
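As a rough illustration of the rename (a sketch, not code from this patch), the block below uses made-up demo_* types in place of struct scx_sched and its per-CPU data. Only the helper name mirrors the diff, and DEMO_NR_CPUS is an assumption for the example.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4	/* hypothetical, for illustration only */

/* Hypothetical stand-ins for the real sched_ext structures. */
struct demo_dsq { unsigned int nr; };
struct demo_pcpu { struct demo_dsq bypass_dsq; };
struct demo_sched { struct demo_pcpu pcpu[DEMO_NR_CPUS]; };

/*
 * Before the rename this would have been bypass_dsq(). The scx_ prefix
 * keeps the name unambiguous once it becomes visible outside its file.
 * The helper is still static at this point; only the name changes.
 */
static struct demo_dsq *scx_bypass_dsq(struct demo_sched *sch, int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return &sch->pcpu[cpu].bypass_dsq;
}
```

Doing the rename first means the later code-motion patch can be reviewed as a pure move, with no identifier changes mixed in.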
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
kernel/sched/ext/ext.c | 192 ++++++++++++++++++------------------
kernel/sched/ext/internal.h | 2 +-
2 files changed, 97 insertions(+), 97 deletions(-)
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4e0cd08a6a2e..56e6a13fd0f8 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -369,7 +369,7 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
return __setscheduler_class(p->policy, p->prio);
}
-static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu)
+static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
{
return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
}
@@ -392,11 +392,11 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
sch = scx_parent(sch);
#endif /* CONFIG_EXT_SUB_SCHED */
- return bypass_dsq(sch, cpu);
+ return scx_bypass_dsq(sch, cpu);
}
/**
- * bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
* @sch: scheduler to check
*
* When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
@@ -408,9 +408,9 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
*
* This function checks bypass_dsp_enable_depth which is managed separately from
* bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * disable_bypass_dsp().
+ * scx_disable_bypass_dsp().
*/
-static bool bypass_dsp_enabled(struct scx_sched *sch)
+static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
{
return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
}
@@ -1079,7 +1079,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
}
/**
- * ops_sanitize_err - Sanitize a -errno value
+ * scx_ops_sanitize_err - Sanitize a -errno value
* @sch: scx_sched to error out on error
* @ops_name: operation to blame on failure
* @err: -errno value to sanitize
@@ -1091,7 +1091,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
* value fails IS_ERR() test after being encoded with ERR_PTR() and then is
* handled as a pointer.
*/
-static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
{
if (err < 0 && err >= -MAX_ERRNO)
return err;
@@ -1251,7 +1251,7 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
schedule_deferred(rq);
}
-static void schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
{
struct scx_sched *root = rcu_dereference_sched(scx_root);
@@ -1347,8 +1347,8 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
* to the CPU or dequeued. In both cases, the only way @p can go back to
* the BPF sched is through enqueueing. If being inserted into a local
* DSQ with IMMED, persist the state until the next enqueueing event in
- * do_enqueue_task() so that we can maintain IMMED protection through
- * e.g. SAVE/RESTORE cycles and slice extensions.
+ * scx_do_enqueue_task() so that we can maintain IMMED protection
+ * through e.g. SAVE/RESTORE cycles and slice extensions.
*/
if (enq_flags & SCX_ENQ_IMMED) {
if (unlikely(dsq->id != SCX_DSQ_LOCAL)) {
@@ -1371,7 +1371,7 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
* done yet, @p can't go on the CPU immediately. Re-enqueue.
*/
if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags)))
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
}
}
@@ -1488,9 +1488,9 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
}
}
-static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, struct task_struct *p,
- u64 enq_flags)
+static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, struct task_struct *p,
+ u64 enq_flags)
{
bool is_local = dsq->id == SCX_DSQ_LOCAL;
@@ -1638,7 +1638,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
}
}
-static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
+static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
{
struct scx_dispatch_q *dsq = p->scx.dsq;
bool is_local = dsq == &rq->scx.local_dsq;
@@ -1692,8 +1692,8 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
}
/*
- * Abbreviated version of dispatch_dequeue() that can be used when both @p's rq
- * and dsq are locked.
+ * Abbreviated version of scx_dispatch_dequeue() that can be used when both
+ * @p's rq and dsq are locked.
*/
static void dispatch_dequeue_locked(struct task_struct *p,
struct scx_dispatch_q *dsq)
@@ -1774,10 +1774,10 @@ static void mark_direct_dispatch(struct scx_sched *sch,
* - direct_dispatch(): cleared on the synchronous enqueue path, deferred
* dispatch keeps the state until consumed
* - process_ddsp_deferred_locals(): cleared after consuming deferred state,
- * - do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
+ * - scx_do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
* verdict is ignored (local/global/bypass)
- * - dequeue_task_scx(): cleared after dispatch_dequeue(), covering deferred
- * cancellation and holding_cpu races
+ * - dequeue_task_scx(): cleared after scx_dispatch_dequeue(), covering
+ * deferred cancellation and holding_cpu races
* - scx_disable_task(): cleared for queued wakeup tasks, which are excluded by
* the scx_bypass() loop, so that stale state is not reused by a subsequent
* scheduler instance
@@ -1838,7 +1838,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
ddsp_enq_flags = p->scx.ddsp_enq_flags;
clear_direct_dispatch(p);
- dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
static bool scx_rq_online(struct rq *rq)
@@ -1853,8 +1853,8 @@ static bool scx_rq_online(struct rq *rq)
return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq)));
}
-static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
- int sticky_cpu)
+static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+ int sticky_cpu)
{
struct scx_sched *sch = scx_task_sched(p);
struct task_struct **ddsp_taskp;
@@ -1941,7 +1941,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
direct_dispatch(sch, p, enq_flags);
return;
local_norefill:
- dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
return;
local:
dsq = &rq->scx.local_dsq;
@@ -1962,7 +1962,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
touch_core_sched(rq, p);
refill_task_slice_dfl(sch, p);
clear_direct_dispatch(p);
- dispatch_enqueue(sch, rq, dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags);
}
static bool task_runnable(const struct task_struct *p)
@@ -2031,7 +2031,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
if (rq->scx.nr_running == 1)
dl_server_start(&rq->ext_server);
- do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+ scx_do_enqueue_task(rq, p, enq_flags, sticky_cpu);
if (sticky_cpu >= 0)
p->scx.sticky_cpu = -1;
@@ -2167,7 +2167,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
rq->scx.nr_running--;
sub_nr_running(rq, 1);
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
clear_direct_dispatch(p);
return true;
}
@@ -2215,7 +2215,7 @@ static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_fl
* - A higher-priority wakes up while SCX dispatch is in progress.
*/
if (rq->scx.nr_immed)
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
}
static void move_local_task_to_local_dsq(struct scx_sched *sch,
@@ -2380,7 +2380,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
* values afterwards, as this operation can't be preempted or recurse, the
* holding_cpu can never become this CPU again before we're done. Thus, we can
* tell whether we lost to dequeue by testing whether the holding_cpu still
- * points to this CPU. See dispatch_dequeue() for the counterpart.
+ * points to this CPU. See scx_dispatch_dequeue() for the counterpart.
*
* On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is
* still valid. %false if lost to dequeue.
@@ -2485,14 +2485,14 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
dispatch_dequeue_locked(p, src_dsq);
raw_spin_unlock(&src_dsq->lock);
- dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
+ scx_dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
}
return dst_rq;
}
-static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, u64 enq_flags)
+static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags)
{
struct task_struct *p;
retry:
@@ -2538,11 +2538,11 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
return false;
}
-static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
{
int node = cpu_to_node(cpu_of(rq));
- return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
+ return scx_consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
}
/**
@@ -2575,8 +2575,8 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
* If dispatching to @rq that @p is already on, no lock dancing needed.
*/
if (rq == src_rq && rq == dst_rq) {
- dispatch_enqueue(sch, rq, dst_dsq, p,
- enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dst_dsq, p,
+ enq_flags | SCX_ENQ_CLEAR_OPSS);
return;
}
@@ -2614,13 +2614,13 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
*/
if (src_rq == dst_rq) {
p->scx.holding_cpu = -1;
- dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
- enq_flags);
+ scx_dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
+ enq_flags);
} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
p->scx.holding_cpu = -1;
fallback = true;
- dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
- p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
+ scx_dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
+ p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
} else {
move_remote_task_to_local_dsq(p, enq_flags,
src_rq, dst_rq);
@@ -2708,10 +2708,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
goto retry;
case SCX_OPSS_QUEUEING:
/*
- * do_enqueue_task() is in the process of transferring the task
- * to the BPF scheduler while holding @p's rq lock. As we aren't
- * holding any kernel or BPF resource that the enqueue path may
- * depend upon, it's safe to wait.
+ * scx_do_enqueue_task() is in the process of transferring the
+ * task to the BPF scheduler while holding @p's rq lock. As we
+ * aren't holding any kernel or BPF resource that the enqueue
+ * path may depend upon, it's safe to wait.
*/
wait_ops_state(p, opss);
goto retry;
@@ -2724,10 +2724,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
if (dsq->id == SCX_DSQ_LOCAL)
dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
else
- dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+ scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
-static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
+static void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
{
struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
u32 u;
@@ -2771,13 +2771,13 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
scx_task_on_sched(sch, prev);
- if (consume_global_dsq(sch, rq))
+ if (scx_consume_global_dsq(sch, rq))
return true;
- if (bypass_dsp_enabled(sch)) {
+ if (scx_bypass_dsp_enabled(sch)) {
/* if @sch is bypassing, only the bypass DSQs are active */
if (scx_bypassing(sch, cpu))
- return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
#ifdef CONFIG_EXT_SUB_SCHED
/*
@@ -2795,7 +2795,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
- consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) {
+ scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
__scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
return true;
}
@@ -2808,8 +2808,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
dspc->rq = rq;
/*
- * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
- * the local DSQ might still end up empty after a successful
+ * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+ * lock, the local DSQ might still end up empty after a successful
* ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
* produced some tasks, retry. The BPF scheduler may depend on this
* looping behavior to simplify its implementation.
@@ -2828,7 +2828,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
rq->scx.sub_dispatch_prev = NULL;
}
- flush_dispatch_buf(sch, rq);
+ scx_flush_dispatch_buf(sch, rq);
if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
rq->scx.flags |= SCX_RQ_BAL_KEEP;
@@ -2836,7 +2836,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
}
if (rq->scx.local_dsq.nr)
return true;
- if (consume_global_dsq(sch, rq))
+ if (scx_consume_global_dsq(sch, rq))
return true;
/*
@@ -2859,8 +2859,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
* queued. Without this fallback, bypassed tasks could stall if the host
* scheduler's ops.dispatch() doesn't yield any tasks.
*/
- if (bypass_dsp_enabled(sch))
- return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+ if (scx_bypass_dsp_enabled(sch))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
return false;
}
@@ -2939,7 +2939,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
* between the IMMED queueing and the subsequent scheduling event.
*/
if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed))
- schedule_reenq_local(rq, 0);
+ scx_schedule_reenq_local(rq, 0);
rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
return true;
@@ -2955,7 +2955,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
* dispatched. Call ops_dequeue() to notify the BPF scheduler.
*/
ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
}
p->se.exec_start = rq_clock_task(rq);
@@ -3067,10 +3067,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) {
if (p->scx.flags & SCX_TASK_IMMED) {
p->scx.flags |= SCX_TASK_REENQ_PREEMPTED;
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
} else {
- dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+ scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
}
goto switch_class;
}
@@ -3088,9 +3088,9 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
WARN_ON_ONCE(sched_cpu_cookie_match(rq, p) &&
!(sch->ops.flags & SCX_OPS_ENQ_LAST));
- do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
} else {
- do_enqueue_task(rq, p, 0, -1);
+ scx_do_enqueue_task(rq, p, 0, -1);
}
}
@@ -3562,7 +3562,7 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args);
if (unlikely(ret)) {
- ret = ops_sanitize_err(sch, "init_task", ret);
+ ret = scx_ops_sanitize_err(sch, "init_task", ret);
return ret;
}
}
@@ -4107,7 +4107,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
if (!local_task_should_reenq(p, &reenq_flags, &reason))
continue;
- dispatch_dequeue(rq, p);
+ scx_dispatch_dequeue(rq, p);
if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
@@ -4119,7 +4119,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
list_del_init(&p->scx.dsq_list.node);
- do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
nr_enqueued++;
@@ -4234,7 +4234,7 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
p->scx.flags |= reason;
- do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
+ scx_do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
@@ -4354,7 +4354,7 @@ int scx_tg_online(struct task_group *tg)
ret = SCX_CALL_OP_RET(sch, cgroup_init,
NULL, tg->css.cgroup, &args);
if (ret)
- ret = ops_sanitize_err(sch, "cgroup_init", ret);
+ ret = scx_ops_sanitize_err(sch, "cgroup_init", ret);
}
if (ret == 0)
tg->scx.flags |= SCX_TG_ONLINE | SCX_TG_INITED;
@@ -4422,7 +4422,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
p->scx.cgrp_moving_from = NULL;
}
- return ops_sanitize_err(sch, "cgroup_prep_move", ret);
+ return scx_ops_sanitize_err(sch, "cgroup_prep_move", ret);
}
void scx_cgroup_move_task(struct task_struct *p)
@@ -4700,7 +4700,7 @@ static void destroy_dsq(struct scx_sched *sch, u64 dsq_id)
goto out_unlock_dsq;
/*
- * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+ * Mark dead by invalidating ->id to prevent scx_dispatch_enqueue() from
* queueing more tasks. As this function can be called from anywhere,
* freeing is bounced through an irq work to avoid nesting RCU
* operations inside scheduler locks.
@@ -4928,7 +4928,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
*/
WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
- exit_dsq(bypass_dsq(sch, cpu));
+ exit_dsq(scx_bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
@@ -5239,7 +5239,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
u32 nr_donor_target, u32 nr_donee_target)
{
struct rq *donor_rq = cpu_rq(donor);
- struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor);
+ struct scx_dispatch_q *donor_dsq = scx_bypass_dsq(sch, donor);
struct task_struct *p, *n;
struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0);
s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
@@ -5287,7 +5287,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
if (donee >= nr_cpu_ids)
continue;
- donee_dsq = bypass_dsq(sch, donee);
+ donee_dsq = scx_bypass_dsq(sch, donee);
/*
* $p's rq is not locked but $p's DSQ lock protects its
@@ -5308,7 +5308,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
* between bypass DSQs.
*/
dispatch_dequeue_locked(p, donor_dsq);
- dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
+ scx_dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
/*
* $donee might have been idle and need to be woken up. No need
@@ -5351,7 +5351,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
/* count the target tasks and CPUs */
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+ u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
nr_tasks += nr;
nr_cpus++;
@@ -5373,7 +5373,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
cpumask_clear(donee_mask);
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target)
+ if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) < nr_target)
cpumask_set_cpu(cpu, donee_mask);
}
@@ -5384,7 +5384,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
break;
if (cpumask_test_cpu(cpu, donee_mask))
continue;
- if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
+ if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
continue;
nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask,
@@ -5395,7 +5395,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
resched_cpu(cpu);
for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
- u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+ u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
after_min = min(nr, after_min);
after_max = max(nr, after_max);
@@ -5421,7 +5421,7 @@ static void scx_bypass_lb_timerfn(struct timer_list *timer)
int node;
u32 intv_us;
- if (!bypass_dsp_enabled(sch))
+ if (!scx_bypass_dsp_enabled(sch))
return;
for_each_node_with_cpus(node)
@@ -5487,9 +5487,9 @@ static void enable_bypass_dsp(struct scx_sched *sch)
* dispatch enabled while a descendant is bypassing, which is all that's
* required.
*
- * bypass_dsp_enabled() test is used to determine whether to enter the
- * bypass dispatch handling path from both bypassing and hosting scheds.
- * Bump enable depth on both @sch and bypass dispatch host.
+ * scx_bypass_dsp_enabled() test is used to determine whether to enter
+ * the bypass dispatch handling path from both bypassing and hosting
+ * scheds. Bump enable depth on both @sch and bypass dispatch host.
*/
ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
WARN_ON_ONCE(ret <= 0);
@@ -5509,7 +5509,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
}
/* may be called without holding scx_bypass_lock */
-static void disable_bypass_dsp(struct scx_sched *sch)
+static void scx_disable_bypass_dsp(struct scx_sched *sch)
{
s32 ret;
@@ -5654,7 +5654,7 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
/* disarming must come after moving all tasks out of the bypass DSQs */
if (!bypass)
- disable_bypass_dsp(sch);
+ scx_disable_bypass_dsp(sch);
unlock:
raw_spin_unlock_irqrestore(&scx_bypass_lock, flags);
}
@@ -6003,7 +6003,7 @@ static void scx_sub_disable(struct scx_sched *sch)
* DSQs for us.
*/
synchronize_rcu_expedited();
- disable_bypass_dsp(sch);
+ scx_disable_bypass_dsp(sch);
scx_unlink_sched(sch);
@@ -6810,7 +6810,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
}
for_each_possible_cpu(cpu) {
- ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+ ret = init_dsq(scx_bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
if (ret) {
bypass_fail_cpu = cpu;
goto err_free_pcpu;
@@ -6963,7 +6963,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
for_each_possible_cpu(cpu) {
if (cpu == bypass_fail_cpu)
break;
- exit_dsq(bypass_dsq(sch, cpu));
+ exit_dsq(scx_bypass_dsq(sch, cpu));
}
free_percpu(sch->pcpu);
err_free_pnode:
@@ -7007,7 +7007,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
return 0;
}
-static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
{
/*
* It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
@@ -7170,7 +7170,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
if (sch->ops.init) {
ret = SCX_CALL_OP_RET(sch, init, NULL);
if (ret) {
- ret = ops_sanitize_err(sch, "init", ret);
+ ret = scx_ops_sanitize_err(sch, "init", ret);
cpus_read_unlock();
scx_error(sch, "ops.init() failed (%d)", ret);
goto err_disable;
@@ -7203,7 +7203,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
cpus_read_unlock();
- ret = validate_ops(sch, ops);
+ ret = scx_validate_ops(sch, ops);
if (ret)
goto err_disable;
@@ -7545,7 +7545,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
if (sch->ops.init) {
ret = SCX_CALL_OP_RET(sch, init, NULL);
if (ret) {
- ret = ops_sanitize_err(sch, "init", ret);
+ ret = scx_ops_sanitize_err(sch, "init", ret);
scx_error(sch, "ops.init() failed (%d)", ret);
goto err_disable;
}
@@ -7560,7 +7560,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
if (ret)
goto err_disable;
- if (validate_ops(sch, ops))
+ if (scx_validate_ops(sch, ops))
goto err_disable;
struct scx_sub_attach_args sub_attach_args = {
@@ -7571,7 +7571,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
&sub_attach_args);
if (ret) {
- ret = ops_sanitize_err(sch, "sub_attach", ret);
+ ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
scx_error(sch, "parent rejected (%d)", ret);
goto err_disable;
}
@@ -8830,7 +8830,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
/*
* If the BPF scheduler keeps calling this function repeatedly, it can
- * cause similar live-lock conditions as consume_dispatch_q().
+ * cause similar live-lock conditions as scx_consume_dispatch_q().
*/
if (unlikely(READ_ONCE(sch->aborting)))
return false;
@@ -8991,7 +8991,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
- flush_dispatch_buf(sch, dspc->rq);
+ scx_flush_dispatch_buf(sch, dspc->rq);
dsq = find_user_dsq(sch, dsq_id);
if (unlikely(!dsq)) {
@@ -8999,7 +8999,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
return false;
}
- if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
+ if (scx_consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
/*
* A successfully consumed task can be dequeued before it starts
* running while the CPU is trying to migrate other dispatched
@@ -10683,7 +10683,7 @@ static int __init scx_init(void)
/* @priv tail must align since both share the same data block */
CID_OFFSET_MATCH(priv, priv);
/*
- * cid-form must end exactly at @priv - validate_ops() skips
+ * cid-form must end exactly at @priv - scx_validate_ops() skips
* cpu_acquire/cpu_release for cid-form because reading those fields
* past the BPF allocation would be UB.
*/
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 0256931a379a..743980dc60b0 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1172,7 +1172,7 @@ struct scx_sched {
u64 bypass_timestamp;
s32 bypass_depth;
- /* bypass dispatch path enable state, see bypass_dsp_enabled() */
+ /* bypass dispatch path enable state, see scx_bypass_dsp_enabled() */
unsigned long bypass_dsp_claim;
atomic_t bypass_dsp_enable_depth;
--
2.54.0
* [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used by the sub.c split
2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
` (2 subsequent siblings)
4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The sub-scheduler implementation is about to move into its own sub.c, from
where it calls a set of ext.c helpers and shares a few ext.c globals. Make
those reachable across the new file boundary ahead of the move.
No functional change.
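The general shape of "exposing" a helper can be sketched in a single file as below. This is an illustration under assumptions, not the patch itself: the demo_* names and the mask value are hypothetical, and the section comments only indicate where each piece would live in a multi-file layout.

```c
#include <assert.h>

/* ---- would live in a shared header such as internal.h ---- */
#define DEMO_TASK_STATE_MASK 0x3u	/* hypothetical mask value */

struct demo_task { unsigned int flags; };

/* Declaration that lets another .c file call the helper. */
unsigned int demo_get_task_state(const struct demo_task *p);

/* ---- would stay in ext.c ---- */
/* Previously "static"; external linkage makes it callable from sub.c. */
unsigned int demo_get_task_state(const struct demo_task *p)
{
	return p->flags & DEMO_TASK_STATE_MASK;
}

/* ---- would live in sub.c ---- */
/* A caller in the new file that relies only on the header declaration. */
int demo_task_is_state(const struct demo_task *p, unsigned int state)
{
	assert(p);
	return demo_get_task_state(p) == state;
}
```

Dropping static and adding the prototype changes linkage only, which is why a step like this can be described as having no functional change.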
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
v3: Also expose scx_rq_online(), scx_flush_dispatch_buf() and scx_kick_cpu()
which scx_dispatch_sched() in sub.h calls (sashiko AI).
kernel/sched/ext/ext.c | 116 ++++++++++++------------------------
kernel/sched/ext/internal.h | 80 +++++++++++++++++++++++++
2 files changed, 119 insertions(+), 77 deletions(-)
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 56e6a13fd0f8..bdbc66466962 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -20,7 +20,7 @@
#include "arena.h"
#include "idle.h"
-static DEFINE_RAW_SPINLOCK(scx_sched_lock);
+DEFINE_RAW_SPINLOCK(scx_sched_lock);
/*
* NOTE: sched_ext is in the process of growing multiple scheduler support and
@@ -39,14 +39,14 @@ struct scx_sched __rcu *scx_root;
static LIST_HEAD(scx_sched_all);
#ifdef CONFIG_EXT_SUB_SCHED
-static const struct rhashtable_params scx_sched_hash_params = {
+const struct rhashtable_params scx_sched_hash_params = {
.key_len = sizeof_field(struct scx_sched, ops.sub_cgroup_id),
.key_offset = offsetof(struct scx_sched, ops.sub_cgroup_id),
.head_offset = offsetof(struct scx_sched, hash_node),
.insecure_elasticity = true, /* inserted under scx_sched_lock */
};
-static struct rhashtable scx_sched_hash;
+struct rhashtable scx_sched_hash;
#endif
/* see SCX_OPS_TID_TO_TASK */
@@ -68,9 +68,9 @@ static DEFINE_RAW_SPINLOCK(scx_tasks_lock);
static LIST_HEAD(scx_tasks);
/* ops enable/disable */
-static DEFINE_MUTEX(scx_enable_mutex);
+DEFINE_MUTEX(scx_enable_mutex);
DEFINE_STATIC_KEY_FALSE(__scx_enabled);
-DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+DEFINE_PERCPU_RWSEM(scx_fork_rwsem);
static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
static DEFINE_RAW_SPINLOCK(scx_bypass_lock);
static bool scx_init_task_enabled;
@@ -101,7 +101,7 @@ static atomic64_t scx_tid_cursor = ATOMIC64_INIT(1);
* tasks for the sub-sched being enabled. Use a global variable instead of a
* per-task field as all enables are serialized.
*/
-static struct scx_sched *scx_enabling_sub_sched;
+struct scx_sched *scx_enabling_sub_sched;
#else
#define scx_enabling_sub_sched (struct scx_sched *)NULL
#endif /* CONFIG_EXT_SUB_SCHED */
@@ -242,7 +242,6 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microsecond
static void run_deferred(struct rq *rq);
static bool task_dead_and_done(struct task_struct *p);
-static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
__printf(5, 6) bool __scx_exit(struct scx_sched *sch,
@@ -676,12 +675,12 @@ struct bpf_iter_scx_dsq {
} __attribute__((aligned(8)));
-static u32 scx_get_task_state(const struct task_struct *p)
+u32 scx_get_task_state(const struct task_struct *p)
{
return p->scx.flags & SCX_TASK_STATE_MASK;
}
-static void scx_set_task_state(struct task_struct *p, u32 state)
+void scx_set_task_state(struct task_struct *p, u32 state)
{
u32 prev_state = scx_get_task_state(p);
bool warn = false;
@@ -721,23 +720,6 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
p->scx.flags |= state;
}
-/*
- * SCX task iterator.
- */
-struct scx_task_iter {
- struct sched_ext_entity cursor;
- struct task_struct *locked_task;
- struct rq *rq;
- struct rq_flags rf;
- u32 cnt;
- bool list_locked;
-#ifdef CONFIG_EXT_SUB_SCHED
- struct cgroup *cgrp;
- struct cgroup_subsys_state *css_pos;
- struct css_task_iter css_iter;
-#endif
-};
-
/**
* scx_task_iter_start - Lock scx_tasks_lock and start a task iteration
* @iter: iterator to init
@@ -766,7 +748,7 @@ struct scx_task_iter {
* All tasks which existed when the iteration started are guaranteed to be
* visited as long as they are not dead.
*/
-static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
{
memset(iter, 0, sizeof(*iter));
@@ -805,7 +787,7 @@ static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
* This function can be safely called anytime during an iteration. The next
* iterator operation will automatically restore the necessary locking.
*/
-static void scx_task_iter_unlock(struct scx_task_iter *iter)
+void scx_task_iter_unlock(struct scx_task_iter *iter)
{
__scx_task_iter_rq_unlock(iter);
if (iter->list_locked) {
@@ -848,7 +830,7 @@ static void scx_task_iter_relock(struct scx_task_iter *iter,
* which is released on return. If the iterator holds a task's rq lock, that rq
* lock is also released. See scx_task_iter_start() for details.
*/
-static void scx_task_iter_stop(struct scx_task_iter *iter)
+void scx_task_iter_stop(struct scx_task_iter *iter)
{
#ifdef CONFIG_EXT_SUB_SCHED
if (iter->cgrp) {
@@ -923,7 +905,7 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
* whether they would like to filter out dead tasks. See scx_task_iter_start()
* for details.
*/
-static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
{
struct task_struct *p;
@@ -1186,8 +1168,8 @@ static void schedule_deferred_locked(struct rq *rq)
schedule_deferred(rq);
}
-static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
- u64 reenq_flags, struct rq *locked_rq)
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+ u64 reenq_flags, struct rq *locked_rq)
{
struct rq *rq;
@@ -1841,7 +1823,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
scx_dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
}
-static bool scx_rq_online(struct rq *rq)
+bool scx_rq_online(struct rq *rq)
{
/*
* Test both cpu_active() and %SCX_RQ_ONLINE. %SCX_RQ_ONLINE indicates
@@ -2491,8 +2473,8 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
return dst_rq;
}
-static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
- struct scx_dispatch_q *dsq, u64 enq_flags)
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags)
{
struct task_struct *p;
retry:
@@ -2538,7 +2520,7 @@ static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
return false;
}
-static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
{
int node = cpu_to_node(cpu_of(rq));
@@ -2727,7 +2709,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
}
-static void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
+void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
{
struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
u32 u;
@@ -3548,7 +3530,7 @@ static struct cgroup *tg_cgrp(struct task_group *tg)
#endif /* CONFIG_EXT_GROUP_SCHED */
-static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
{
int ret;
@@ -3631,7 +3613,7 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight);
}
-static void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
{
__scx_enable_task(sch, p);
scx_set_task_state(p, SCX_TASK_ENABLED);
@@ -3665,8 +3647,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
}
-static void __scx_disable_and_exit_task(struct scx_sched *sch,
- struct task_struct *p)
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
{
struct scx_exit_task_args args = {
.cancelled = false,
@@ -3700,7 +3681,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch,
* ran. The task state has not been transitioned, so this mirrors the
* SCX_TASK_INIT branch in __scx_disable_and_exit_task().
*/
-static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
{
struct scx_exit_task_args args = { .cancelled = true };
@@ -3711,8 +3692,7 @@ static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *
SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args);
}
-static void scx_disable_and_exit_task(struct scx_sched *sch,
- struct task_struct *p)
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
{
__scx_disable_and_exit_task(sch, p);
@@ -4525,7 +4505,7 @@ static struct cgroup *root_cgroup(void)
return &cgrp_dfl_root.cgrp;
}
-static void scx_cgroup_lock(void)
+void scx_cgroup_lock(void)
{
#ifdef CONFIG_EXT_GROUP_SCHED
percpu_down_write(&scx_cgroup_ops_rwsem);
@@ -4533,7 +4513,7 @@ static void scx_cgroup_lock(void)
cgroup_lock();
}
-static void scx_cgroup_unlock(void)
+void scx_cgroup_unlock(void)
{
cgroup_unlock();
#ifdef CONFIG_EXT_GROUP_SCHED
@@ -4851,7 +4831,7 @@ static void free_exit_info(struct scx_exit_info *ei);
static const char *scx_exit_reason(enum scx_exit_kind kind);
static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind);
-static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
{
size_t size = struct_size_t(struct scx_cmask, bits,
SCX_CMASK_NR_WORDS(num_possible_cpus()));
@@ -5509,7 +5489,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
}
/* may be called without holding scx_bypass_lock */
-static void scx_disable_bypass_dsp(struct scx_sched *sch)
+void scx_disable_bypass_dsp(struct scx_sched *sch)
{
s32 ret;
@@ -5557,7 +5537,7 @@ static void scx_disable_bypass_dsp(struct scx_sched *sch)
*
* - scx_prio_less() reverts to the default core_sched_at order.
*/
-static void scx_bypass(struct scx_sched *sch, bool bypass)
+void scx_bypass(struct scx_sched *sch, bool bypass)
{
struct scx_sched *pos;
unsigned long flags;
@@ -5746,7 +5726,7 @@ static void refresh_watchdog(void)
cancel_delayed_work_sync(&scx_watchdog_work);
}
-static s32 scx_link_sched(struct scx_sched *sch)
+s32 scx_link_sched(struct scx_sched *sch)
{
const char *err_msg = "";
s32 ret = 0;
@@ -5795,7 +5775,7 @@ static s32 scx_link_sched(struct scx_sched *sch)
return 0;
}
-static void scx_unlink_sched(struct scx_sched *sch)
+void scx_unlink_sched(struct scx_sched *sch)
{
scoped_guard(raw_spinlock_irq, &scx_sched_lock) {
#ifdef CONFIG_EXT_SUB_SCHED
@@ -5816,13 +5796,13 @@ static void scx_unlink_sched(struct scx_sched *sch)
* @sch. Once @sch becomes empty during disable, there's no point in dumping it.
* This prevents calling dump ops on a dead sch.
*/
-static void scx_disable_dump(struct scx_sched *sch)
+void scx_disable_dump(struct scx_sched *sch)
{
guard(raw_spinlock_irqsave)(&scx_dump_lock);
sch->dump_disabled = true;
}
-static void scx_log_sched_disable(struct scx_sched *sch)
+void scx_log_sched_disable(struct scx_sched *sch)
{
struct scx_exit_info *ei = sch->exit_info;
const char *type = scx_parent(sch) ? "sub-scheduler" : "scheduler";
@@ -6285,7 +6265,7 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
* as a noop. Syncing the irq_work first is required to guarantee the
* kthread work has been queued before waiting for it.
*/
-static void scx_flush_disable_work(struct scx_sched *sch)
+void scx_flush_disable_work(struct scx_sched *sch)
{
int kind;
@@ -6739,31 +6719,13 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
return pnode;
}
-/*
- * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
- * starvation. During the READY -> ENABLED task switching loop, the calling
- * thread's sched_class gets switched from fair to ext. As fair has higher
- * priority than ext, the calling thread can be indefinitely starved under
- * fair-class saturation, leading to a system hang.
- */
-struct scx_enable_cmd {
- struct kthread_work work;
- union {
- struct sched_ext_ops *ops;
- struct sched_ext_ops_cid *ops_cid;
- };
- bool is_cid_type;
- struct bpf_map *arena_map; /* arena ref to transfer to sch */
- int ret;
-};
-
/*
* Allocate and initialize a new scx_sched. @cgrp's reference is always
* consumed whether the function succeeds or fails.
*/
-static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
- struct cgroup *cgrp,
- struct scx_sched *parent)
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+ struct cgroup *cgrp,
+ struct scx_sched *parent)
{
struct sched_ext_ops *ops = cmd->ops;
struct scx_sched *sch;
@@ -7007,7 +6969,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
return 0;
}
-static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
{
/*
* It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
@@ -9342,7 +9304,7 @@ __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime,
return true;
}
-static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
+void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
{
struct rq *this_rq;
unsigned long irq_flags;
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 743980dc60b0..c4a910d2ca91 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1535,6 +1535,41 @@ enum scx_ops_state {
#define SCX_OPSS_STATE_MASK ((1LU << SCX_OPSS_QSEQ_SHIFT) - 1)
#define SCX_OPSS_QSEQ_MASK (~SCX_OPSS_STATE_MASK)
+/*
+ * SCX task iterator.
+ */
+struct scx_task_iter {
+ struct sched_ext_entity cursor;
+ struct task_struct *locked_task;
+ struct rq *rq;
+ struct rq_flags rf;
+ u32 cnt;
+ bool list_locked;
+#ifdef CONFIG_EXT_SUB_SCHED
+ struct cgroup *cgrp;
+ struct cgroup_subsys_state *css_pos;
+ struct css_task_iter css_iter;
+#endif
+};
+
+/*
+ * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
+ * starvation. During the READY -> ENABLED task switching loop, the calling
+ * thread's sched_class gets switched from fair to ext. As fair has higher
+ * priority than ext, the calling thread can be indefinitely starved under
+ * fair-class saturation, leading to a system hang.
+ */
+struct scx_enable_cmd {
+ struct kthread_work work;
+ union {
+ struct sched_ext_ops *ops;
+ struct sched_ext_ops_cid *ops_cid;
+ };
+ bool is_cid_type;
+ struct bpf_map *arena_map; /* arena ref to transfer to sch */
+ int ret;
+};
+
extern struct scx_sched __rcu *scx_root;
DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
@@ -1555,6 +1590,51 @@ __printf(5, 0) bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
__printf(5, 6) bool __scx_exit(struct scx_sched *sch, enum scx_exit_kind kind,
s64 exit_code, s32 exit_cpu, const char *fmt, ...);
+u32 scx_get_task_state(const struct task_struct *p);
+void scx_set_task_state(struct task_struct *p, u32 state);
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp);
+void scx_task_iter_unlock(struct scx_task_iter *iter);
+void scx_task_iter_stop(struct scx_task_iter *iter);
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter);
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+ struct scx_dispatch_q *dsq, u64 enq_flags);
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq);
+bool scx_rq_online(struct rq *rq);
+void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq);
+void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+ u64 reenq_flags, struct rq *locked_rq);
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork);
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p);
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p);
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
+void scx_cgroup_lock(void);
+void scx_cgroup_unlock(void);
+#endif
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch);
+void scx_disable_bypass_dsp(struct scx_sched *sch);
+void scx_bypass(struct scx_sched *sch, bool bypass);
+s32 scx_link_sched(struct scx_sched *sch);
+void scx_unlink_sched(struct scx_sched *sch);
+void scx_disable_dump(struct scx_sched *sch);
+void scx_log_sched_disable(struct scx_sched *sch);
+void scx_flush_disable_work(struct scx_sched *sch);
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+ struct cgroup *cgrp,
+ struct scx_sched *parent);
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
+
+extern raw_spinlock_t scx_sched_lock;
+extern struct mutex scx_enable_mutex;
+extern struct percpu_rw_semaphore scx_fork_rwsem;
+#ifdef CONFIG_EXT_SUB_SCHED
+extern const struct rhashtable_params scx_sched_hash_params;
+extern struct rhashtable scx_sched_hash;
+extern struct scx_sched *scx_enabling_sub_sched;
+#endif
+
#define scx_exit(sch, kind, exit_code, fmt, args...) \
__scx_exit(sch, kind, exit_code, raw_smp_processor_id(), fmt, ##args)
#define scx_error(sch, fmt, args...) \
--
2.54.0
* [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across the sub.c split
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The following trivial helpers in ext.c are called from both ext.c and the
sub-scheduler code. Define them as static inline in internal.h.
- scx_bypass_dsq()
- scx_bypass_dsp_enabled()
- scx_ops_sanitize_err()
- scx_schedule_reenq_local()
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
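As a reminder of why scx_ops_sanitize_err() range-checks the value, here is a
rough userspace sketch of the same check. It assumes MAX_ERRNO is 4095 as in
the kernel; err_ptr(), is_err() and sanitize_err() are hypothetical stand-ins
for ERR_PTR(), IS_ERR() and the helper itself, and fprintf() stands in for
scx_error().

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ERRNO 4095	/* same bound the kernel uses for ERR_PTR() encoding */

/* Userspace stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
static inline void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Mirrors the range check in scx_ops_sanitize_err(). The fprintf() is a
 * hypothetical stand-in for scx_error(); it only reports the bad value.
 */
static inline int sanitize_err(const char *ops_name, int err)
{
	if (err < 0 && err >= -MAX_ERRNO)
		return err;

	fprintf(stderr, "ops.%s() returned an invalid errno %d\n", ops_name, err);
	return -EPROTO;
}
```

A value such as -5000 lies outside the encodable range, so is_err() on the
encoded pointer is false and a caller would go on to treat it as a valid
pointer. After sanitizing, the value becomes -EPROTO, which encodes cleanly.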
kernel/sched/ext/ext.c | 57 -------------------------------------
kernel/sched/ext/internal.h | 57 +++++++++++++++++++++++++++++++++++++
2 files changed, 57 insertions(+), 57 deletions(-)
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index bdbc66466962..d1ef79c1038d 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -368,11 +368,6 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
return __setscheduler_class(p->policy, p->prio);
}
-static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
-{
- return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
-}
-
static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu)
{
#ifdef CONFIG_EXT_SUB_SCHED
@@ -394,26 +389,6 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
return scx_bypass_dsq(sch, cpu);
}
-/**
- * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
- * @sch: scheduler to check
- *
- * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
- * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
- * are bypassing. In the former case, the ancestor is not itself bypassing but
- * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
- * the ancestor's bypass dispatch path must be active even though its own
- * bypass_depth remains zero.
- *
- * This function checks bypass_dsp_enable_depth which is managed separately from
- * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * scx_disable_bypass_dsp().
- */
-static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
-{
- return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
-}
-
/**
* rq_is_open - Is the rq available for immediate execution of an SCX task?
* @rq: rq to test
@@ -1060,28 +1035,6 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
}
}
-/**
- * scx_ops_sanitize_err - Sanitize a -errno value
- * @sch: scx_sched to error out on error
- * @ops_name: operation to blame on failure
- * @err: -errno value to sanitize
- *
- * Verify @err is a valid -errno. If not, trigger scx_error() and return
- * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
- * cause misbehaviors. For an example, a large negative return from
- * ops.init_task() triggers an oops when passed up the call chain because the
- * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
- * handled as a pointer.
- */
-static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
-{
- if (err < 0 && err >= -MAX_ERRNO)
- return err;
-
- scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
- return -EPROTO;
-}
-
static void deferred_bal_cb_workfn(struct rq *rq)
{
run_deferred(rq);
@@ -1233,16 +1186,6 @@ void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
schedule_deferred(rq);
}
-static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
-{
- struct scx_sched *root = rcu_dereference_sched(scx_root);
-
- if (WARN_ON_ONCE(!root))
- return;
-
- schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
-}
-
/**
* touch_core_sched - Update timestamp used for core-sched task ordering
* @rq: rq to read clock from, must be locked
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index c4a910d2ca91..c3b97ea4ae79 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1640,6 +1640,63 @@ extern struct scx_sched *scx_enabling_sub_sched;
#define scx_error(sch, fmt, args...) \
scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
+static inline struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
+{
+ return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
+}
+
+/**
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * @sch: scheduler to check
+ *
+ * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
+ * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
+ * are bypassing. In the former case, the ancestor is not itself bypassing but
+ * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
+ * the ancestor's bypass dispatch path must be active even though its own
+ * bypass_depth remains zero.
+ *
+ * This function checks bypass_dsp_enable_depth which is managed separately from
+ * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
+ * scx_disable_bypass_dsp().
+ */
+static inline bool scx_bypass_dsp_enabled(struct scx_sched *sch)
+{
+ return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
+}
+
+/**
+ * scx_ops_sanitize_err - Sanitize a -errno value
+ * @sch: scx_sched to error out on error
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For an example, a large negative return from
+ * ops.init_task() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static inline int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+{
+ if (err < 0 && err >= -MAX_ERRNO)
+ return err;
+
+ scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
+ return -EPROTO;
+}
+
+static inline void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+{
+ struct scx_sched *root = rcu_dereference_sched(scx_root);
+
+ if (WARN_ON_ONCE(!root))
+ return;
+
+ schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
+}
+
/*
* Return the rq currently locked from an scx callback, or NULL if no rq is
* locked.
--
2.54.0
* [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
The sub-scheduler implementation has grown and will continue to expand. Move
the sub-scheduler functions from ext.c into a new kernel/sched/ext/sub.c.
sub.h holds the prototypes and the !CONFIG_EXT_SUB_SCHED no-op stubs.
scx_dispatch_sched() is shared: balance_one() in ext.c and the
scx_bpf_sub_dispatch() kfunc in sub.c both call it, and the latter re-enters
it as sub-scheduler dispatch nests. It moves into sub.h as a static
__always_inline so both callers keep it inlined and per-level stack stays
bounded across the recursion. The event macros it uses move to internal.h.
No functional change.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
v2: Fold the scx_dispatch_sched() sub.h promotion into this patch (was a
separate later patch in v1) so the split is self-contained (Andrea).
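Among the code this patch moves out of ext.c, the pre-order walk behind
scx_for_each_descendant_pre() is probably the least obvious piece. The sketch
below shows the same traversal on a simplified tree. It is an approximation
under stated assumptions: struct node and its first_child/next_sibling links
are hypothetical and replace the list_head based children/sibling lists of
struct scx_sched, and the lockdep assertions are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct scx_sched: only the tree links. */
struct node {
	int id;
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/* Link @child as the last child of @parent. */
static void node_add_child(struct node *parent, struct node *child)
{
	struct node **link = &parent->first_child;

	while (*link)
		link = &(*link)->next_sibling;
	child->parent = parent;
	child->next_sibling = NULL;
	*link = child;
}

/* Return the node to visit after @pos, or NULL when the walk is done. */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	/* first iteration: visit @root itself */
	if (!pos)
		return root;

	/* descend to the first child if there is one */
	if (pos->first_child)
		return pos->first_child;

	/* no child: take the next sibling of @pos or of its closest ancestor */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}

/* Record the visit order into @out and return the number of nodes visited. */
static size_t walk_pre(struct node *root, int *out, size_t cap)
{
	size_t n = 0;
	struct node *pos;

	for (pos = next_descendant_pre(NULL, root); pos && n < cap;
	     pos = next_descendant_pre(pos, root))
		out[n++] = pos->id;
	return n;
}
```

Stopping the climb at @root is what keeps a walk that starts at a sub-tree
from escaping into that sub-tree's siblings.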
kernel/sched/build_policy.c | 2 +
kernel/sched/ext/ext.c | 811 +-----------------------------------
kernel/sched/ext/internal.h | 28 ++
kernel/sched/ext/sub.c | 668 +++++++++++++++++++++++++++++
kernel/sched/ext/sub.h | 161 +++++++
5 files changed, 860 insertions(+), 810 deletions(-)
create mode 100644 kernel/sched/ext/sub.c
create mode 100644 kernel/sched/ext/sub.h
diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d74b54f81992..01dc7bf89af8 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -66,10 +66,12 @@
# include "ext/cid.h"
# include "ext/arena.h"
# include "ext/idle.h"
+# include "ext/sub.h"
# include "ext/ext.c"
# include "ext/cid.c"
# include "ext/arena.c"
# include "ext/idle.c"
+# include "ext/sub.c"
#endif
#include "syscalls.c"
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index d1ef79c1038d..1a0ec985da77 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -19,6 +19,7 @@
#include "cid.h"
#include "arena.h"
#include "idle.h"
+#include "sub.h"
DEFINE_RAW_SPINLOCK(scx_sched_lock);
@@ -271,58 +272,6 @@ static bool u32_before(u32 a, u32 b)
return (s32)(a - b) < 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_next_descendant_pre - find the next descendant for pre-order walk
- * @pos: the current position (%NULL to initiate traversal)
- * @root: sched whose descendants to walk
- *
- * To be used by scx_for_each_descendant_pre(). Find the next descendant to
- * visit for pre-order traversal of @root's descendants. @root is included in
- * the iteration and the first node to be visited.
- */
-static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos,
- struct scx_sched *root)
-{
- struct scx_sched *next;
-
- lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
- lockdep_is_held(&scx_sched_lock));
-
- /* if first iteration, visit @root */
- if (!pos)
- return root;
-
- /* visit the first child if exists */
- next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
- if (next)
- return next;
-
- /* no child, visit my or the closest ancestor's next sibling */
- while (pos != root) {
- if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
- return list_next_entry(pos, sibling);
- pos = scx_parent(pos);
- }
-
- return NULL;
-}
-
-static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
-{
- return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
- scx_sched_hash_params);
-}
-
-static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
-{
- rcu_assign_pointer(p->scx.sched, sch);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
-static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
/**
* scx_is_descendant - Test whether sched is a descendant
* @sch: sched to test
@@ -337,19 +286,6 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
return sch->ancestors[ancestor->level] == ancestor;
}
-/**
- * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
- * @pos: iteration cursor
- * @root: sched to walk the descendants of
- *
- * Walk @root's descendants. @root is included in the iteration and the first
- * node to be visited. Must be called with either scx_enable_mutex or
- * scx_sched_lock held.
- */
-#define scx_for_each_descendant_pre(pos, root) \
- for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
- (pos) = scx_next_descendant_pre((pos), (root)))
-
static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu)
{
return &sch->pnode[cpu_to_node(cpu)]->global_dsq;
@@ -935,32 +871,6 @@ struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
return NULL;
}
-/**
- * scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This can be used when preemption is not disabled.
- */
-#define scx_add_event(sch, name, cnt) do { \
- this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
- trace_sched_ext_event(#name, (cnt)); \
-} while(0)
-
-/**
- * __scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This should be used only when preemption is disabled.
- */
-#define __scx_add_event(sch, name, cnt) do { \
- __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
- trace_sched_ext_event(#name, cnt); \
-} while(0)
-
/**
* scx_dump_event - Dump an event 'kind' in 'events' to 's'
* @s: output seq_buf
@@ -2681,115 +2591,6 @@ static inline void maybe_queue_balance_callback(struct rq *rq)
rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
}
-/*
- * One user of this function is scx_bpf_dispatch() which can be called
- * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
- * from the call frame.
- */
-static __always_inline bool
-scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
- struct task_struct *prev, bool nested)
-{
- struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
- int nr_loops = SCX_DSP_MAX_LOOPS;
- s32 cpu = cpu_of(rq);
- bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
- scx_task_on_sched(sch, prev);
-
- if (scx_consume_global_dsq(sch, rq))
- return true;
-
- if (scx_bypass_dsp_enabled(sch)) {
- /* if @sch is bypassing, only the bypass DSQs are active */
- if (scx_bypassing(sch, cpu))
- return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
-#ifdef CONFIG_EXT_SUB_SCHED
- /*
- * If @sch isn't bypassing but its children are, @sch is
- * responsible for making forward progress for both its own
- * tasks that aren't bypassing and the bypassing descendants'
- * tasks. The following implements a simple built-in behavior -
- * let each CPU try to run the bypass DSQ every Nth time.
- *
- * Later, if necessary, we can add an ops flag to suppress the
- * auto-consumption and a kfunc to consume the bypass DSQ and,
- * so that the BPF scheduler can fully control scheduling of
- * bypassed tasks.
- */
- struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
-
- if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
- scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
- __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
- return true;
- }
-#endif /* CONFIG_EXT_SUB_SCHED */
- }
-
- if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
- return false;
-
- dspc->rq = rq;
-
- /*
- * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
- * lock, the local DSQ might still end up empty after a successful
- * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
- * produced some tasks, retry. The BPF scheduler may depend on this
- * looping behavior to simplify its implementation.
- */
- do {
- dspc->nr_tasks = 0;
-
- if (nested) {
- SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
- prev_on_sch ? prev : NULL);
- } else {
- /* stash @prev so that nested invocations can access it */
- rq->scx.sub_dispatch_prev = prev;
- SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
- prev_on_sch ? prev : NULL);
- rq->scx.sub_dispatch_prev = NULL;
- }
-
- scx_flush_dispatch_buf(sch, rq);
-
- if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
- rq->scx.flags |= SCX_RQ_BAL_KEEP;
- return true;
- }
- if (rq->scx.local_dsq.nr)
- return true;
- if (scx_consume_global_dsq(sch, rq))
- return true;
-
- /*
- * ops.dispatch() can trap us in this loop by repeatedly
- * dispatching ineligible tasks. Break out once in a while to
- * allow the watchdog to run. As IRQ can't be enabled in
- * balance(), we want to complete this scheduling cycle and then
- * start a new one. IOW, we want to call resched_curr() on the
- * next, most likely idle, task, not the current one. Use
- * scx_kick_cpu() for deferred kicking.
- */
- if (unlikely(!--nr_loops)) {
- scx_kick_cpu(sch, cpu, 0);
- break;
- }
- } while (dspc->nr_tasks);
-
- /*
- * Prevent the CPU from going idle while bypassed descendants have tasks
- * queued. Without this fallback, bypassed tasks could stall if the host
- * scheduler's ops.dispatch() doesn't yield any tasks.
- */
- if (scx_bypass_dsp_enabled(sch))
- return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
- return false;
-}
-
static int balance_one(struct rq *rq, struct task_struct *prev)
{
struct scx_sched *sch = scx_root;
@@ -4469,26 +4270,6 @@ static inline void scx_cgroup_lock(void) {}
static inline void scx_cgroup_unlock(void) {}
#endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */
-#ifdef CONFIG_EXT_SUB_SCHED
-static struct cgroup *sch_cgroup(struct scx_sched *sch)
-{
- return sch->cgrp;
-}
-
-/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
-static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
-{
- struct cgroup *pos;
- struct cgroup_subsys_state *css;
-
- cgroup_for_each_live_descendant_pre(pos, css, cgrp)
- rcu_assign_pointer(pos->scx_sched, sch);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
-static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
/*
* Omitted operations:
*
@@ -5765,202 +5546,6 @@ void scx_log_sched_disable(struct scx_sched *sch)
}
}
-#ifdef CONFIG_EXT_SUB_SCHED
-static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
-
-static void drain_descendants(struct scx_sched *sch)
-{
- /*
- * Child scheds that finished the critical part of disabling will take
- * themselves off @sch->children. Wait for it to drain. As propagation
- * is recursive, empty @sch->children means that all proper descendant
- * scheds have reached the unlinking stage.
- */
- wait_event(scx_unlink_waitq, list_empty(&sch->children));
-}
-
-static void scx_fail_parent(struct scx_sched *sch,
- struct task_struct *failed, s32 fail_code)
-{
- struct scx_sched *parent = scx_parent(sch);
- struct scx_task_iter sti;
- struct task_struct *p;
-
- scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
- fail_code, failed->comm, failed->pid);
-
- /*
- * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
- * it. This may cause downstream failures on the BPF side but $parent is
- * dying anyway.
- */
- scx_bypass(parent, true);
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- if (scx_task_on_sched(parent, p))
- continue;
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- scx_disable_and_exit_task(sch, p);
- scx_set_task_sched(p, parent);
- }
- }
- scx_task_iter_stop(&sti);
-}
-
-static void scx_sub_disable(struct scx_sched *sch)
-{
- struct scx_sched *parent = scx_parent(sch);
- struct scx_task_iter sti;
- struct task_struct *p;
- int ret;
-
- /*
- * Guarantee forward progress and wait for descendants to be disabled.
- * To limit disruptions, $parent is not bypassed. Tasks are fully
- * prepped and then inserted back into $parent.
- */
- scx_bypass(sch, true);
- drain_descendants(sch);
-
- /*
- * Here, every runnable task is guaranteed to make forward progress and
- * we can safely use blocking synchronization constructs. Actually
- * disable ops.
- */
- mutex_lock(&scx_enable_mutex);
- percpu_down_write(&scx_fork_rwsem);
- scx_cgroup_lock();
-
- set_cgroup_sched(sch_cgroup(sch), parent);
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- struct rq *rq;
- struct rq_flags rf;
-
- /* filter out duplicate visits */
- if (scx_task_on_sched(parent, p))
- continue;
-
- /*
- * By the time control reaches here, all descendant schedulers
- * should already have been disabled.
- */
- WARN_ON_ONCE(!scx_task_on_sched(sch, p));
-
- /*
- * @p is pinned by the iter: css_task_iter_next() takes a
- * reference and holds it until the next iter_next() call, so
- * @p->usage is guaranteed > 0.
- */
- get_task_struct(p);
-
- scx_task_iter_unlock(&sti);
-
- /*
- * $p is READY or ENABLED on @sch. Initialize for $parent,
- * disable and exit from @sch, and then switch over to $parent.
- *
- * If a task fails to initialize for $parent, the only available
- * action is disabling $parent too. While this allows disabling
- * of a child sched to cause the parent scheduler to fail, the
- * failure can only originate from ops.init_task() of the
- * parent. A child can't directly affect the parent through its
- * own failures.
- */
- ret = __scx_init_task(parent, p, false);
- if (ret) {
- scx_fail_parent(sch, p, ret);
- put_task_struct(p);
- break;
- }
-
- rq = task_rq_lock(p, &rf);
-
- if (scx_get_task_state(p) == SCX_TASK_DEAD) {
- /*
- * sched_ext_dead() raced us between __scx_init_task()
- * and this rq lock and ran exit_task() on @sch (the
- * sched @p was on at that point), not on $parent.
- * $parent's just-completed init is owed an exit_task()
- * and we issue it here.
- */
- scx_sub_init_cancel_task(parent, p);
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- continue;
- }
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- /*
- * $p is initialized for $parent and still attached to
- * @sch. Disable and exit for @sch, switch over to
- * $parent, override the state to READY to account for
- * $p having already been initialized, and then enable.
- */
- scx_disable_and_exit_task(sch, p);
- scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
- scx_set_task_state(p, SCX_TASK_INIT);
- scx_set_task_sched(p, parent);
- scx_set_task_state(p, SCX_TASK_READY);
- scx_enable_task(parent, p);
- }
-
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- }
- scx_task_iter_stop(&sti);
-
- scx_disable_dump(sch);
-
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-
- /*
- * All tasks are moved off of @sch but there may still be on-going
- * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
- * the expedited version as ancestors may be waiting in bypass mode.
- * Also, tell the parent that there is no need to keep running bypass
- * DSQs for us.
- */
- synchronize_rcu_expedited();
- scx_disable_bypass_dsp(sch);
-
- scx_unlink_sched(sch);
-
- mutex_unlock(&scx_enable_mutex);
-
- /*
- * @sch is now unlinked from the parent's children list. Notify and call
- * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
- * after unlinking and releasing all locks. See scx_claim_exit().
- */
- wake_up_all(&scx_unlink_waitq);
-
- if (parent->ops.sub_detach && sch->sub_attached) {
- struct scx_sub_detach_args sub_detach_args = {
- .ops = &sch->ops,
- .cgroup_path = sch->cgrp_path,
- };
- SCX_CALL_OP(parent, sub_detach, NULL,
- &sub_detach_args);
- }
-
- scx_log_sched_disable(sch);
-
- if (sch->ops.exit)
- SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
- if (sch->sub_kset)
- kobject_del(&sch->sub_kset->kobj);
- kobject_del(&sch->kobj);
-}
-#else /* CONFIG_EXT_SUB_SCHED */
-static inline void drain_descendants(struct scx_sched *sch) { }
-static inline void scx_sub_disable(struct scx_sched *sch) { }
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static void scx_root_disable(struct scx_sched *sch)
{
struct scx_task_iter sti;
@@ -7350,347 +6935,6 @@ static void scx_root_enable_workfn(struct kthread_work *work)
cmd->ret = 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/* verify that a scheduler can be attached to @cgrp and return the parent */
-static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
-{
- struct scx_sched *parent = cgrp->scx_sched;
- struct scx_sched *pos;
-
- lockdep_assert_held(&scx_sched_lock);
-
- /* can't attach twice to the same cgroup */
- if (parent->cgrp == cgrp)
- return ERR_PTR(-EBUSY);
-
- /* does $parent allow sub-scheds? */
- if (!parent->ops.sub_attach)
- return ERR_PTR(-EOPNOTSUPP);
-
- /* can't insert between $parent and its exiting children */
- list_for_each_entry(pos, &parent->children, sibling)
- if (cgroup_is_descendant(pos->cgrp, cgrp))
- return ERR_PTR(-EBUSY);
-
- return parent;
-}
-
-static bool assert_task_ready_or_enabled(struct task_struct *p)
-{
- u32 state = scx_get_task_state(p);
-
- switch (state) {
- case SCX_TASK_READY:
- case SCX_TASK_ENABLED:
- return true;
- default:
- WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
- state, p->comm, p->pid);
- return false;
- }
-}
-
-static void scx_sub_enable_workfn(struct kthread_work *work)
-{
- struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
- struct sched_ext_ops *ops = cmd->ops;
- struct cgroup *cgrp;
- struct scx_sched *parent, *sch;
- struct scx_task_iter sti;
- struct task_struct *p;
- s32 i, ret;
-
- mutex_lock(&scx_enable_mutex);
-
- if (!scx_enabled()) {
- ret = -ENODEV;
- goto out_unlock;
- }
-
- /* See scx_root_enable_workfn() for the @ops->priv check. */
- if (rcu_access_pointer(ops->priv)) {
- ret = -EBUSY;
- goto out_unlock;
- }
-
- cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
- if (IS_ERR(cgrp)) {
- ret = PTR_ERR(cgrp);
- goto out_unlock;
- }
-
- raw_spin_lock_irq(&scx_sched_lock);
- parent = find_parent_sched(cgrp);
- if (IS_ERR(parent)) {
- raw_spin_unlock_irq(&scx_sched_lock);
- ret = PTR_ERR(parent);
- goto out_put_cgrp;
- }
- kobject_get(&parent->kobj);
- raw_spin_unlock_irq(&scx_sched_lock);
-
- /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
- sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
- kobject_put(&parent->kobj);
- if (IS_ERR(sch)) {
- ret = PTR_ERR(sch);
- goto out_unlock;
- }
-
- ret = scx_link_sched(sch);
- if (ret)
- goto err_disable;
-
- if (sch->level >= SCX_SUB_MAX_DEPTH) {
- scx_error(sch, "max nesting depth %d violated",
- SCX_SUB_MAX_DEPTH);
- goto err_disable;
- }
-
- if (sch->ops.init) {
- ret = SCX_CALL_OP_RET(sch, init, NULL);
- if (ret) {
- ret = scx_ops_sanitize_err(sch, "init", ret);
- scx_error(sch, "ops.init() failed (%d)", ret);
- goto err_disable;
- }
- sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
- }
-
- ret = scx_arena_pool_init(sch);
- if (ret)
- goto err_disable;
-
- ret = scx_set_cmask_scratch_alloc(sch);
- if (ret)
- goto err_disable;
-
- if (scx_validate_ops(sch, ops))
- goto err_disable;
-
- struct scx_sub_attach_args sub_attach_args = {
- .ops = &sch->ops,
- .cgroup_path = sch->cgrp_path,
- };
-
- ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
- &sub_attach_args);
- if (ret) {
- ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
- scx_error(sch, "parent rejected (%d)", ret);
- goto err_disable;
- }
- sch->sub_attached = true;
-
- scx_bypass(sch, true);
-
- for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
- if (((void (**)(void))ops)[i])
- set_bit(i, sch->has_op);
-
- percpu_down_write(&scx_fork_rwsem);
- scx_cgroup_lock();
-
- /*
- * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
- * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
- */
- set_cgroup_sched(sch_cgroup(sch), sch);
- if (!(cgrp->self.flags & CSS_ONLINE)) {
- scx_error(sch, "cgroup is not online");
- goto err_unlock_and_disable;
- }
-
- /*
- * Initialize tasks for the new child $sch without exiting them for
- * $parent so that the tasks can always be reverted back to $parent
- * sched on child init failure.
- */
- WARN_ON_ONCE(scx_enabling_sub_sched);
- scx_enabling_sub_sched = sch;
-
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- struct rq *rq;
- struct rq_flags rf;
-
- /*
- * Task iteration may visit the same task twice when racing
- * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
- * finished __scx_init_task() and skip if set.
- *
- * A task may exit and get freed between __scx_init_task()
- * completion and scx_enable_task(). In such cases,
- * scx_disable_and_exit_task() must exit the task for both the
- * parent and child scheds.
- */
- if (p->scx.flags & SCX_TASK_SUB_INIT)
- continue;
-
- /* @p is pinned by the iter; see scx_sub_disable() */
- get_task_struct(p);
-
- if (!assert_task_ready_or_enabled(p)) {
- ret = -EINVAL;
- goto abort;
- }
-
- scx_task_iter_unlock(&sti);
-
- /*
- * As $p is still on $parent, it can't be transitioned to INIT.
- * Let's worry about task state later. Use __scx_init_task().
- */
- ret = __scx_init_task(sch, p, false);
- if (ret)
- goto abort;
-
- rq = task_rq_lock(p, &rf);
-
- if (scx_get_task_state(p) == SCX_TASK_DEAD) {
- /*
- * sched_ext_dead() raced us between __scx_init_task()
- * and this rq lock and ran exit_task() on $parent (the
- * sched @p was on at that point), not on @sch. @sch's
- * just-completed init is owed an exit_task() and we
- * issue it here.
- */
- scx_sub_init_cancel_task(sch, p);
- task_rq_unlock(rq, p, &rf);
- put_task_struct(p);
- continue;
- }
-
- p->scx.flags |= SCX_TASK_SUB_INIT;
- task_rq_unlock(rq, p, &rf);
-
- put_task_struct(p);
- }
- scx_task_iter_stop(&sti);
-
- /*
- * All tasks are prepped. Disable/exit tasks for $parent and enable for
- * the new @sch.
- */
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- /*
- * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
- * duplicate iterations.
- */
- if (!(p->scx.flags & SCX_TASK_SUB_INIT))
- continue;
-
- scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
- /*
- * $p must be either READY or ENABLED. If ENABLED,
- * __scx_disable_and_exit_task() first disables and
- * makes it READY. However, after exiting $p, it will
- * leave $p as READY.
- */
- assert_task_ready_or_enabled(p);
- __scx_disable_and_exit_task(parent, p);
-
- /*
- * $p is now only initialized for @sch and READY, which
- * is what we want. Assign it to @sch and enable.
- */
- scx_set_task_sched(p, sch);
- scx_enable_task(sch, p);
-
- p->scx.flags &= ~SCX_TASK_SUB_INIT;
- }
- }
- scx_task_iter_stop(&sti);
-
- scx_enabling_sub_sched = NULL;
-
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-
- scx_bypass(sch, false);
-
- pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
- kobject_uevent(&sch->kobj, KOBJ_ADD);
- ret = 0;
- goto out_unlock;
-
-out_put_cgrp:
- cgroup_put(cgrp);
-out_unlock:
- mutex_unlock(&scx_enable_mutex);
- cmd->ret = ret;
- return;
-
-abort:
- put_task_struct(p);
- scx_task_iter_stop(&sti);
-
- /*
- * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
- * ran for @sch on them, so calling scx_disable_task() here would invoke
- * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
- * must stay set until SUB_INIT is cleared from every marked task -
- * scx_disable_and_exit_task() reads it when a task exits concurrently.
- */
- scx_task_iter_start(&sti, sch->cgrp);
- while ((p = scx_task_iter_next_locked(&sti))) {
- if (p->scx.flags & SCX_TASK_SUB_INIT) {
- scx_sub_init_cancel_task(sch, p);
- p->scx.flags &= ~SCX_TASK_SUB_INIT;
- }
- }
- scx_task_iter_stop(&sti);
- scx_enabling_sub_sched = NULL;
-err_unlock_and_disable:
- /* we'll soon enter disable path, keep bypass on */
- scx_cgroup_unlock();
- percpu_up_write(&scx_fork_rwsem);
-err_disable:
- mutex_unlock(&scx_enable_mutex);
- scx_flush_disable_work(sch);
- cmd->ret = 0;
-}
-
-static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
- unsigned long action, void *data)
-{
- struct cgroup *cgrp = data;
- struct cgroup *parent = cgroup_parent(cgrp);
-
- if (!cgroup_on_dfl(cgrp))
- return NOTIFY_OK;
-
- switch (action) {
- case CGROUP_LIFETIME_ONLINE:
- /* inherit ->scx_sched from $parent */
- if (parent)
- rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
- break;
- case CGROUP_LIFETIME_OFFLINE:
- /* if there is a sched attached, shoot it down */
- if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
- scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
- SCX_ECODE_RSN_CGROUP_OFFLINE,
- "cgroup %llu going offline", cgroup_id(cgrp));
- break;
- }
-
- return NOTIFY_OK;
-}
-
-static struct notifier_block scx_cgroup_lifetime_nb = {
- .notifier_call = scx_cgroup_lifetime_notify,
-};
-
-static s32 __init scx_cgroup_lifetime_notifier_init(void)
-{
- return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
- &scx_cgroup_lifetime_nb);
-}
-core_initcall(scx_cgroup_lifetime_notifier_init);
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
{
static struct kthread_worker *helper;
@@ -7837,20 +7081,6 @@ static int bpf_scx_init_member(const struct btf_type *t,
return 0;
}
-#ifdef CONFIG_EXT_SUB_SCHED
-static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
-{
- struct scx_sched *sch;
-
- guard(rcu)();
- sch = scx_prog_sched(prog->aux);
- if (unlikely(!sch))
- return;
-
- scx_error(sch, "dispatch recursion detected");
-}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
static int bpf_scx_check_member(const struct btf_type *t,
const struct btf_member *member,
const struct bpf_prog *prog)
@@ -9021,45 +8251,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter,
p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
}
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
- * @cgroup_id: cgroup ID of the child scheduler to dispatch
- * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
- *
- * Allows a parent scheduler to trigger dispatching on one of its direct
- * child schedulers. The child scheduler runs its dispatch operation to
- * move tasks from dispatch queues to the local runqueue.
- *
- * Returns: true on success, false if cgroup_id is invalid, not a direct
- * child, or caller lacks dispatch permission.
- */
-__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
-{
- struct rq *this_rq = this_rq();
- struct scx_sched *parent, *child;
-
- guard(rcu)();
- parent = scx_prog_sched(aux);
- if (unlikely(!parent))
- return false;
-
- child = scx_find_sub_sched(cgroup_id);
-
- if (unlikely(!child))
- return false;
-
- if (unlikely(scx_parent(child) != parent)) {
- scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
- cgroup_id);
- return false;
- }
-
- return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
- true);
-}
-#endif /* CONFIG_EXT_SUB_SCHED */
-
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
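Note on the scx_bpf_sub_dispatch() removal above: the kfunc only lets a
scheduler trigger dispatch on one of its direct children, looked up by cgroup
ID. A minimal user-space sketch of that check follows. struct sched,
find_sched() and may_sub_dispatch() are hypothetical stand-ins, not kernel
APIs. The real code uses scx_find_sub_sched() and scx_parent(), and it also
raises scx_error() for a distant descendant, which the sketch reduces to
returning false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct scx_sched: an ID plus a parent link. */
struct sched {
	uint64_t cgroup_id;
	struct sched *parent;
};

/* Linear lookup by cgroup ID; the kernel uses an rhashtable instead. */
static struct sched *find_sched(struct sched *tbl, size_t n, uint64_t id)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].cgroup_id == id)
			return &tbl[i];
	return NULL;
}

/* True only if @id names an existing sched whose parent is @caller. */
static bool may_sub_dispatch(struct sched *tbl, size_t n,
			     const struct sched *caller, uint64_t id)
{
	struct sched *child = find_sched(tbl, n, id);

	if (!child)
		return false;	/* unknown cgroup ID */
	return child->parent == caller;	/* grandchildren are rejected */
}
```

With a root, a child and a grandchild, the root may dispatch the child but
not the grandchild; the grandchild has to be driven by its own parent.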
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index c3b97ea4ae79..f9fe7c6ebc4b 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -11,6 +11,34 @@
#include "../sched.h"
#include "types.h"
+#include <trace/events/sched_ext.h>
+
+/**
+ * scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This can be used when preemption is not disabled.
+ */
+#define scx_add_event(sch, name, cnt) do { \
+ this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
+ trace_sched_ext_event(#name, (cnt)); \
+} while (0)
+
+/**
+ * __scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of times the event occurred
+ *
+ * This should be used only when preemption is disabled.
+ */
+#define __scx_add_event(sch, name, cnt) do { \
+ __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \
+ trace_sched_ext_event(#name, (cnt)); \
+} while (0)
+
#define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
#define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void)))
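Note on the event macros moved into internal.h above: both take the counter
name as a bare struct member token, stringify it for the tracepoint, and wrap
the two statements in do { } while (0) so that the macro behaves as a single
statement under an unbraced if/else. A minimal user-space sketch of that
pattern follows. struct event_stats, trace_event(), add_event() and
count_bypass() are hypothetical stand-ins, and the plain += replaces the
per-CPU add.

```c
#include <stdio.h>

/* Hypothetical stand-in for struct scx_event_stats. */
struct event_stats {
	long bypass_dispatch;
	long keep_last;
};

/* Remembers the last traced name; stands in for the tracepoint. */
static const char *last_traced;

static void trace_event(const char *name, long cnt)
{
	last_traced = name;
	printf("event %s += %ld\n", name, cnt);
}

/* @name is a member of struct event_stats; #name turns it into a string. */
#define add_event(st, name, cnt) do {		\
	(st)->name += (cnt);			\
	trace_event(#name, (cnt));		\
} while (0)

/* The unbraced if/else only compiles because of the do/while wrapper. */
static long count_bypass(struct event_stats *st, int bypassing)
{
	if (bypassing)
		add_event(st, bypass_dispatch, 1);
	else
		add_event(st, keep_last, 1);
	return st->bypass_dispatch;
}
```

Note that @cnt is expanded twice, as in the kernel macros, so callers should
pass an expression without side effects.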
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
new file mode 100644
index 000000000000..050420427273
--- /dev/null
+++ b/kernel/sched/ext/sub.c
@@ -0,0 +1,668 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * A sub-scheduler is an scx_sched attached to a cgroup subtree under another
+ * scx_sched. This file holds the sub-scheduler implementation: the scheduler
+ * tree walk, capability delegation, per-shard cap state and its sync, and the
+ * sub-scheduler enable/disable paths. The core dispatch/enqueue machinery it
+ * builds on lives in ext.c.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/rhashtable.h>
+#include "internal.h"
+#include "cid.h"
+#include "arena.h"
+#include "sub.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+/**
+ * scx_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @root: sched whose descendants to walk
+ *
+ * To be used by scx_for_each_descendant_pre(). Find the next descendant to
+ * visit for pre-order traversal of @root's descendants. @root is included in
+ * the iteration and is the first node to be visited.
+ */
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root)
+{
+ struct scx_sched *next;
+
+ lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
+ lockdep_is_held(&scx_sched_lock));
+
+ /* if first iteration, visit @root */
+ if (!pos)
+ return root;
+
+ /* visit the first child if it exists */
+ next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
+ if (next)
+ return next;
+
+ /* no child, visit my or the closest ancestor's next sibling */
+ while (pos != root) {
+ if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
+ return list_next_entry(pos, sibling);
+ pos = scx_parent(pos);
+ }
+
+ return NULL;
+}
+
+static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
+{
+ return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
+ scx_sched_hash_params);
+}
+
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
+{
+ rcu_assign_pointer(p->scx.sched, sch);
+}
+
+struct cgroup *sch_cgroup(struct scx_sched *sch)
+{
+ return sch->cgrp;
+}
+
+/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
+{
+ struct cgroup *pos;
+ struct cgroup_subsys_state *css;
+
+ cgroup_for_each_live_descendant_pre(pos, css, cgrp)
+ rcu_assign_pointer(pos->scx_sched, sch);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
+
+void drain_descendants(struct scx_sched *sch)
+{
+ /*
+ * Child scheds that finished the critical part of disabling will take
+ * themselves off @sch->children. Wait for it to drain. As propagation
+ * is recursive, empty @sch->children means that all proper descendant
+ * scheds have reached the unlinking stage.
+ */
+ wait_event(scx_unlink_waitq, list_empty(&sch->children));
+}
+
+static void scx_fail_parent(struct scx_sched *sch,
+ struct task_struct *failed, s32 fail_code)
+{
+ struct scx_sched *parent = scx_parent(sch);
+ struct scx_task_iter sti;
+ struct task_struct *p;
+
+ scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
+ fail_code, failed->comm, failed->pid);
+
+ /*
+ * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
+ * it. This may cause downstream failures on the BPF side but $parent is
+ * dying anyway.
+ */
+ scx_bypass(parent, true);
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ if (scx_task_on_sched(parent, p))
+ continue;
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ scx_disable_and_exit_task(sch, p);
+ scx_set_task_sched(p, parent);
+ }
+ }
+ scx_task_iter_stop(&sti);
+}
+
+void scx_sub_disable(struct scx_sched *sch)
+{
+ struct scx_sched *parent = scx_parent(sch);
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ int ret;
+
+ /*
+ * Guarantee forward progress and wait for descendants to be disabled.
+ * To limit disruptions, $parent is not bypassed. Tasks are fully
+ * prepped and then inserted back into $parent.
+ */
+ scx_bypass(sch, true);
+ drain_descendants(sch);
+
+ /*
+ * Here, every runnable task is guaranteed to make forward progress and
+ * we can safely use blocking synchronization constructs. Actually
+ * disable ops.
+ */
+ mutex_lock(&scx_enable_mutex);
+ percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
+
+ set_cgroup_sched(sch_cgroup(sch), parent);
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ /* filter out duplicate visits */
+ if (scx_task_on_sched(parent, p))
+ continue;
+
+ /*
+ * By the time control reaches here, all descendant schedulers
+ * should already have been disabled.
+ */
+ WARN_ON_ONCE(!scx_task_on_sched(sch, p));
+
+ /*
+ * @p is pinned by the iter: css_task_iter_next() takes a
+ * reference and holds it until the next iter_next() call, so
+ * @p->usage is guaranteed > 0.
+ */
+ get_task_struct(p);
+
+ scx_task_iter_unlock(&sti);
+
+ /*
+ * $p is READY or ENABLED on @sch. Initialize for $parent,
+ * disable and exit from @sch, and then switch over to $parent.
+ *
+ * If a task fails to initialize for $parent, the only available
+ * action is disabling $parent too. While this allows disabling
+ * of a child sched to cause the parent scheduler to fail, the
+ * failure can only originate from ops.init_task() of the
+ * parent. A child can't directly affect the parent through its
+ * own failures.
+ */
+ ret = __scx_init_task(parent, p, false);
+ if (ret) {
+ scx_fail_parent(sch, p, ret);
+ put_task_struct(p);
+ break;
+ }
+
+ rq = task_rq_lock(p, &rf);
+
+ if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+ /*
+ * sched_ext_dead() raced us between __scx_init_task()
+ * and this rq lock and ran exit_task() on @sch (the
+ * sched @p was on at that point), not on $parent.
+ * $parent's just-completed init is owed an exit_task()
+ * and we issue it here.
+ */
+ scx_sub_init_cancel_task(parent, p);
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ continue;
+ }
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /*
+ * $p is initialized for $parent and still attached to
+ * @sch. Disable and exit for @sch, switch over to
+ * $parent, override the state to READY to account for
+ * $p having already been initialized, and then enable.
+ */
+ scx_disable_and_exit_task(sch, p);
+ scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
+ scx_set_task_state(p, SCX_TASK_INIT);
+ scx_set_task_sched(p, parent);
+ scx_set_task_state(p, SCX_TASK_READY);
+ scx_enable_task(parent, p);
+ }
+
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ }
+ scx_task_iter_stop(&sti);
+
+ scx_disable_dump(sch);
+
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ /*
+ * All tasks are moved off of @sch but there may still be on-going
+ * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
+ * the expedited version as ancestors may be waiting in bypass mode.
+ * Also, tell the parent that there is no need to keep running bypass
+ * DSQs for us.
+ */
+ synchronize_rcu_expedited();
+ scx_disable_bypass_dsp(sch);
+
+ scx_unlink_sched(sch);
+
+ mutex_unlock(&scx_enable_mutex);
+
+ /*
+ * @sch is now unlinked from the parent's children list. Notify and call
+ * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
+ * after unlinking and releasing all locks. See scx_claim_exit().
+ */
+ wake_up_all(&scx_unlink_waitq);
+
+ if (parent->ops.sub_detach && sch->sub_attached) {
+ struct scx_sub_detach_args sub_detach_args = {
+ .ops = &sch->ops,
+ .cgroup_path = sch->cgrp_path,
+ };
+ SCX_CALL_OP(parent, sub_detach, NULL,
+ &sub_detach_args);
+ }
+
+ scx_log_sched_disable(sch);
+
+ if (sch->ops.exit)
+ SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
+ if (sch->sub_kset)
+ kobject_del(&sch->sub_kset->kobj);
+ kobject_del(&sch->kobj);
+}
+
+/* verify that a scheduler can be attached to @cgrp and return the parent */
+static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
+{
+ struct scx_sched *parent = cgrp->scx_sched;
+ struct scx_sched *pos;
+
+ lockdep_assert_held(&scx_sched_lock);
+
+ /* can't attach twice to the same cgroup */
+ if (parent->cgrp == cgrp)
+ return ERR_PTR(-EBUSY);
+
+ /* does $parent allow sub-scheds? */
+ if (!parent->ops.sub_attach)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ /* can't insert between $parent and its exiting children */
+ list_for_each_entry(pos, &parent->children, sibling)
+ if (cgroup_is_descendant(pos->cgrp, cgrp))
+ return ERR_PTR(-EBUSY);
+
+ return parent;
+}
+
+static bool assert_task_ready_or_enabled(struct task_struct *p)
+{
+ u32 state = scx_get_task_state(p);
+
+ switch (state) {
+ case SCX_TASK_READY:
+ case SCX_TASK_ENABLED:
+ return true;
+ default:
+ WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
+ state, p->comm, p->pid);
+ return false;
+ }
+}
+
+void scx_sub_enable_workfn(struct kthread_work *work)
+{
+ struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
+ struct sched_ext_ops *ops = cmd->ops;
+ struct cgroup *cgrp;
+ struct scx_sched *parent, *sch;
+ struct scx_task_iter sti;
+ struct task_struct *p;
+ s32 i, ret;
+
+ mutex_lock(&scx_enable_mutex);
+
+ if (!scx_enabled()) {
+ ret = -ENODEV;
+ goto out_unlock;
+ }
+
+ /* See scx_root_enable_workfn() for the @ops->priv check. */
+ if (rcu_access_pointer(ops->priv)) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
+ if (IS_ERR(cgrp)) {
+ ret = PTR_ERR(cgrp);
+ goto out_unlock;
+ }
+
+ raw_spin_lock_irq(&scx_sched_lock);
+ parent = find_parent_sched(cgrp);
+ if (IS_ERR(parent)) {
+ raw_spin_unlock_irq(&scx_sched_lock);
+ ret = PTR_ERR(parent);
+ goto out_put_cgrp;
+ }
+ kobject_get(&parent->kobj);
+ raw_spin_unlock_irq(&scx_sched_lock);
+
+ /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
+ sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
+ kobject_put(&parent->kobj);
+ if (IS_ERR(sch)) {
+ ret = PTR_ERR(sch);
+ goto out_unlock;
+ }
+
+ ret = scx_link_sched(sch);
+ if (ret)
+ goto err_disable;
+
+ if (sch->level >= SCX_SUB_MAX_DEPTH) {
+ scx_error(sch, "max nesting depth %d violated",
+ SCX_SUB_MAX_DEPTH);
+ goto err_disable;
+ }
+
+ if (sch->ops.init) {
+ ret = SCX_CALL_OP_RET(sch, init, NULL);
+ if (ret) {
+ ret = scx_ops_sanitize_err(sch, "init", ret);
+ scx_error(sch, "ops.init() failed (%d)", ret);
+ goto err_disable;
+ }
+ sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
+ }
+
+ ret = scx_arena_pool_init(sch);
+ if (ret)
+ goto err_disable;
+
+ ret = scx_set_cmask_scratch_alloc(sch);
+ if (ret)
+ goto err_disable;
+
+ if (scx_validate_ops(sch, ops))
+ goto err_disable;
+
+ struct scx_sub_attach_args sub_attach_args = {
+ .ops = &sch->ops,
+ .cgroup_path = sch->cgrp_path,
+ };
+
+ ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
+ &sub_attach_args);
+ if (ret) {
+ ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
+ scx_error(sch, "parent rejected (%d)", ret);
+ goto err_disable;
+ }
+ sch->sub_attached = true;
+
+ scx_bypass(sch, true);
+
+ for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
+ if (((void (**)(void))ops)[i])
+ set_bit(i, sch->has_op);
+
+ percpu_down_write(&scx_fork_rwsem);
+ scx_cgroup_lock();
+
+ /*
+ * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
+ * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
+ */
+ set_cgroup_sched(sch_cgroup(sch), sch);
+ if (!(cgrp->self.flags & CSS_ONLINE)) {
+ scx_error(sch, "cgroup is not online");
+ goto err_unlock_and_disable;
+ }
+
+ /*
+ * Initialize tasks for the new child $sch without exiting them for
+ * $parent so that the tasks can always be reverted back to $parent
+ * sched on child init failure.
+ */
+ WARN_ON_ONCE(scx_enabling_sub_sched);
+ scx_enabling_sub_sched = sch;
+
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ struct rq *rq;
+ struct rq_flags rf;
+
+ /*
+ * Task iteration may visit the same task twice when racing
+ * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
+ * finished __scx_init_task() and skip if set.
+ *
+ * A task may exit and get freed between __scx_init_task()
+ * completion and scx_enable_task(). In such cases,
+ * scx_disable_and_exit_task() must exit the task for both the
+ * parent and child scheds.
+ */
+ if (p->scx.flags & SCX_TASK_SUB_INIT)
+ continue;
+
+ /* @p is pinned by the iter; see scx_sub_disable() */
+ get_task_struct(p);
+
+ if (!assert_task_ready_or_enabled(p)) {
+ ret = -EINVAL;
+ goto abort;
+ }
+
+ scx_task_iter_unlock(&sti);
+
+ /*
+ * As $p is still on $parent, it can't be transitioned to INIT.
+ * Let's worry about task state later. Use __scx_init_task().
+ */
+ ret = __scx_init_task(sch, p, false);
+ if (ret)
+ goto abort;
+
+ rq = task_rq_lock(p, &rf);
+
+ if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+ /*
+ * sched_ext_dead() raced us between __scx_init_task()
+ * and this rq lock and ran exit_task() on $parent (the
+ * sched @p was on at that point), not on @sch. @sch's
+ * just-completed init is owed an exit_task() and we
+ * issue it here.
+ */
+ scx_sub_init_cancel_task(sch, p);
+ task_rq_unlock(rq, p, &rf);
+ put_task_struct(p);
+ continue;
+ }
+
+ p->scx.flags |= SCX_TASK_SUB_INIT;
+ task_rq_unlock(rq, p, &rf);
+
+ put_task_struct(p);
+ }
+ scx_task_iter_stop(&sti);
+
+ /*
+ * All tasks are prepped. Disable/exit tasks for $parent and enable for
+ * the new @sch.
+ */
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ /*
+ * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
+ * duplicate iterations.
+ */
+ if (!(p->scx.flags & SCX_TASK_SUB_INIT))
+ continue;
+
+ scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+ /*
+ * $p must be either READY or ENABLED. If ENABLED,
+ * __scx_disable_and_exit_task() first disables and
+ * makes it READY. However, after exiting $p, it will
+ * leave $p as READY.
+ */
+ assert_task_ready_or_enabled(p);
+ __scx_disable_and_exit_task(parent, p);
+
+ /*
+ * $p is now only initialized for @sch and READY, which
+ * is what we want. Assign it to @sch and enable.
+ */
+ scx_set_task_sched(p, sch);
+ scx_enable_task(sch, p);
+
+ p->scx.flags &= ~SCX_TASK_SUB_INIT;
+ }
+ }
+ scx_task_iter_stop(&sti);
+
+ scx_enabling_sub_sched = NULL;
+
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+
+ scx_bypass(sch, false);
+
+ pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
+ kobject_uevent(&sch->kobj, KOBJ_ADD);
+ ret = 0;
+ goto out_unlock;
+
+out_put_cgrp:
+ cgroup_put(cgrp);
+out_unlock:
+ mutex_unlock(&scx_enable_mutex);
+ cmd->ret = ret;
+ return;
+
+abort:
+ put_task_struct(p);
+ scx_task_iter_stop(&sti);
+
+ /*
+ * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
+ * ran for @sch on them, so calling scx_disable_task() here would invoke
+ * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
+ * must stay set until SUB_INIT is cleared from every marked task -
+ * scx_disable_and_exit_task() reads it when a task exits concurrently.
+ */
+ scx_task_iter_start(&sti, sch->cgrp);
+ while ((p = scx_task_iter_next_locked(&sti))) {
+ if (p->scx.flags & SCX_TASK_SUB_INIT) {
+ scx_sub_init_cancel_task(sch, p);
+ p->scx.flags &= ~SCX_TASK_SUB_INIT;
+ }
+ }
+ scx_task_iter_stop(&sti);
+ scx_enabling_sub_sched = NULL;
+err_unlock_and_disable:
+ /* we'll soon enter disable path, keep bypass on */
+ scx_cgroup_unlock();
+ percpu_up_write(&scx_fork_rwsem);
+err_disable:
+ mutex_unlock(&scx_enable_mutex);
+ scx_flush_disable_work(sch);
+ cmd->ret = 0;
+}
+
+static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct cgroup *cgrp = data;
+ struct cgroup *parent = cgroup_parent(cgrp);
+
+ if (!cgroup_on_dfl(cgrp))
+ return NOTIFY_OK;
+
+ switch (action) {
+ case CGROUP_LIFETIME_ONLINE:
+ /* inherit ->scx_sched from $parent */
+ if (parent)
+ rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
+ break;
+ case CGROUP_LIFETIME_OFFLINE:
+ /* if there is a sched attached, shoot it down */
+ if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
+ scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
+ SCX_ECODE_RSN_CGROUP_OFFLINE,
+ "cgroup %llu going offline", cgroup_id(cgrp));
+ break;
+ }
+
+ return NOTIFY_OK;
+}
+
+static struct notifier_block scx_cgroup_lifetime_nb = {
+ .notifier_call = scx_cgroup_lifetime_notify,
+};
+
+static s32 __init scx_cgroup_lifetime_notifier_init(void)
+{
+ return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
+ &scx_cgroup_lifetime_nb);
+}
+core_initcall(scx_cgroup_lifetime_notifier_init);
+
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
+{
+ struct scx_sched *sch;
+
+ guard(rcu)();
+ sch = scx_prog_sched(prog->aux);
+ if (unlikely(!sch))
+ return;
+
+ scx_error(sch, "dispatch recursion detected");
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
+ * @cgroup_id: cgroup ID of the child scheduler to dispatch
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Allows a parent scheduler to trigger dispatching on one of its direct
+ * child schedulers. The child scheduler runs its dispatch operation to
+ * move tasks from dispatch queues to the local runqueue.
+ *
+ * Returns: true on success, false if cgroup_id is invalid, not a direct
+ * child, or caller lacks dispatch permission.
+ */
+__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
+{
+ struct rq *this_rq = this_rq();
+ struct scx_sched *parent, *child;
+
+ guard(rcu)();
+ parent = scx_prog_sched(aux);
+ if (unlikely(!parent))
+ return false;
+
+ child = scx_find_sub_sched(cgroup_id);
+
+ if (unlikely(!child))
+ return false;
+
+ if (unlikely(scx_parent(child) != parent)) {
+ scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
+ cgroup_id);
+ return false;
+ }
+
+ return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
+ true);
+}
+
+__bpf_kfunc_end_defs();
+
+#endif /* CONFIG_EXT_SUB_SCHED */
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
new file mode 100644
index 000000000000..460a9fd196dc
--- /dev/null
+++ b/kernel/sched/ext/sub.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_SUB_H
+#define _KERNEL_SCHED_EXT_SUB_H
+
+#include "internal.h"
+#include "cid.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root);
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch);
+struct cgroup *sch_cgroup(struct scx_sched *sch);
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch);
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog);
+void drain_descendants(struct scx_sched *sch);
+void scx_sub_disable(struct scx_sched *sch);
+void scx_sub_enable_workfn(struct kthread_work *work);
+bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
+
+#else /* CONFIG_EXT_SUB_SCHED */
+
+static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
+static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
+static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
+static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
+static inline void drain_descendants(struct scx_sched *sch) { }
+static inline void scx_sub_disable(struct scx_sched *sch) { }
+
+#endif /* CONFIG_EXT_SUB_SCHED */
+
+/**
+ * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
+ * @pos: iteration cursor
+ * @root: sched to walk the descendants of
+ *
+ * Walk @root's descendants. @root is included in the iteration and the first
+ * node to be visited. Must be called with either scx_enable_mutex or
+ * scx_sched_lock held.
+ */
+#define scx_for_each_descendant_pre(pos, root) \
+ for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \
+ (pos) = scx_next_descendant_pre((pos), (root)))
+
+/*
+ * One user of this function is scx_bpf_sub_dispatch() which can be called
+ * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
+ * from the call frame.
+ */
+static __always_inline bool
+scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
+ struct task_struct *prev, bool nested)
+{
+ struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
+ int nr_loops = SCX_DSP_MAX_LOOPS;
+ s32 cpu = cpu_of(rq);
+ bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
+ scx_task_on_sched(sch, prev);
+
+ if (scx_consume_global_dsq(sch, rq))
+ return true;
+
+ if (scx_bypass_dsp_enabled(sch)) {
+ /* if @sch is bypassing, only the bypass DSQs are active */
+ if (scx_bypassing(sch, cpu))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+#ifdef CONFIG_EXT_SUB_SCHED
+ /*
+ * If @sch isn't bypassing but its children are, @sch is
+ * responsible for making forward progress for both its own
+ * tasks that aren't bypassing and the bypassing descendants'
+ * tasks. The following implements a simple built-in behavior -
+ * let each CPU try to run the bypass DSQ every Nth time.
+ *
+ * Later, if necessary, we can add an ops flag to suppress the
+ * auto-consumption and a kfunc to consume the bypass DSQ so that
+ * the BPF scheduler can fully control scheduling of bypassed
+ * tasks.
+ */
+ struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+ if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
+ scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
+ __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
+ return true;
+ }
+#endif /* CONFIG_EXT_SUB_SCHED */
+ }
+
+ if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
+ return false;
+
+ dspc->rq = rq;
+
+ /*
+ * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+ * lock, the local DSQ might still end up empty after a successful
+ * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
+ * produced some tasks, retry. The BPF scheduler may depend on this
+ * looping behavior to simplify its implementation.
+ */
+ do {
+ dspc->nr_tasks = 0;
+
+ if (nested) {
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
+ } else {
+ /* stash @prev so that nested invocations can access it */
+ rq->scx.sub_dispatch_prev = prev;
+ SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+ prev_on_sch ? prev : NULL);
+ rq->scx.sub_dispatch_prev = NULL;
+ }
+
+ scx_flush_dispatch_buf(sch, rq);
+
+ if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
+ rq->scx.flags |= SCX_RQ_BAL_KEEP;
+ return true;
+ }
+ if (rq->scx.local_dsq.nr)
+ return true;
+ if (scx_consume_global_dsq(sch, rq))
+ return true;
+
+ /*
+ * ops.dispatch() can trap us in this loop by repeatedly
+ * dispatching ineligible tasks. Break out once in a while to
+ * allow the watchdog to run. As IRQ can't be enabled in
+ * balance(), we want to complete this scheduling cycle and then
+ * start a new one. IOW, we want to call resched_curr() on the
+ * next, most likely idle, task, not the current one. Use
+ * scx_kick_cpu() for deferred kicking.
+ */
+ if (unlikely(!--nr_loops)) {
+ scx_kick_cpu(sch, cpu, 0);
+ break;
+ }
+ } while (dspc->nr_tasks);
+
+ /*
+ * Prevent the CPU from going idle while bypassed descendants have tasks
+ * queued. Without this fallback, bypassed tasks could stall if the host
+ * scheduler's ops.dispatch() doesn't yield any tasks.
+ */
+ if (scx_bypass_dsp_enabled(sch))
+ return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+ return false;
+}
+
+#endif /* _KERNEL_SCHED_EXT_SUB_H */
--
2.54.0
* Re: [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
` (3 preceding siblings ...)
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:44 ` Tejun Heo
4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:44 UTC (permalink / raw)
To: David Vernet, Andrea Righi, Changwoo Min
Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo
Applied 1-4 to sched_ext/for-7.3.
Thanks.
--
tejun
* Re: [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:55 ` sashiko-bot
2026-07-01 21:56 ` Tejun Heo
0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-07-01 20:55 UTC (permalink / raw)
To: Tejun Heo; +Cc: sched-ext
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
Pre-existing issues:
- [High] `scx_sub_disable` violates RCU removal ordering, leading to Use-After-Free of sub-schedulers.
- [High] NULL pointer dereference in `find_parent_sched` when attaching a sub-scheduler to a newly created cgroup.
- [High] Missing `scx_error()` calls in `scx_sub_enable_workfn` error paths cause the sub-scheduler to leak and remain bypassed indefinitely.
--
commit 3673367fbf0f16668d479297e4d6ace4a6a2dbef
Author: Tejun Heo <tj@kernel.org>
Subject: sched_ext: Split sub-scheduler implementation into sub.c
This commit moves the sub-scheduler implementation from ext.c into a new
kernel/sched/ext/sub.c file. It also creates sub.h to hold prototypes and
inline functions like scx_dispatch_sched() for better modularity.
> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> new file mode 100644
> index 0000000000000..0504204272732
> --- /dev/null
> +++ b/kernel/sched/ext/sub.c
[ ... ]
> +void scx_sub_disable(struct scx_sched *sch)
> +{
[ ... ]
> + /*
> + * All tasks are moved off of @sch but there may still be on-going
> + * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
> + * the expedited version as ancestors may be waiting in bypass mode.
> + * Also, tell the parent that there is no need to keep running bypass
> + * DSQs for us.
> + */
> + synchronize_rcu_expedited();
> + scx_disable_bypass_dsp(sch);
> +
> + scx_unlink_sched(sch);
[Severity: High]
This is a pre-existing issue, but does this violate RCU removal ordering?
The sub-scheduler is removed from the scx_sched_hash via scx_unlink_sched()
after the RCU grace period completes. Because it remains in the hash during
and after the grace period, couldn't a concurrent BPF program on the parent
scheduler call scx_bpf_sub_dispatch() and find the sub-scheduler?
Should scx_unlink_sched() be called before synchronize_rcu_expedited() to
ensure new RCU readers cannot look it up?
[ ... ]
> +/* verify that a scheduler can be attached to @cgrp and return the parent */
> +static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
> +{
> + struct scx_sched *parent = cgrp->scx_sched;
> + struct scx_sched *pos;
> +
> + lockdep_assert_held(&scx_sched_lock);
> +
> + /* can't attach twice to the same cgroup */
> + if (parent->cgrp == cgrp)
[Severity: High]
This isn't a bug introduced by this patch, but is it possible for parent
to be NULL here?
If a BPF program attaches a sub-scheduler to a newly created cgroup that
hasn't completed its ONLINE phase yet, cgrp->scx_sched might be NULL. The
code dereferences it immediately.
The explicit check for CSS_ONLINE happens later in scx_sub_enable_workfn().
Should we verify parent is not NULL before dereferencing it?
[ ... ]
> +void scx_sub_enable_workfn(struct kthread_work *work)
> +{
[ ... ]
> + ret = scx_link_sched(sch);
> + if (ret)
> + goto err_disable;
> +
> + if (sch->level >= SCX_SUB_MAX_DEPTH) {
> + scx_error(sch, "max nesting depth %d violated",
> + SCX_SUB_MAX_DEPTH);
> + goto err_disable;
> + }
> +
> + if (sch->ops.init) {
> + ret = SCX_CALL_OP_RET(sch, init, NULL);
> + if (ret) {
> + ret = scx_ops_sanitize_err(sch, "init", ret);
> + scx_error(sch, "ops.init() failed (%d)", ret);
> + goto err_disable;
> + }
> + sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
> + }
> +
> + ret = scx_arena_pool_init(sch);
> + if (ret)
> + goto err_disable;
> +
> + ret = scx_set_cmask_scratch_alloc(sch);
> + if (ret)
> + goto err_disable;
> +
> + if (scx_validate_ops(sch, ops))
> + goto err_disable;
[Severity: High]
This is a pre-existing issue, but do we need to call scx_error() in these
failure paths?
If scx_link_sched(), scx_arena_pool_init(), scx_set_cmask_scratch_alloc(),
or scx_validate_ops() fail, the code jumps to err_disable without calling
scx_error().
Without scx_error(), sch->exit_kind remains SCX_EXIT_NONE. When
scx_flush_disable_work(sch) is executed at err_disable, it will immediately
return, leaving the sub-scheduler partially initialized and permanently
bypassed, while returning 0 (success) to userspace.
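The concern can be illustrated with a small userspace model. This is a sketch only, not kernel code: every name ending in _model is a hypothetical stand-in, and the single behavior taken from the description above is that the flush returns immediately while the exit kind is still NONE.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum exit_kind_model { EXIT_NONE_MODEL, EXIT_ERROR_MODEL };

struct sched_model {
	enum exit_kind_model exit_kind;	/* set only by error_model() */
	bool torn_down;			/* true once the disable path has run */
};

/* Models scx_error(): record that the sched must be disabled. */
static void error_model(struct sched_model *sch)
{
	if (sch->exit_kind == EXIT_NONE_MODEL)
		sch->exit_kind = EXIT_ERROR_MODEL;
}

/* Models the flush: nothing to do unless an exit was requested. */
static void flush_disable_work_model(struct sched_model *sch)
{
	if (sch->exit_kind == EXIT_NONE_MODEL)
		return;
	sch->torn_down = true;
}

/*
 * Models an err_disable-style exit from the enable path. @report selects
 * whether the failing step called error_model() before jumping here.
 * The value handed back to the caller is 0 in both cases.
 */
static int enable_fail_model(struct sched_model *sch, bool report)
{
	if (report)
		error_model(sch);
	flush_disable_work_model(sch);
	return 0;
}
```

In this model, a failure that skips error_model() leaves torn_down false even though the caller sees 0, which is the shape of the leak described above. Whether the real disable path behaves this way depends on kernel details the model does not capture.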
[ ... ]
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260701203438.2837392-1-tj@kernel.org?part=4
* Re: [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
2026-07-01 20:55 ` sashiko-bot
@ 2026-07-01 21:56 ` Tejun Heo
0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 21:56 UTC (permalink / raw)
To: sashiko-reviews; +Cc: sched-ext
Hello.
> The sub-scheduler is removed from the scx_sched_hash via scx_unlink_sched()
> after the RCU grace period completes. Because it remains in the hash during
> and after the grace period, couldn't a concurrent BPF program on the parent
> scheduler call scx_bpf_sub_dispatch() and find the sub-scheduler?
It can, but that's fine. The expedited synchronize is there to drain
in-flight ops after all tasks have been moved off, not to make the sched
unreachable. The sched is freed through another RCU grace period after
unlinking and scx_bpf_sub_dispatch() runs under RCU read lock, so the
racing lookup can't see the sched freed under it. Such a dispatch is also
inert as the child is already bypassed and has no tasks left.
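The lifetime argument above can be sketched as a single-threaded userspace model. It is an illustration under stated assumptions, not the kernel implementation: all _model names are hypothetical, and the "grace period" is simplified to "no reader is currently inside a read-side section", which is stricter than what real RCU requires.

```c
#include <stdbool.h>
#include <stddef.h>

struct child_model {
	bool linked;	/* still reachable through the lookup table */
	bool freed;	/* memory has been released */
};

struct rcu_model {
	int readers;				/* read-side sections in progress */
	struct child_model *pending_free;	/* queued until readers drain */
};

static void read_lock_model(struct rcu_model *rcu)
{
	rcu->readers++;
}

static void read_unlock_model(struct rcu_model *rcu)
{
	rcu->readers--;
	/* Simplified grace period: ends when the last reader leaves. */
	if (!rcu->readers && rcu->pending_free) {
		rcu->pending_free->freed = true;
		rcu->pending_free = NULL;
	}
}

/* Models the hash lookup: only linked objects can be found. */
static struct child_model *lookup_model(struct child_model *c)
{
	return c->linked ? c : NULL;
}

/* Models unlink followed by a deferred free. */
static void unlink_and_free_model(struct rcu_model *rcu, struct child_model *c)
{
	c->linked = false;
	if (rcu->readers)
		rcu->pending_free = c;	/* a reader may still hold a pointer */
	else
		c->freed = true;
}
```

A reader that found the child before the unlink keeps a valid pointer until it leaves its read-side section, and a lookup after the unlink returns NULL. The model says nothing about whether the racing dispatch is harmless; that part rests on the child being bypassed and empty, as stated above.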
> If a BPF program attaches a sub-scheduler to a newly created cgroup that
> hasn't completed its ONLINE phase yet, cgrp->scx_sched might be NULL. The
> code dereferences it immediately.
The cgroup can only come from cgroup_get_from_id() which fails on kernfs
nodes which haven't been activated, and cgroup_mkdir() activates the node
after CGROUP_LIFETIME_ONLINE has fired and ->scx_sched has been inherited
from the parent. The inherited pointer is non-NULL whenever the attach can
get this far. The root enable path sets ->scx_sched on all live cgroups
before __scx_enabled is turned on, under the same scx_enable_mutex that
scx_sub_enable_workfn() holds across the scx_enabled() check. The
CSS_ONLINE check afterwards guards against destruction, not incomplete
onlining.
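The ordering argument can be modeled in a few lines of userspace C. This is a hedged sketch: the _model names are hypothetical, and the two steps stand in for the order described above, where ->scx_sched is inherited at ONLINE time and lookup by ID succeeds only after activation.

```c
#include <stdbool.h>
#include <stddef.h>

struct cgrp_model {
	const void *scx_sched;	/* inherited scheduler pointer, NULL until set */
	bool activated;		/* lookup by ID succeeds only once true */
};

/* Step 1: the ONLINE notification inherits the parent's scheduler. */
static void online_model(struct cgrp_model *c, const struct cgrp_model *parent)
{
	c->scx_sched = parent->scx_sched;
}

/* Step 2: activation makes the cgroup visible to ID lookups. */
static void activate_model(struct cgrp_model *c)
{
	c->activated = true;
}

/* Models the ID lookup: fails on cgroups that are not yet activated. */
static struct cgrp_model *get_from_id_model(struct cgrp_model *c)
{
	return c->activated ? c : NULL;
}
```

As long as step 1 runs before step 2 and the parent's pointer is set, any cgroup the lookup returns already has a non-NULL scheduler pointer.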
> This is a pre-existing issue, but do we need to call scx_error() in these
> failure paths?
Will be dealt with separately.
Thanks.
--
tejun