[PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
@ 2026-07-01 20:34 Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Hello,

v3: Patch 2 also exposes scx_rq_online(), scx_flush_dispatch_buf() and
    scx_kick_cpu() which scx_dispatch_sched() in sub.h calls, so that patch
    4 builds on its own (sashiko AI). Added Andrea's Reviewed-by. Patches
    1, 3 and 4 are otherwise unchanged.

v2: https://lore.kernel.org/all/20260701181046.2490390-1-tj@kernel.org
v1: https://lore.kernel.org/all/20260701031429.1892218-1-tj@kernel.org

The sub-scheduler implementation has grown and will keep growing. Move it
out of ext.c into a new kernel/sched/ext/sub.c. The first three patches are
mechanical prep (prefix file-local helpers, expose shared internals, inline
a few trivial helpers) so the move itself stays pure code motion. No
functional change.

Based on sched_ext/for-7.3 (5df6a4506d06) with sched_ext/for-7.2-fixes
(b7d9c359e5cf) assumed merged.

Tejun Heo (4):
  sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
  sched_ext: Expose the ext.c internals used by the sub.c split
  sched_ext: Inline small ext.c helpers shared across the sub.c split
  sched_ext: Split sub-scheduler implementation into sub.c

 kernel/sched/build_policy.c |    2 +
 kernel/sched/ext/ext.c      | 1122 ++++-------------------------------
 kernel/sched/ext/internal.h |  167 +++++-
 kernel/sched/ext/sub.c      |  668 +++++++++++++++++++++
 kernel/sched/ext/sub.h      |  161 +++++
 5 files changed, 1106 insertions(+), 1014 deletions(-)

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split
  2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

A later change moves the sub-scheduler implementation out of ext.c into its
own file, from where it calls a number of file-local ext.c helpers. Give
those helpers the scx_ prefix that cross-file sched_ext symbols carry, ahead
of the move so the mechanical rename stays out of the code-motion patch. No
functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext/ext.c      | 192 ++++++++++++++++++------------------
 kernel/sched/ext/internal.h |   2 +-
 2 files changed, 97 insertions(+), 97 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 4e0cd08a6a2e..56e6a13fd0f8 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -369,7 +369,7 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
 	return __setscheduler_class(p->policy, p->prio);
 }
 
-static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu)
+static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
 {
 	return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
 }
@@ -392,11 +392,11 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
 		sch = scx_parent(sch);
 #endif	/* CONFIG_EXT_SUB_SCHED */
 
-	return bypass_dsq(sch, cpu);
+	return scx_bypass_dsq(sch, cpu);
 }
 
 /**
- * bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
  * @sch: scheduler to check
  *
  * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
@@ -408,9 +408,9 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
  *
  * This function checks bypass_dsp_enable_depth which is managed separately from
  * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * disable_bypass_dsp().
+ * scx_disable_bypass_dsp().
  */
-static bool bypass_dsp_enabled(struct scx_sched *sch)
+static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
 {
 	return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
 }
@@ -1079,7 +1079,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
 }
 
 /**
- * ops_sanitize_err - Sanitize a -errno value
+ * scx_ops_sanitize_err - Sanitize a -errno value
  * @sch: scx_sched to error out on error
  * @ops_name: operation to blame on failure
  * @err: -errno value to sanitize
@@ -1091,7 +1091,7 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
  * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
  * handled as a pointer.
  */
-static int ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
 {
 	if (err < 0 && err >= -MAX_ERRNO)
 		return err;
@@ -1251,7 +1251,7 @@ static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq
 		schedule_deferred(rq);
 }
 
-static void schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
 {
 	struct scx_sched *root = rcu_dereference_sched(scx_root);
 
@@ -1347,8 +1347,8 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
 	 * to the CPU or dequeued. In both cases, the only way @p can go back to
 	 * the BPF sched is through enqueueing. If being inserted into a local
 	 * DSQ with IMMED, persist the state until the next enqueueing event in
-	 * do_enqueue_task() so that we can maintain IMMED protection through
-	 * e.g. SAVE/RESTORE cycles and slice extensions.
+	 * scx_do_enqueue_task() so that we can maintain IMMED protection
+	 * through e.g. SAVE/RESTORE cycles and slice extensions.
 	 */
 	if (enq_flags & SCX_ENQ_IMMED) {
 		if (unlikely(dsq->id != SCX_DSQ_LOCAL)) {
@@ -1371,7 +1371,7 @@ static void dsq_inc_nr(struct scx_dispatch_q *dsq, struct task_struct *p, u64 en
 		 * done yet, @p can't go on the CPU immediately. Re-enqueue.
 		 */
 		if (unlikely(dsq->nr > 1 || !rq_is_open(rq, enq_flags)))
-			schedule_reenq_local(rq, 0);
+			scx_schedule_reenq_local(rq, 0);
 	}
 }
 
@@ -1488,9 +1488,9 @@ static void local_dsq_post_enq(struct scx_sched *sch, struct scx_dispatch_q *dsq
 	}
 }
 
-static void dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
-			     struct scx_dispatch_q *dsq, struct task_struct *p,
-			     u64 enq_flags)
+static void scx_dispatch_enqueue(struct scx_sched *sch, struct rq *rq,
+				 struct scx_dispatch_q *dsq, struct task_struct *p,
+				 u64 enq_flags)
 {
 	bool is_local = dsq->id == SCX_DSQ_LOCAL;
 
@@ -1638,7 +1638,7 @@ static void task_unlink_from_dsq(struct task_struct *p,
 	}
 }
 
-static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
+static void scx_dispatch_dequeue(struct rq *rq, struct task_struct *p)
 {
 	struct scx_dispatch_q *dsq = p->scx.dsq;
 	bool is_local = dsq == &rq->scx.local_dsq;
@@ -1692,8 +1692,8 @@ static void dispatch_dequeue(struct rq *rq, struct task_struct *p)
 }
 
 /*
- * Abbreviated version of dispatch_dequeue() that can be used when both @p's rq
- * and dsq are locked.
+ * Abbreviated version of scx_dispatch_dequeue() that can be used when both
+ * @p's rq and dsq are locked.
  */
 static void dispatch_dequeue_locked(struct task_struct *p,
 				    struct scx_dispatch_q *dsq)
@@ -1774,10 +1774,10 @@ static void mark_direct_dispatch(struct scx_sched *sch,
  *  - direct_dispatch(): cleared on the synchronous enqueue path, deferred
  *    dispatch keeps the state until consumed
  *  - process_ddsp_deferred_locals(): cleared after consuming deferred state,
- *  - do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
+ *  - scx_do_enqueue_task(): cleared on enqueue fallbacks where the dispatch
  *    verdict is ignored (local/global/bypass)
- *  - dequeue_task_scx(): cleared after dispatch_dequeue(), covering deferred
- *    cancellation and holding_cpu races
+ *  - dequeue_task_scx(): cleared after scx_dispatch_dequeue(), covering
+ *    deferred cancellation and holding_cpu races
  *  - scx_disable_task(): cleared for queued wakeup tasks, which are excluded by
  *    the scx_bypass() loop, so that stale state is not reused by a subsequent
  *    scheduler instance
@@ -1838,7 +1838,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 	ddsp_enq_flags = p->scx.ddsp_enq_flags;
 	clear_direct_dispatch(p);
 
-	dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
+	scx_dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
 static bool scx_rq_online(struct rq *rq)
@@ -1853,8 +1853,8 @@ static bool scx_rq_online(struct rq *rq)
 	return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq)));
 }
 
-static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
-			    int sticky_cpu)
+static void scx_do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
+				int sticky_cpu)
 {
 	struct scx_sched *sch = scx_task_sched(p);
 	struct task_struct **ddsp_taskp;
@@ -1941,7 +1941,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	direct_dispatch(sch, p, enq_flags);
 	return;
 local_norefill:
-	dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
+	scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, enq_flags);
 	return;
 local:
 	dsq = &rq->scx.local_dsq;
@@ -1962,7 +1962,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	touch_core_sched(rq, p);
 	refill_task_slice_dfl(sch, p);
 	clear_direct_dispatch(p);
-	dispatch_enqueue(sch, rq, dsq, p, enq_flags);
+	scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags);
 }
 
 static bool task_runnable(const struct task_struct *p)
@@ -2031,7 +2031,7 @@ static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int core_enq_
 	if (rq->scx.nr_running == 1)
 		dl_server_start(&rq->ext_server);
 
-	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
+	scx_do_enqueue_task(rq, p, enq_flags, sticky_cpu);
 
 	if (sticky_cpu >= 0)
 		p->scx.sticky_cpu = -1;
@@ -2167,7 +2167,7 @@ static bool dequeue_task_scx(struct rq *rq, struct task_struct *p, int core_deq_
 	rq->scx.nr_running--;
 	sub_nr_running(rq, 1);
 
-	dispatch_dequeue(rq, p);
+	scx_dispatch_dequeue(rq, p);
 	clear_direct_dispatch(p);
 	return true;
 }
@@ -2215,7 +2215,7 @@ static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_fl
 	 * - A higher-priority wakes up while SCX dispatch is in progress.
 	 */
 	if (rq->scx.nr_immed)
-		schedule_reenq_local(rq, 0);
+		scx_schedule_reenq_local(rq, 0);
 }
 
 static void move_local_task_to_local_dsq(struct scx_sched *sch,
@@ -2380,7 +2380,7 @@ static bool task_can_run_on_remote_rq(struct scx_sched *sch,
  * values afterwards, as this operation can't be preempted or recurse, the
  * holding_cpu can never become this CPU again before we're done. Thus, we can
  * tell whether we lost to dequeue by testing whether the holding_cpu still
- * points to this CPU. See dispatch_dequeue() for the counterpart.
+ * points to this CPU. See scx_dispatch_dequeue() for the counterpart.
  *
  * On return, @dsq is unlocked and @src_rq is locked. Returns %true if @p is
  * still valid. %false if lost to dequeue.
@@ -2485,14 +2485,14 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 		dispatch_dequeue_locked(p, src_dsq);
 		raw_spin_unlock(&src_dsq->lock);
 
-		dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
+		scx_dispatch_enqueue(sch, dst_rq, dst_dsq, p, enq_flags);
 	}
 
 	return dst_rq;
 }
 
-static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
-			       struct scx_dispatch_q *dsq, u64 enq_flags)
+static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+				   struct scx_dispatch_q *dsq, u64 enq_flags)
 {
 	struct task_struct *p;
 retry:
@@ -2538,11 +2538,11 @@ static bool consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 	return false;
 }
 
-static bool consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
 {
 	int node = cpu_to_node(cpu_of(rq));
 
-	return consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
+	return scx_consume_dispatch_q(sch, rq, &sch->pnode[node]->global_dsq, 0);
 }
 
 /**
@@ -2575,8 +2575,8 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 	 * If dispatching to @rq that @p is already on, no lock dancing needed.
 	 */
 	if (rq == src_rq && rq == dst_rq) {
-		dispatch_enqueue(sch, rq, dst_dsq, p,
-				 enq_flags | SCX_ENQ_CLEAR_OPSS);
+		scx_dispatch_enqueue(sch, rq, dst_dsq, p,
+				     enq_flags | SCX_ENQ_CLEAR_OPSS);
 		return;
 	}
 
@@ -2614,13 +2614,13 @@ static void dispatch_to_local_dsq(struct scx_sched *sch, struct rq *rq,
 		 */
 		if (src_rq == dst_rq) {
 			p->scx.holding_cpu = -1;
-			dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
-					 enq_flags);
+			scx_dispatch_enqueue(sch, dst_rq, &dst_rq->scx.local_dsq, p,
+					     enq_flags);
 		} else if (unlikely(!task_can_run_on_remote_rq(sch, p, dst_rq, true))) {
 			p->scx.holding_cpu = -1;
 			fallback = true;
-			dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
-					 p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
+			scx_dispatch_enqueue(sch, src_rq, find_global_dsq(sch, task_cpu(p)),
+					     p, enq_flags | SCX_ENQ_GDSQ_FALLBACK);
 		} else {
 			move_remote_task_to_local_dsq(p, enq_flags,
 						      src_rq, dst_rq);
@@ -2708,10 +2708,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 		goto retry;
 	case SCX_OPSS_QUEUEING:
 		/*
-		 * do_enqueue_task() is in the process of transferring the task
-		 * to the BPF scheduler while holding @p's rq lock. As we aren't
-		 * holding any kernel or BPF resource that the enqueue path may
-		 * depend upon, it's safe to wait.
+		 * scx_do_enqueue_task() is in the process of transferring the
+		 * task to the BPF scheduler while holding @p's rq lock. As we
+		 * aren't holding any kernel or BPF resource that the enqueue
+		 * path may depend upon, it's safe to wait.
 		 */
 		wait_ops_state(p, opss);
 		goto retry;
@@ -2724,10 +2724,10 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 	if (dsq->id == SCX_DSQ_LOCAL)
 		dispatch_to_local_dsq(sch, rq, dsq, p, enq_flags);
 	else
-		dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
+		scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
-static void flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
+static void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
 {
 	struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
 	u32 u;
@@ -2771,13 +2771,13 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 	bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
 		scx_task_on_sched(sch, prev);
 
-	if (consume_global_dsq(sch, rq))
+	if (scx_consume_global_dsq(sch, rq))
 		return true;
 
-	if (bypass_dsp_enabled(sch)) {
+	if (scx_bypass_dsp_enabled(sch)) {
 		/* if @sch is bypassing, only the bypass DSQs are active */
 		if (scx_bypassing(sch, cpu))
-			return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+			return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
 
 #ifdef CONFIG_EXT_SUB_SCHED
 		/*
@@ -2795,7 +2795,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 		struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
 
 		if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
-		    consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0)) {
+		    scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
 			__scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
 			return true;
 		}
@@ -2808,8 +2808,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 	dspc->rq = rq;
 
 	/*
-	 * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
-	 * the local DSQ might still end up empty after a successful
+	 * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+	 * lock, the local DSQ might still end up empty after a successful
 	 * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
 	 * produced some tasks, retry. The BPF scheduler may depend on this
 	 * looping behavior to simplify its implementation.
@@ -2828,7 +2828,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 			rq->scx.sub_dispatch_prev = NULL;
 		}
 
-		flush_dispatch_buf(sch, rq);
+		scx_flush_dispatch_buf(sch, rq);
 
 		if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
 			rq->scx.flags |= SCX_RQ_BAL_KEEP;
@@ -2836,7 +2836,7 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 		}
 		if (rq->scx.local_dsq.nr)
 			return true;
-		if (consume_global_dsq(sch, rq))
+		if (scx_consume_global_dsq(sch, rq))
 			return true;
 
 		/*
@@ -2859,8 +2859,8 @@ scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
 	 * queued. Without this fallback, bypassed tasks could stall if the host
 	 * scheduler's ops.dispatch() doesn't yield any tasks.
 	 */
-	if (bypass_dsp_enabled(sch))
-		return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu), 0);
+	if (scx_bypass_dsp_enabled(sch))
+		return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
 
 	return false;
 }
@@ -2939,7 +2939,7 @@ static int balance_one(struct rq *rq, struct task_struct *prev)
 	 *   between the IMMED queueing and the subsequent scheduling event.
 	 */
 	if (unlikely(rq->scx.local_dsq.nr > 1 && rq->scx.nr_immed))
-		schedule_reenq_local(rq, 0);
+		scx_schedule_reenq_local(rq, 0);
 
 	rq->scx.flags &= ~SCX_RQ_IN_BALANCE;
 	return true;
@@ -2955,7 +2955,7 @@ static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
 		 * dispatched. Call ops_dequeue() to notify the BPF scheduler.
 		 */
 		ops_dequeue(rq, p, SCX_DEQ_CORE_SCHED_EXEC);
-		dispatch_dequeue(rq, p);
+		scx_dispatch_dequeue(rq, p);
 	}
 
 	p->se.exec_start = rq_clock_task(rq);
@@ -3067,10 +3067,10 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		if (p->scx.slice && !scx_bypassing(sch, cpu_of(rq))) {
 			if (p->scx.flags & SCX_TASK_IMMED) {
 				p->scx.flags |= SCX_TASK_REENQ_PREEMPTED;
-				do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+				scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
 				p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 			} else {
-				dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
+				scx_dispatch_enqueue(sch, rq, &rq->scx.local_dsq, p, SCX_ENQ_HEAD);
 			}
 			goto switch_class;
 		}
@@ -3088,9 +3088,9 @@ static void put_prev_task_scx(struct rq *rq, struct task_struct *p,
 		if (next && sched_class_above(&ext_sched_class, next->sched_class)) {
 			WARN_ON_ONCE(sched_cpu_cookie_match(rq, p) &&
 				     !(sch->ops.flags & SCX_OPS_ENQ_LAST));
-			do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
+			scx_do_enqueue_task(rq, p, SCX_ENQ_LAST, -1);
 		} else {
-			do_enqueue_task(rq, p, 0, -1);
+			scx_do_enqueue_task(rq, p, 0, -1);
 		}
 	}
 
@@ -3562,7 +3562,7 @@ static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fo
 
 		ret = SCX_CALL_OP_RET(sch, init_task, NULL, p, &args);
 		if (unlikely(ret)) {
-			ret = ops_sanitize_err(sch, "init_task", ret);
+			ret = scx_ops_sanitize_err(sch, "init_task", ret);
 			return ret;
 		}
 	}
@@ -4107,7 +4107,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 		if (!local_task_should_reenq(p, &reenq_flags, &reason))
 			continue;
 
-		dispatch_dequeue(rq, p);
+		scx_dispatch_dequeue(rq, p);
 
 		if (WARN_ON_ONCE(p->scx.flags & SCX_TASK_REENQ_REASON_MASK))
 			p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
@@ -4119,7 +4119,7 @@ static u32 reenq_local(struct scx_sched *sch, struct rq *rq, u64 reenq_flags)
 	list_for_each_entry_safe(p, n, &tasks, scx.dsq_list.node) {
 		list_del_init(&p->scx.dsq_list.node);
 
-		do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
+		scx_do_enqueue_task(rq, p, SCX_ENQ_REENQ, -1);
 
 		p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 		nr_enqueued++;
@@ -4234,7 +4234,7 @@ static void reenq_user(struct rq *rq, struct scx_dispatch_q *dsq, u64 reenq_flag
 			p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 		p->scx.flags |= reason;
 
-		do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
+		scx_do_enqueue_task(task_rq, p, SCX_ENQ_REENQ, -1);
 
 		p->scx.flags &= ~SCX_TASK_REENQ_REASON_MASK;
 
@@ -4354,7 +4354,7 @@ int scx_tg_online(struct task_group *tg)
 			ret = SCX_CALL_OP_RET(sch, cgroup_init,
 					      NULL, tg->css.cgroup, &args);
 			if (ret)
-				ret = ops_sanitize_err(sch, "cgroup_init", ret);
+				ret = scx_ops_sanitize_err(sch, "cgroup_init", ret);
 		}
 		if (ret == 0)
 			tg->scx.flags |= SCX_TG_ONLINE | SCX_TG_INITED;
@@ -4422,7 +4422,7 @@ int scx_cgroup_can_attach(struct cgroup_taskset *tset)
 		p->scx.cgrp_moving_from = NULL;
 	}
 
-	return ops_sanitize_err(sch, "cgroup_prep_move", ret);
+	return scx_ops_sanitize_err(sch, "cgroup_prep_move", ret);
 }
 
 void scx_cgroup_move_task(struct task_struct *p)
@@ -4700,7 +4700,7 @@ static void destroy_dsq(struct scx_sched *sch, u64 dsq_id)
 		goto out_unlock_dsq;
 
 	/*
-	 * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
+	 * Mark dead by invalidating ->id to prevent scx_dispatch_enqueue() from
 	 * queueing more tasks. As this function can be called from anywhere,
 	 * freeing is bounced through an irq work to avoid nesting RCU
 	 * operations inside scheduler locks.
@@ -4928,7 +4928,7 @@ static void scx_sched_free_rcu_work(struct work_struct *work)
 		 */
 		WARN_ON_ONCE(!list_empty(&pcpu->deferred_reenq_local.node));
 
-		exit_dsq(bypass_dsq(sch, cpu));
+		exit_dsq(scx_bypass_dsq(sch, cpu));
 	}
 
 	free_percpu(sch->pcpu);
@@ -5239,7 +5239,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
 			 u32 nr_donor_target, u32 nr_donee_target)
 {
 	struct rq *donor_rq = cpu_rq(donor);
-	struct scx_dispatch_q *donor_dsq = bypass_dsq(sch, donor);
+	struct scx_dispatch_q *donor_dsq = scx_bypass_dsq(sch, donor);
 	struct task_struct *p, *n;
 	struct scx_dsq_list_node cursor = INIT_DSQ_LIST_CURSOR(cursor, donor_dsq, 0);
 	s32 delta = READ_ONCE(donor_dsq->nr) - nr_donor_target;
@@ -5287,7 +5287,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
 		if (donee >= nr_cpu_ids)
 			continue;
 
-		donee_dsq = bypass_dsq(sch, donee);
+		donee_dsq = scx_bypass_dsq(sch, donee);
 
 		/*
 		 * $p's rq is not locked but $p's DSQ lock protects its
@@ -5308,7 +5308,7 @@ static u32 bypass_lb_cpu(struct scx_sched *sch, s32 donor,
 		 * between bypass DSQs.
 		 */
 		dispatch_dequeue_locked(p, donor_dsq);
-		dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
+		scx_dispatch_enqueue(sch, cpu_rq(donee), donee_dsq, p, SCX_ENQ_NESTED);
 
 		/*
 		 * $donee might have been idle and need to be woken up. No need
@@ -5351,7 +5351,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
 
 	/* count the target tasks and CPUs */
 	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
-		u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+		u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
 
 		nr_tasks += nr;
 		nr_cpus++;
@@ -5373,7 +5373,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
 
 	cpumask_clear(donee_mask);
 	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
-		if (READ_ONCE(bypass_dsq(sch, cpu)->nr) < nr_target)
+		if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) < nr_target)
 			cpumask_set_cpu(cpu, donee_mask);
 	}
 
@@ -5384,7 +5384,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
 			break;
 		if (cpumask_test_cpu(cpu, donee_mask))
 			continue;
-		if (READ_ONCE(bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
+		if (READ_ONCE(scx_bypass_dsq(sch, cpu)->nr) <= nr_donor_target)
 			continue;
 
 		nr_balanced += bypass_lb_cpu(sch, cpu, donee_mask, resched_mask,
@@ -5395,7 +5395,7 @@ static void bypass_lb_node(struct scx_sched *sch, int node)
 		resched_cpu(cpu);
 
 	for_each_cpu_and(cpu, cpu_online_mask, node_mask) {
-		u32 nr = READ_ONCE(bypass_dsq(sch, cpu)->nr);
+		u32 nr = READ_ONCE(scx_bypass_dsq(sch, cpu)->nr);
 
 		after_min = min(nr, after_min);
 		after_max = max(nr, after_max);
@@ -5421,7 +5421,7 @@ static void scx_bypass_lb_timerfn(struct timer_list *timer)
 	int node;
 	u32 intv_us;
 
-	if (!bypass_dsp_enabled(sch))
+	if (!scx_bypass_dsp_enabled(sch))
 		return;
 
 	for_each_node_with_cpus(node)
@@ -5487,9 +5487,9 @@ static void enable_bypass_dsp(struct scx_sched *sch)
 	 * dispatch enabled while a descendant is bypassing, which is all that's
 	 * required.
 	 *
-	 * bypass_dsp_enabled() test is used to determine whether to enter the
-	 * bypass dispatch handling path from both bypassing and hosting scheds.
-	 * Bump enable depth on both @sch and bypass dispatch host.
+	 * scx_bypass_dsp_enabled() test is used to determine whether to enter
+	 * the bypass dispatch handling path from both bypassing and hosting
+	 * scheds. Bump enable depth on both @sch and bypass dispatch host.
 	 */
 	ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
 	WARN_ON_ONCE(ret <= 0);
@@ -5509,7 +5509,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
 }
 
 /* may be called without holding scx_bypass_lock */
-static void disable_bypass_dsp(struct scx_sched *sch)
+static void scx_disable_bypass_dsp(struct scx_sched *sch)
 {
 	s32 ret;
 
@@ -5654,7 +5654,7 @@ static void scx_bypass(struct scx_sched *sch, bool bypass)
 
 	/* disarming must come after moving all tasks out of the bypass DSQs */
 	if (!bypass)
-		disable_bypass_dsp(sch);
+		scx_disable_bypass_dsp(sch);
 unlock:
 	raw_spin_unlock_irqrestore(&scx_bypass_lock, flags);
 }
@@ -6003,7 +6003,7 @@ static void scx_sub_disable(struct scx_sched *sch)
 	 * DSQs for us.
 	 */
 	synchronize_rcu_expedited();
-	disable_bypass_dsp(sch);
+	scx_disable_bypass_dsp(sch);
 
 	scx_unlink_sched(sch);
 
@@ -6810,7 +6810,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 	}
 
 	for_each_possible_cpu(cpu) {
-		ret = init_dsq(bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
+		ret = init_dsq(scx_bypass_dsq(sch, cpu), SCX_DSQ_BYPASS, sch);
 		if (ret) {
 			bypass_fail_cpu = cpu;
 			goto err_free_pcpu;
@@ -6963,7 +6963,7 @@ static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
 	for_each_possible_cpu(cpu) {
 		if (cpu == bypass_fail_cpu)
 			break;
-		exit_dsq(bypass_dsq(sch, cpu));
+		exit_dsq(scx_bypass_dsq(sch, cpu));
 	}
 	free_percpu(sch->pcpu);
 err_free_pnode:
@@ -7007,7 +7007,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
 	return 0;
 }
 
-static int validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 {
 	/*
 	 * It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
@@ -7170,7 +7170,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	if (sch->ops.init) {
 		ret = SCX_CALL_OP_RET(sch, init, NULL);
 		if (ret) {
-			ret = ops_sanitize_err(sch, "init", ret);
+			ret = scx_ops_sanitize_err(sch, "init", ret);
 			cpus_read_unlock();
 			scx_error(sch, "ops.init() failed (%d)", ret);
 			goto err_disable;
@@ -7203,7 +7203,7 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 
 	cpus_read_unlock();
 
-	ret = validate_ops(sch, ops);
+	ret = scx_validate_ops(sch, ops);
 	if (ret)
 		goto err_disable;
 
@@ -7545,7 +7545,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 	if (sch->ops.init) {
 		ret = SCX_CALL_OP_RET(sch, init, NULL);
 		if (ret) {
-			ret = ops_sanitize_err(sch, "init", ret);
+			ret = scx_ops_sanitize_err(sch, "init", ret);
 			scx_error(sch, "ops.init() failed (%d)", ret);
 			goto err_disable;
 		}
@@ -7560,7 +7560,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 	if (ret)
 		goto err_disable;
 
-	if (validate_ops(sch, ops))
+	if (scx_validate_ops(sch, ops))
 		goto err_disable;
 
 	struct scx_sub_attach_args sub_attach_args = {
@@ -7571,7 +7571,7 @@ static void scx_sub_enable_workfn(struct kthread_work *work)
 	ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
 			      &sub_attach_args);
 	if (ret) {
-		ret = ops_sanitize_err(sch, "sub_attach", ret);
+		ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
 		scx_error(sch, "parent rejected (%d)", ret);
 		goto err_disable;
 	}
@@ -8830,7 +8830,7 @@ static bool scx_dsq_move(struct bpf_iter_scx_dsq_kern *kit,
 
 	/*
 	 * If the BPF scheduler keeps calling this function repeatedly, it can
-	 * cause similar live-lock conditions as consume_dispatch_q().
+	 * cause similar live-lock conditions as scx_consume_dispatch_q().
 	 */
 	if (unlikely(READ_ONCE(sch->aborting)))
 		return false;
@@ -8991,7 +8991,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
 
 	dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
 
-	flush_dispatch_buf(sch, dspc->rq);
+	scx_flush_dispatch_buf(sch, dspc->rq);
 
 	dsq = find_user_dsq(sch, dsq_id);
 	if (unlikely(!dsq)) {
@@ -8999,7 +8999,7 @@ __bpf_kfunc bool scx_bpf_dsq_move_to_local___v2(u64 dsq_id, u64 enq_flags,
 		return false;
 	}
 
-	if (consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
+	if (scx_consume_dispatch_q(sch, dspc->rq, dsq, enq_flags)) {
 		/*
 		 * A successfully consumed task can be dequeued before it starts
 		 * running while the CPU is trying to migrate other dispatched
@@ -10683,7 +10683,7 @@ static int __init scx_init(void)
 	/* @priv tail must align since both share the same data block */
 	CID_OFFSET_MATCH(priv, priv);
 	/*
-	 * cid-form must end exactly at @priv - validate_ops() skips
+	 * cid-form must end exactly at @priv - scx_validate_ops() skips
 	 * cpu_acquire/cpu_release for cid-form because reading those fields
 	 * past the BPF allocation would be UB.
 	 */
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 0256931a379a..743980dc60b0 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1172,7 +1172,7 @@ struct scx_sched {
 	u64			bypass_timestamp;
 	s32			bypass_depth;
 
-	/* bypass dispatch path enable state, see bypass_dsp_enabled() */
+	/* bypass dispatch path enable state, see scx_bypass_dsp_enabled() */
 	unsigned long		bypass_dsp_claim;
 	atomic_t		bypass_dsp_enable_depth;
 
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used by the sub.c split
  2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The sub-scheduler implementation is about to move into its own sub.c, from
where it calls a set of ext.c helpers and shares a few ext.c globals. Make
those reachable across the new file boundary ahead of the move.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
v2: Also expose scx_rq_online(), scx_flush_dispatch_buf() and scx_kick_cpu()
    which scx_dispatch_sched() in sub.h calls (sashiko AI).

 kernel/sched/ext/ext.c      | 116 ++++++++++++------------------------
 kernel/sched/ext/internal.h |  80 +++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 77 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index 56e6a13fd0f8..bdbc66466962 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -20,7 +20,7 @@
 #include "arena.h"
 #include "idle.h"
 
-static DEFINE_RAW_SPINLOCK(scx_sched_lock);
+DEFINE_RAW_SPINLOCK(scx_sched_lock);
 
 /*
  * NOTE: sched_ext is in the process of growing multiple scheduler support and
@@ -39,14 +39,14 @@ struct scx_sched __rcu *scx_root;
 static LIST_HEAD(scx_sched_all);
 
 #ifdef CONFIG_EXT_SUB_SCHED
-static const struct rhashtable_params scx_sched_hash_params = {
+const struct rhashtable_params scx_sched_hash_params = {
 	.key_len		= sizeof_field(struct scx_sched, ops.sub_cgroup_id),
 	.key_offset		= offsetof(struct scx_sched, ops.sub_cgroup_id),
 	.head_offset		= offsetof(struct scx_sched, hash_node),
 	.insecure_elasticity	= true,	/* inserted under scx_sched_lock */
 };
 
-static struct rhashtable scx_sched_hash;
+struct rhashtable scx_sched_hash;
 #endif
 
 /* see SCX_OPS_TID_TO_TASK */
@@ -68,9 +68,9 @@ static DEFINE_RAW_SPINLOCK(scx_tasks_lock);
 static LIST_HEAD(scx_tasks);
 
 /* ops enable/disable */
-static DEFINE_MUTEX(scx_enable_mutex);
+DEFINE_MUTEX(scx_enable_mutex);
 DEFINE_STATIC_KEY_FALSE(__scx_enabled);
-DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
+DEFINE_PERCPU_RWSEM(scx_fork_rwsem);
 static atomic_t scx_enable_state_var = ATOMIC_INIT(SCX_DISABLED);
 static DEFINE_RAW_SPINLOCK(scx_bypass_lock);
 static bool scx_init_task_enabled;
@@ -101,7 +101,7 @@ static atomic64_t scx_tid_cursor = ATOMIC64_INIT(1);
  * tasks for the sub-sched being enabled. Use a global variable instead of a
  * per-task field as all enables are serialized.
  */
-static struct scx_sched *scx_enabling_sub_sched;
+struct scx_sched *scx_enabling_sub_sched;
 #else
 #define scx_enabling_sub_sched	(struct scx_sched *)NULL
 #endif	/* CONFIG_EXT_SUB_SCHED */
@@ -242,7 +242,6 @@ MODULE_PARM_DESC(bypass_lb_intv_us, "bypass load balance interval in microsecond
 
 static void run_deferred(struct rq *rq);
 static bool task_dead_and_done(struct task_struct *p);
-static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
 static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind);
 
 __printf(5, 6) bool __scx_exit(struct scx_sched *sch,
@@ -676,12 +675,12 @@ struct bpf_iter_scx_dsq {
 } __attribute__((aligned(8)));
 
 
-static u32 scx_get_task_state(const struct task_struct *p)
+u32 scx_get_task_state(const struct task_struct *p)
 {
 	return p->scx.flags & SCX_TASK_STATE_MASK;
 }
 
-static void scx_set_task_state(struct task_struct *p, u32 state)
+void scx_set_task_state(struct task_struct *p, u32 state)
 {
 	u32 prev_state = scx_get_task_state(p);
 	bool warn = false;
@@ -721,23 +720,6 @@ static void scx_set_task_state(struct task_struct *p, u32 state)
 	p->scx.flags |= state;
 }
 
-/*
- * SCX task iterator.
- */
-struct scx_task_iter {
-	struct sched_ext_entity		cursor;
-	struct task_struct		*locked_task;
-	struct rq			*rq;
-	struct rq_flags			rf;
-	u32				cnt;
-	bool				list_locked;
-#ifdef CONFIG_EXT_SUB_SCHED
-	struct cgroup			*cgrp;
-	struct cgroup_subsys_state	*css_pos;
-	struct css_task_iter		css_iter;
-#endif
-};
-
 /**
  * scx_task_iter_start - Lock scx_tasks_lock and start a task iteration
  * @iter: iterator to init
@@ -766,7 +748,7 @@ struct scx_task_iter {
  * All tasks which existed when the iteration started are guaranteed to be
  * visited as long as they are not dead.
  */
-static void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp)
 {
 	memset(iter, 0, sizeof(*iter));
 
@@ -805,7 +787,7 @@ static void __scx_task_iter_rq_unlock(struct scx_task_iter *iter)
  * This function can be safely called anytime during an iteration. The next
  * iterator operation will automatically restore the necessary locking.
  */
-static void scx_task_iter_unlock(struct scx_task_iter *iter)
+void scx_task_iter_unlock(struct scx_task_iter *iter)
 {
 	__scx_task_iter_rq_unlock(iter);
 	if (iter->list_locked) {
@@ -848,7 +830,7 @@ static void scx_task_iter_relock(struct scx_task_iter *iter,
  * which is released on return. If the iterator holds a task's rq lock, that rq
  * lock is also released. See scx_task_iter_start() for details.
  */
-static void scx_task_iter_stop(struct scx_task_iter *iter)
+void scx_task_iter_stop(struct scx_task_iter *iter)
 {
 #ifdef CONFIG_EXT_SUB_SCHED
 	if (iter->cgrp) {
@@ -923,7 +905,7 @@ static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
  * whether they would like to filter out dead tasks. See scx_task_iter_start()
  * for details.
  */
-static struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 {
 	struct task_struct *p;
 
@@ -1186,8 +1168,8 @@ static void schedule_deferred_locked(struct rq *rq)
 	schedule_deferred(rq);
 }
 
-static void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
-			       u64 reenq_flags, struct rq *locked_rq)
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+			u64 reenq_flags, struct rq *locked_rq)
 {
 	struct rq *rq;
 
@@ -1841,7 +1823,7 @@ static void direct_dispatch(struct scx_sched *sch, struct task_struct *p,
 	scx_dispatch_enqueue(sch, rq, dsq, p, ddsp_enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
-static bool scx_rq_online(struct rq *rq)
+bool scx_rq_online(struct rq *rq)
 {
 	/*
 	 * Test both cpu_active() and %SCX_RQ_ONLINE. %SCX_RQ_ONLINE indicates
@@ -2491,8 +2473,8 @@ static struct rq *move_task_between_dsqs(struct scx_sched *sch,
 	return dst_rq;
 }
 
-static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
-				   struct scx_dispatch_q *dsq, u64 enq_flags)
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+			    struct scx_dispatch_q *dsq, u64 enq_flags)
 {
 	struct task_struct *p;
 retry:
@@ -2538,7 +2520,7 @@ static bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
 	return false;
 }
 
-static bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq)
 {
 	int node = cpu_to_node(cpu_of(rq));
 
@@ -2727,7 +2709,7 @@ static void finish_dispatch(struct scx_sched *sch, struct rq *rq,
 		scx_dispatch_enqueue(sch, rq, dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
 }
 
-static void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
+void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq)
 {
 	struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
 	u32 u;
@@ -3548,7 +3530,7 @@ static struct cgroup *tg_cgrp(struct task_group *tg)
 
 #endif	/* CONFIG_EXT_GROUP_SCHED */
 
-static int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork)
 {
 	int ret;
 
@@ -3631,7 +3613,7 @@ static void __scx_enable_task(struct scx_sched *sch, struct task_struct *p)
 		SCX_CALL_OP_TASK(sch, set_weight, rq, p, p->scx.weight);
 }
 
-static void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p)
 {
 	__scx_enable_task(sch, p);
 	scx_set_task_state(p, SCX_TASK_ENABLED);
@@ -3665,8 +3647,7 @@ static void scx_disable_task(struct scx_sched *sch, struct task_struct *p)
 	WARN_ON_ONCE(p->scx.flags & SCX_TASK_IN_CUSTODY);
 }
 
-static void __scx_disable_and_exit_task(struct scx_sched *sch,
-					struct task_struct *p)
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
 {
 	struct scx_exit_task_args args = {
 		.cancelled = false,
@@ -3700,7 +3681,7 @@ static void __scx_disable_and_exit_task(struct scx_sched *sch,
  * ran. The task state has not been transitioned, so this mirrors the
  * SCX_TASK_INIT branch in __scx_disable_and_exit_task().
  */
-static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p)
 {
 	struct scx_exit_task_args args = { .cancelled = true };
 
@@ -3711,8 +3692,7 @@ static void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *
 		SCX_CALL_OP_TASK(sch, exit_task, task_rq(p), p, &args);
 }
 
-static void scx_disable_and_exit_task(struct scx_sched *sch,
-				      struct task_struct *p)
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p)
 {
 	__scx_disable_and_exit_task(sch, p);
 
@@ -4525,7 +4505,7 @@ static struct cgroup *root_cgroup(void)
 	return &cgrp_dfl_root.cgrp;
 }
 
-static void scx_cgroup_lock(void)
+void scx_cgroup_lock(void)
 {
 #ifdef CONFIG_EXT_GROUP_SCHED
 	percpu_down_write(&scx_cgroup_ops_rwsem);
@@ -4533,7 +4513,7 @@ static void scx_cgroup_lock(void)
 	cgroup_lock();
 }
 
-static void scx_cgroup_unlock(void)
+void scx_cgroup_unlock(void)
 {
 	cgroup_unlock();
 #ifdef CONFIG_EXT_GROUP_SCHED
@@ -4851,7 +4831,7 @@ static void free_exit_info(struct scx_exit_info *ei);
 static const char *scx_exit_reason(enum scx_exit_kind kind);
 static bool scx_claim_exit(struct scx_sched *sch, enum scx_exit_kind kind);
 
-static s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch)
 {
 	size_t size = struct_size_t(struct scx_cmask, bits,
 				    SCX_CMASK_NR_WORDS(num_possible_cpus()));
@@ -5509,7 +5489,7 @@ static void enable_bypass_dsp(struct scx_sched *sch)
 }
 
 /* may be called without holding scx_bypass_lock */
-static void scx_disable_bypass_dsp(struct scx_sched *sch)
+void scx_disable_bypass_dsp(struct scx_sched *sch)
 {
 	s32 ret;
 
@@ -5557,7 +5537,7 @@ static void scx_disable_bypass_dsp(struct scx_sched *sch)
  *
  * - scx_prio_less() reverts to the default core_sched_at order.
  */
-static void scx_bypass(struct scx_sched *sch, bool bypass)
+void scx_bypass(struct scx_sched *sch, bool bypass)
 {
 	struct scx_sched *pos;
 	unsigned long flags;
@@ -5746,7 +5726,7 @@ static void refresh_watchdog(void)
 		cancel_delayed_work_sync(&scx_watchdog_work);
 }
 
-static s32 scx_link_sched(struct scx_sched *sch)
+s32 scx_link_sched(struct scx_sched *sch)
 {
 	const char *err_msg = "";
 	s32 ret = 0;
@@ -5795,7 +5775,7 @@ static s32 scx_link_sched(struct scx_sched *sch)
 	return 0;
 }
 
-static void scx_unlink_sched(struct scx_sched *sch)
+void scx_unlink_sched(struct scx_sched *sch)
 {
 	scoped_guard(raw_spinlock_irq, &scx_sched_lock) {
 #ifdef CONFIG_EXT_SUB_SCHED
@@ -5816,13 +5796,13 @@ static void scx_unlink_sched(struct scx_sched *sch)
  * @sch. Once @sch becomes empty during disable, there's no point in dumping it.
  * This prevents calling dump ops on a dead sch.
  */
-static void scx_disable_dump(struct scx_sched *sch)
+void scx_disable_dump(struct scx_sched *sch)
 {
 	guard(raw_spinlock_irqsave)(&scx_dump_lock);
 	sch->dump_disabled = true;
 }
 
-static void scx_log_sched_disable(struct scx_sched *sch)
+void scx_log_sched_disable(struct scx_sched *sch)
 {
 	struct scx_exit_info *ei = sch->exit_info;
 	const char *type = scx_parent(sch) ? "sub-scheduler" : "scheduler";
@@ -6285,7 +6265,7 @@ static void scx_disable(struct scx_sched *sch, enum scx_exit_kind kind)
  * as a noop. Syncing the irq_work first is required to guarantee the
  * kthread work has been queued before waiting for it.
  */
-static void scx_flush_disable_work(struct scx_sched *sch)
+void scx_flush_disable_work(struct scx_sched *sch)
 {
 	int kind;
 
@@ -6739,31 +6719,13 @@ static struct scx_sched_pnode *alloc_pnode(struct scx_sched *sch, int node)
 	return pnode;
 }
 
-/*
- * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
- * starvation. During the READY -> ENABLED task switching loop, the calling
- * thread's sched_class gets switched from fair to ext. As fair has higher
- * priority than ext, the calling thread can be indefinitely starved under
- * fair-class saturation, leading to a system hang.
- */
-struct scx_enable_cmd {
-	struct kthread_work	work;
-	union {
-		struct sched_ext_ops		*ops;
-		struct sched_ext_ops_cid	*ops_cid;
-	};
-	bool			is_cid_type;
-	struct bpf_map		*arena_map;	/* arena ref to transfer to sch */
-	int			ret;
-};
-
 /*
  * Allocate and initialize a new scx_sched. @cgrp's reference is always
  * consumed whether the function succeeds or fails.
  */
-static struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
-						 struct cgroup *cgrp,
-						 struct scx_sched *parent)
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+					  struct cgroup *cgrp,
+					  struct scx_sched *parent)
 {
 	struct sched_ext_ops *ops = cmd->ops;
 	struct scx_sched *sch;
@@ -7007,7 +6969,7 @@ static int check_hotplug_seq(struct scx_sched *sch,
 	return 0;
 }
 
-static int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops)
 {
 	/*
 	 * It doesn't make sense to specify the SCX_OPS_ENQ_LAST flag if the
@@ -9342,7 +9304,7 @@ __bpf_kfunc bool scx_bpf_task_set_dsq_vtime(struct task_struct *p, u64 vtime,
 	return true;
 }
 
-static void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
+void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags)
 {
 	struct rq *this_rq;
 	unsigned long irq_flags;
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index 743980dc60b0..c4a910d2ca91 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1535,6 +1535,41 @@ enum scx_ops_state {
 #define SCX_OPSS_STATE_MASK	((1LU << SCX_OPSS_QSEQ_SHIFT) - 1)
 #define SCX_OPSS_QSEQ_MASK	(~SCX_OPSS_STATE_MASK)
 
+/*
+ * SCX task iterator.
+ */
+struct scx_task_iter {
+	struct sched_ext_entity		cursor;
+	struct task_struct		*locked_task;
+	struct rq			*rq;
+	struct rq_flags			rf;
+	u32				cnt;
+	bool				list_locked;
+#ifdef CONFIG_EXT_SUB_SCHED
+	struct cgroup			*cgrp;
+	struct cgroup_subsys_state	*css_pos;
+	struct css_task_iter		css_iter;
+#endif
+};
+
+/*
+ * scx_enable() is offloaded to a dedicated system-wide RT kthread to avoid
+ * starvation. During the READY -> ENABLED task switching loop, the calling
+ * thread's sched_class gets switched from fair to ext. As fair has higher
+ * priority than ext, the calling thread can be indefinitely starved under
+ * fair-class saturation, leading to a system hang.
+ */
+struct scx_enable_cmd {
+	struct kthread_work	work;
+	union {
+		struct sched_ext_ops		*ops;
+		struct sched_ext_ops_cid	*ops_cid;
+	};
+	bool			is_cid_type;
+	struct bpf_map		*arena_map;	/* arena ref to transfer to sch */
+	int			ret;
+};
+
 extern struct scx_sched __rcu *scx_root;
 DECLARE_PER_CPU(struct rq *, scx_locked_rq_state);
 
@@ -1555,6 +1590,51 @@ __printf(5, 0) bool scx_vexit(struct scx_sched *sch, enum scx_exit_kind kind,
 __printf(5, 6) bool __scx_exit(struct scx_sched *sch, enum scx_exit_kind kind,
 			       s64 exit_code, s32 exit_cpu, const char *fmt, ...);
 
+u32 scx_get_task_state(const struct task_struct *p);
+void scx_set_task_state(struct task_struct *p, u32 state);
+void scx_task_iter_start(struct scx_task_iter *iter, struct cgroup *cgrp);
+void scx_task_iter_unlock(struct scx_task_iter *iter);
+void scx_task_iter_stop(struct scx_task_iter *iter);
+struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter);
+bool scx_consume_dispatch_q(struct scx_sched *sch, struct rq *rq,
+			    struct scx_dispatch_q *dsq, u64 enq_flags);
+bool scx_consume_global_dsq(struct scx_sched *sch, struct rq *rq);
+bool scx_rq_online(struct rq *rq);
+void scx_flush_dispatch_buf(struct scx_sched *sch, struct rq *rq);
+void scx_kick_cpu(struct scx_sched *sch, s32 cpu, u64 flags);
+void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
+			u64 reenq_flags, struct rq *locked_rq);
+int __scx_init_task(struct scx_sched *sch, struct task_struct *p, bool fork);
+void scx_enable_task(struct scx_sched *sch, struct task_struct *p);
+void __scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+void scx_sub_init_cancel_task(struct scx_sched *sch, struct task_struct *p);
+void scx_disable_and_exit_task(struct scx_sched *sch, struct task_struct *p);
+#if defined(CONFIG_EXT_GROUP_SCHED) || defined(CONFIG_EXT_SUB_SCHED)
+void scx_cgroup_lock(void);
+void scx_cgroup_unlock(void);
+#endif
+s32 scx_set_cmask_scratch_alloc(struct scx_sched *sch);
+void scx_disable_bypass_dsp(struct scx_sched *sch);
+void scx_bypass(struct scx_sched *sch, bool bypass);
+s32 scx_link_sched(struct scx_sched *sch);
+void scx_unlink_sched(struct scx_sched *sch);
+void scx_disable_dump(struct scx_sched *sch);
+void scx_log_sched_disable(struct scx_sched *sch);
+void scx_flush_disable_work(struct scx_sched *sch);
+struct scx_sched *scx_alloc_and_add_sched(struct scx_enable_cmd *cmd,
+					  struct cgroup *cgrp,
+					  struct scx_sched *parent);
+int scx_validate_ops(struct scx_sched *sch, const struct sched_ext_ops *ops);
+
+extern raw_spinlock_t scx_sched_lock;
+extern struct mutex scx_enable_mutex;
+extern struct percpu_rw_semaphore scx_fork_rwsem;
+#ifdef CONFIG_EXT_SUB_SCHED
+extern const struct rhashtable_params scx_sched_hash_params;
+extern struct rhashtable scx_sched_hash;
+extern struct scx_sched *scx_enabling_sub_sched;
+#endif
+
 #define scx_exit(sch, kind, exit_code, fmt, args...)				\
 	__scx_exit(sch, kind, exit_code, raw_smp_processor_id(), fmt, ##args)
 #define scx_error(sch, fmt, args...)						\
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across the sub.c split
  2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
  2026-07-01 20:44 ` [PATCHSET v3 sched_ext/for-7.3] " Tejun Heo
  4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The following trivial helpers in ext.c are called from both ext.c and the
sub-scheduler code. Define them as static inline in internal.h.

- scx_bypass_dsq()
- scx_bypass_dsp_enabled()
- scx_ops_sanitize_err()
- scx_schedule_reenq_local()

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
 kernel/sched/ext/ext.c      | 57 -------------------------------------
 kernel/sched/ext/internal.h | 57 +++++++++++++++++++++++++++++++++++++
 2 files changed, 57 insertions(+), 57 deletions(-)

diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index bdbc66466962..d1ef79c1038d 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -368,11 +368,6 @@ static const struct sched_class *scx_setscheduler_class(struct task_struct *p)
 	return __setscheduler_class(p->policy, p->prio);
 }
 
-static struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
-{
-	return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
-}
-
 static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu)
 {
 #ifdef CONFIG_EXT_SUB_SCHED
@@ -394,26 +389,6 @@ static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 c
 	return scx_bypass_dsq(sch, cpu);
 }
 
-/**
- * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
- * @sch: scheduler to check
- *
- * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
- * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
- * are bypassing. In the former case, the ancestor is not itself bypassing but
- * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
- * the ancestor's bypass dispatch path must be active even though its own
- * bypass_depth remains zero.
- *
- * This function checks bypass_dsp_enable_depth which is managed separately from
- * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
- * scx_disable_bypass_dsp().
- */
-static bool scx_bypass_dsp_enabled(struct scx_sched *sch)
-{
-	return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
-}
-
 /**
  * rq_is_open - Is the rq available for immediate execution of an SCX task?
  * @rq: rq to test
@@ -1060,28 +1035,6 @@ bool scx_cpu_valid(struct scx_sched *sch, s32 cpu, const char *where)
 	}
 }
 
-/**
- * scx_ops_sanitize_err - Sanitize a -errno value
- * @sch: scx_sched to error out on error
- * @ops_name: operation to blame on failure
- * @err: -errno value to sanitize
- *
- * Verify @err is a valid -errno. If not, trigger scx_error() and return
- * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
- * cause misbehaviors. For an example, a large negative return from
- * ops.init_task() triggers an oops when passed up the call chain because the
- * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
- * handled as a pointer.
- */
-static int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
-{
-	if (err < 0 && err >= -MAX_ERRNO)
-		return err;
-
-	scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
-	return -EPROTO;
-}
-
 static void deferred_bal_cb_workfn(struct rq *rq)
 {
 	run_deferred(rq);
@@ -1233,16 +1186,6 @@ void schedule_dsq_reenq(struct scx_sched *sch, struct scx_dispatch_q *dsq,
 		schedule_deferred(rq);
 }
 
-static void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
-{
-	struct scx_sched *root = rcu_dereference_sched(scx_root);
-
-	if (WARN_ON_ONCE(!root))
-		return;
-
-	schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
-}
-
 /**
  * touch_core_sched - Update timestamp used for core-sched task ordering
  * @rq: rq to read clock from, must be locked
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index c4a910d2ca91..c3b97ea4ae79 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -1640,6 +1640,63 @@ extern struct scx_sched *scx_enabling_sub_sched;
 #define scx_error(sch, fmt, args...)						\
 	scx_exit((sch), SCX_EXIT_ERROR, 0, fmt, ##args)
 
+static inline struct scx_dispatch_q *scx_bypass_dsq(struct scx_sched *sch, s32 cpu)
+{
+	return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
+}
+
+/**
+ * scx_bypass_dsp_enabled - Check if bypass dispatch path is enabled
+ * @sch: scheduler to check
+ *
+ * When a descendant scheduler enters bypass mode, bypassed tasks are scheduled
+ * by the nearest non-bypassing ancestor, or the root scheduler if all ancestors
+ * are bypassing. In the former case, the ancestor is not itself bypassing but
+ * its bypass DSQs will be populated with bypassed tasks from descendants. Thus,
+ * the ancestor's bypass dispatch path must be active even though its own
+ * bypass_depth remains zero.
+ *
+ * This function checks bypass_dsp_enable_depth which is managed separately from
+ * bypass_depth to enable this decoupling. See enable_bypass_dsp() and
+ * scx_disable_bypass_dsp().
+ */
+static inline bool scx_bypass_dsp_enabled(struct scx_sched *sch)
+{
+	return unlikely(atomic_read(&sch->bypass_dsp_enable_depth));
+}
+
+/**
+ * scx_ops_sanitize_err - Sanitize a -errno value
+ * @sch: scx_sched to error out on error
+ * @ops_name: operation to blame on failure
+ * @err: -errno value to sanitize
+ *
+ * Verify @err is a valid -errno. If not, trigger scx_error() and return
+ * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
+ * cause misbehaviors. For an example, a large negative return from
+ * ops.init_task() triggers an oops when passed up the call chain because the
+ * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
+ * handled as a pointer.
+ */
+static inline int scx_ops_sanitize_err(struct scx_sched *sch, const char *ops_name, s32 err)
+{
+	if (err < 0 && err >= -MAX_ERRNO)
+		return err;
+
+	scx_error(sch, "ops.%s() returned an invalid errno %d", ops_name, err);
+	return -EPROTO;
+}
+
+static inline void scx_schedule_reenq_local(struct rq *rq, u64 reenq_flags)
+{
+	struct scx_sched *root = rcu_dereference_sched(scx_root);
+
+	if (WARN_ON_ONCE(!root))
+		return;
+
+	schedule_dsq_reenq(root, &rq->scx.local_dsq, reenq_flags, rq);
+}
+
 /*
  * Return the rq currently locked from an scx callback, or NULL if no rq is
  * locked.
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
  2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
                   ` (2 preceding siblings ...)
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
@ 2026-07-01 20:34 ` Tejun Heo
  2026-07-01 20:55   ` sashiko-bot
  2026-07-01 20:44 ` [PATCHSET v3 sched_ext/for-7.3] " Tejun Heo
  4 siblings, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:34 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

The sub-scheduler implementation has grown and will continue to expand. Move
the sub-scheduler functions from ext.c into a new kernel/sched/ext/sub.c.
sub.h holds the prototypes and the !CONFIG_EXT_SUB_SCHED no-op stubs.

scx_dispatch_sched() is shared: balance_one() in ext.c and the
scx_bpf_sub_dispatch() kfunc in sub.c both call it, and the latter re-enters
it as sub-scheduler dispatch nests. It moves into sub.h as a static
__always_inline so both callers keep it inlined and per-level stack stays
bounded across the recursion. The event macros it uses move to internal.h.

No functional change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
---
v2: Fold the scx_dispatch_sched() sub.h promotion into this patch (was a
    separate later patch in v1) so the split is self-contained (Andrea).

 kernel/sched/build_policy.c |   2 +
 kernel/sched/ext/ext.c      | 811 +-----------------------------------
 kernel/sched/ext/internal.h |  28 ++
 kernel/sched/ext/sub.c      | 668 +++++++++++++++++++++++++++++
 kernel/sched/ext/sub.h      | 161 +++++++
 5 files changed, 860 insertions(+), 810 deletions(-)
 create mode 100644 kernel/sched/ext/sub.c
 create mode 100644 kernel/sched/ext/sub.h

diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
index d74b54f81992..01dc7bf89af8 100644
--- a/kernel/sched/build_policy.c
+++ b/kernel/sched/build_policy.c
@@ -66,10 +66,12 @@
 # include "ext/cid.h"
 # include "ext/arena.h"
 # include "ext/idle.h"
+# include "ext/sub.h"
 # include "ext/ext.c"
 # include "ext/cid.c"
 # include "ext/arena.c"
 # include "ext/idle.c"
+# include "ext/sub.c"
 #endif
 
 #include "syscalls.c"
diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c
index d1ef79c1038d..1a0ec985da77 100644
--- a/kernel/sched/ext/ext.c
+++ b/kernel/sched/ext/ext.c
@@ -19,6 +19,7 @@
 #include "cid.h"
 #include "arena.h"
 #include "idle.h"
+#include "sub.h"
 
 DEFINE_RAW_SPINLOCK(scx_sched_lock);
 
@@ -271,58 +272,6 @@ static bool u32_before(u32 a, u32 b)
 	return (s32)(a - b) < 0;
 }
 
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_next_descendant_pre - find the next descendant for pre-order walk
- * @pos: the current position (%NULL to initiate traversal)
- * @root: sched whose descendants to walk
- *
- * To be used by scx_for_each_descendant_pre(). Find the next descendant to
- * visit for pre-order traversal of @root's descendants. @root is included in
- * the iteration and the first node to be visited.
- */
-static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos,
-						 struct scx_sched *root)
-{
-	struct scx_sched *next;
-
-	lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
-		       lockdep_is_held(&scx_sched_lock));
-
-	/* if first iteration, visit @root */
-	if (!pos)
-		return root;
-
-	/* visit the first child if exists */
-	next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
-	if (next)
-		return next;
-
-	/* no child, visit my or the closest ancestor's next sibling */
-	while (pos != root) {
-		if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
-			return list_next_entry(pos, sibling);
-		pos = scx_parent(pos);
-	}
-
-	return NULL;
-}
-
-static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
-{
-	return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
-				 scx_sched_hash_params);
-}
-
-static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
-{
-	rcu_assign_pointer(p->scx.sched, sch);
-}
-#else	/* CONFIG_EXT_SUB_SCHED */
-static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
-static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 /**
  * scx_is_descendant - Test whether sched is a descendant
  * @sch: sched to test
@@ -337,19 +286,6 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor)
 	return sch->ancestors[ancestor->level] == ancestor;
 }
 
-/**
- * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
- * @pos: iteration cursor
- * @root: sched to walk the descendants of
- *
- * Walk @root's descendants. @root is included in the iteration and the first
- * node to be visited. Must be called with either scx_enable_mutex or
- * scx_sched_lock held.
- */
-#define scx_for_each_descendant_pre(pos, root)					\
-	for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos);		\
-	     (pos) = scx_next_descendant_pre((pos), (root)))
-
 static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu)
 {
 	return &sch->pnode[cpu_to_node(cpu)]->global_dsq;
@@ -935,32 +871,6 @@ struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter)
 	return NULL;
 }
 
-/**
- * scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This can be used when preemption is not disabled.
- */
-#define scx_add_event(sch, name, cnt) do {					\
-	this_cpu_add((sch)->pcpu->event_stats.name, (cnt));			\
-	trace_sched_ext_event(#name, (cnt));					\
-} while(0)
-
-/**
- * __scx_add_event - Increase an event counter for 'name' by 'cnt'
- * @sch: scx_sched to account events for
- * @name: an event name defined in struct scx_event_stats
- * @cnt: the number of the event occurred
- *
- * This should be used only when preemption is disabled.
- */
-#define __scx_add_event(sch, name, cnt) do {					\
-	__this_cpu_add((sch)->pcpu->event_stats.name, (cnt));			\
-	trace_sched_ext_event(#name, cnt);					\
-} while(0)
-
 /**
  * scx_dump_event - Dump an event 'kind' in 'events' to 's'
  * @s: output seq_buf
@@ -2681,115 +2591,6 @@ static inline void maybe_queue_balance_callback(struct rq *rq)
 	rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING;
 }
 
-/*
- * One user of this function is scx_bpf_dispatch() which can be called
- * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
- * from the call frame.
- */
-static __always_inline bool
-scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
-		   struct task_struct *prev, bool nested)
-{
-	struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
-	int nr_loops = SCX_DSP_MAX_LOOPS;
-	s32 cpu = cpu_of(rq);
-	bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
-		scx_task_on_sched(sch, prev);
-
-	if (scx_consume_global_dsq(sch, rq))
-		return true;
-
-	if (scx_bypass_dsp_enabled(sch)) {
-		/* if @sch is bypassing, only the bypass DSQs are active */
-		if (scx_bypassing(sch, cpu))
-			return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
-#ifdef CONFIG_EXT_SUB_SCHED
-		/*
-		 * If @sch isn't bypassing but its children are, @sch is
-		 * responsible for making forward progress for both its own
-		 * tasks that aren't bypassing and the bypassing descendants'
-		 * tasks. The following implements a simple built-in behavior -
-		 * let each CPU try to run the bypass DSQ every Nth time.
-		 *
-		 * Later, if necessary, we can add an ops flag to suppress the
-		 * auto-consumption and a kfunc to consume the bypass DSQ and,
-		 * so that the BPF scheduler can fully control scheduling of
-		 * bypassed tasks.
-		 */
-		struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
-
-		if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
-		    scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
-			__scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
-			return true;
-		}
-#endif	/* CONFIG_EXT_SUB_SCHED */
-	}
-
-	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
-		return false;
-
-	dspc->rq = rq;
-
-	/*
-	 * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
-	 * lock, the local DSQ might still end up empty after a successful
-	 * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
-	 * produced some tasks, retry. The BPF scheduler may depend on this
-	 * looping behavior to simplify its implementation.
-	 */
-	do {
-		dspc->nr_tasks = 0;
-
-		if (nested) {
-			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
-				    prev_on_sch ? prev : NULL);
-		} else {
-			/* stash @prev so that nested invocations can access it */
-			rq->scx.sub_dispatch_prev = prev;
-			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
-				    prev_on_sch ? prev : NULL);
-			rq->scx.sub_dispatch_prev = NULL;
-		}
-
-		scx_flush_dispatch_buf(sch, rq);
-
-		if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
-			rq->scx.flags |= SCX_RQ_BAL_KEEP;
-			return true;
-		}
-		if (rq->scx.local_dsq.nr)
-			return true;
-		if (scx_consume_global_dsq(sch, rq))
-			return true;
-
-		/*
-		 * ops.dispatch() can trap us in this loop by repeatedly
-		 * dispatching ineligible tasks. Break out once in a while to
-		 * allow the watchdog to run. As IRQ can't be enabled in
-		 * balance(), we want to complete this scheduling cycle and then
-		 * start a new one. IOW, we want to call resched_curr() on the
-		 * next, most likely idle, task, not the current one. Use
-		 * __scx_bpf_kick_cpu() for deferred kicking.
-		 */
-		if (unlikely(!--nr_loops)) {
-			scx_kick_cpu(sch, cpu, 0);
-			break;
-		}
-	} while (dspc->nr_tasks);
-
-	/*
-	 * Prevent the CPU from going idle while bypassed descendants have tasks
-	 * queued. Without this fallback, bypassed tasks could stall if the host
-	 * scheduler's ops.dispatch() doesn't yield any tasks.
-	 */
-	if (scx_bypass_dsp_enabled(sch))
-		return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
-
-	return false;
-}
-
 static int balance_one(struct rq *rq, struct task_struct *prev)
 {
 	struct scx_sched *sch = scx_root;
@@ -4469,26 +4270,6 @@ static inline void scx_cgroup_lock(void) {}
 static inline void scx_cgroup_unlock(void) {}
 #endif	/* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */
 
-#ifdef CONFIG_EXT_SUB_SCHED
-static struct cgroup *sch_cgroup(struct scx_sched *sch)
-{
-	return sch->cgrp;
-}
-
-/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
-static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
-{
-	struct cgroup *pos;
-	struct cgroup_subsys_state *css;
-
-	cgroup_for_each_live_descendant_pre(pos, css, cgrp)
-		rcu_assign_pointer(pos->scx_sched, sch);
-}
-#else	/* CONFIG_EXT_SUB_SCHED */
-static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
-static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 /*
  * Omitted operations:
  *
@@ -5765,202 +5546,6 @@ void scx_log_sched_disable(struct scx_sched *sch)
 	}
 }
 
-#ifdef CONFIG_EXT_SUB_SCHED
-static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
-
-static void drain_descendants(struct scx_sched *sch)
-{
-	/*
-	 * Child scheds that finished the critical part of disabling will take
-	 * themselves off @sch->children. Wait for it to drain. As propagation
-	 * is recursive, empty @sch->children means that all proper descendant
-	 * scheds reached unlinking stage.
-	 */
-	wait_event(scx_unlink_waitq, list_empty(&sch->children));
-}
-
-static void scx_fail_parent(struct scx_sched *sch,
-			    struct task_struct *failed, s32 fail_code)
-{
-	struct scx_sched *parent = scx_parent(sch);
-	struct scx_task_iter sti;
-	struct task_struct *p;
-
-	scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
-		  fail_code, failed->comm, failed->pid);
-
-	/*
-	 * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
-	 * it. This may cause downstream failures on the BPF side but $parent is
-	 * dying anyway.
-	 */
-	scx_bypass(parent, true);
-
-	scx_task_iter_start(&sti, sch->cgrp);
-	while ((p = scx_task_iter_next_locked(&sti))) {
-		if (scx_task_on_sched(parent, p))
-			continue;
-
-		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
-			scx_disable_and_exit_task(sch, p);
-			scx_set_task_sched(p, parent);
-		}
-	}
-	scx_task_iter_stop(&sti);
-}
-
-static void scx_sub_disable(struct scx_sched *sch)
-{
-	struct scx_sched *parent = scx_parent(sch);
-	struct scx_task_iter sti;
-	struct task_struct *p;
-	int ret;
-
-	/*
-	 * Guarantee forward progress and wait for descendants to be disabled.
-	 * To limit disruptions, $parent is not bypassed. Tasks are fully
-	 * prepped and then inserted back into $parent.
-	 */
-	scx_bypass(sch, true);
-	drain_descendants(sch);
-
-	/*
-	 * Here, every runnable task is guaranteed to make forward progress and
-	 * we can safely use blocking synchronization constructs. Actually
-	 * disable ops.
-	 */
-	mutex_lock(&scx_enable_mutex);
-	percpu_down_write(&scx_fork_rwsem);
-	scx_cgroup_lock();
-
-	set_cgroup_sched(sch_cgroup(sch), parent);
-
-	scx_task_iter_start(&sti, sch->cgrp);
-	while ((p = scx_task_iter_next_locked(&sti))) {
-		struct rq *rq;
-		struct rq_flags rf;
-
-		/* filter out duplicate visits */
-		if (scx_task_on_sched(parent, p))
-			continue;
-
-		/*
-		 * By the time control reaches here, all descendant schedulers
-		 * should already have been disabled.
-		 */
-		WARN_ON_ONCE(!scx_task_on_sched(sch, p));
-
-		/*
-		 * @p is pinned by the iter: css_task_iter_next() takes a
-		 * reference and holds it until the next iter_next() call, so
-		 * @p->usage is guaranteed > 0.
-		 */
-		get_task_struct(p);
-
-		scx_task_iter_unlock(&sti);
-
-		/*
-		 * $p is READY or ENABLED on @sch. Initialize for $parent,
-		 * disable and exit from @sch, and then switch over to $parent.
-		 *
-		 * If a task fails to initialize for $parent, the only available
-		 * action is disabling $parent too. While this allows disabling
-		 * of a child sched to cause the parent scheduler to fail, the
-		 * failure can only originate from ops.init_task() of the
-		 * parent. A child can't directly affect the parent through its
-		 * own failures.
-		 */
-		ret = __scx_init_task(parent, p, false);
-		if (ret) {
-			scx_fail_parent(sch, p, ret);
-			put_task_struct(p);
-			break;
-		}
-
-		rq = task_rq_lock(p, &rf);
-
-		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
-			/*
-			 * sched_ext_dead() raced us between __scx_init_task()
-			 * and this rq lock and ran exit_task() on @sch (the
-			 * sched @p was on at that point), not on $parent.
-			 * $parent's just-completed init is owed an exit_task()
-			 * and we issue it here.
-			 */
-			scx_sub_init_cancel_task(parent, p);
-			task_rq_unlock(rq, p, &rf);
-			put_task_struct(p);
-			continue;
-		}
-
-		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
-			/*
-			 * $p is initialized for $parent and still attached to
-			 * @sch. Disable and exit for @sch, switch over to
-			 * $parent, override the state to READY to account for
-			 * $p having already been initialized, and then enable.
-			 */
-			scx_disable_and_exit_task(sch, p);
-			scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
-			scx_set_task_state(p, SCX_TASK_INIT);
-			scx_set_task_sched(p, parent);
-			scx_set_task_state(p, SCX_TASK_READY);
-			scx_enable_task(parent, p);
-		}
-
-		task_rq_unlock(rq, p, &rf);
-		put_task_struct(p);
-	}
-	scx_task_iter_stop(&sti);
-
-	scx_disable_dump(sch);
-
-	scx_cgroup_unlock();
-	percpu_up_write(&scx_fork_rwsem);
-
-	/*
-	 * All tasks are moved off of @sch but there may still be on-going
-	 * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
-	 * the expedited version as ancestors may be waiting in bypass mode.
-	 * Also, tell the parent that there is no need to keep running bypass
-	 * DSQs for us.
-	 */
-	synchronize_rcu_expedited();
-	scx_disable_bypass_dsp(sch);
-
-	scx_unlink_sched(sch);
-
-	mutex_unlock(&scx_enable_mutex);
-
-	/*
-	 * @sch is now unlinked from the parent's children list. Notify and call
-	 * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
-	 * after unlinking and releasing all locks. See scx_claim_exit().
-	 */
-	wake_up_all(&scx_unlink_waitq);
-
-	if (parent->ops.sub_detach && sch->sub_attached) {
-		struct scx_sub_detach_args sub_detach_args = {
-			.ops = &sch->ops,
-			.cgroup_path = sch->cgrp_path,
-		};
-		SCX_CALL_OP(parent, sub_detach, NULL,
-			    &sub_detach_args);
-	}
-
-	scx_log_sched_disable(sch);
-
-	if (sch->ops.exit)
-		SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
-	if (sch->sub_kset)
-		kobject_del(&sch->sub_kset->kobj);
-	kobject_del(&sch->kobj);
-}
-#else	/* CONFIG_EXT_SUB_SCHED */
-static inline void drain_descendants(struct scx_sched *sch) { }
-static inline void scx_sub_disable(struct scx_sched *sch) { }
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 static void scx_root_disable(struct scx_sched *sch)
 {
 	struct scx_task_iter sti;
@@ -7350,347 +6935,6 @@ static void scx_root_enable_workfn(struct kthread_work *work)
 	cmd->ret = 0;
 }
 
-#ifdef CONFIG_EXT_SUB_SCHED
-/* verify that a scheduler can be attached to @cgrp and return the parent */
-static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
-{
-	struct scx_sched *parent = cgrp->scx_sched;
-	struct scx_sched *pos;
-
-	lockdep_assert_held(&scx_sched_lock);
-
-	/* can't attach twice to the same cgroup */
-	if (parent->cgrp == cgrp)
-		return ERR_PTR(-EBUSY);
-
-	/* does $parent allow sub-scheds? */
-	if (!parent->ops.sub_attach)
-		return ERR_PTR(-EOPNOTSUPP);
-
-	/* can't insert between $parent and its exiting children */
-	list_for_each_entry(pos, &parent->children, sibling)
-		if (cgroup_is_descendant(pos->cgrp, cgrp))
-			return ERR_PTR(-EBUSY);
-
-	return parent;
-}
-
-static bool assert_task_ready_or_enabled(struct task_struct *p)
-{
-	u32 state = scx_get_task_state(p);
-
-	switch (state) {
-	case SCX_TASK_READY:
-	case SCX_TASK_ENABLED:
-		return true;
-	default:
-		WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
-			  state, p->comm, p->pid);
-		return false;
-	}
-}
-
-static void scx_sub_enable_workfn(struct kthread_work *work)
-{
-	struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
-	struct sched_ext_ops *ops = cmd->ops;
-	struct cgroup *cgrp;
-	struct scx_sched *parent, *sch;
-	struct scx_task_iter sti;
-	struct task_struct *p;
-	s32 i, ret;
-
-	mutex_lock(&scx_enable_mutex);
-
-	if (!scx_enabled()) {
-		ret = -ENODEV;
-		goto out_unlock;
-	}
-
-	/* See scx_root_enable_workfn() for the @ops->priv check. */
-	if (rcu_access_pointer(ops->priv)) {
-		ret = -EBUSY;
-		goto out_unlock;
-	}
-
-	cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
-	if (IS_ERR(cgrp)) {
-		ret = PTR_ERR(cgrp);
-		goto out_unlock;
-	}
-
-	raw_spin_lock_irq(&scx_sched_lock);
-	parent = find_parent_sched(cgrp);
-	if (IS_ERR(parent)) {
-		raw_spin_unlock_irq(&scx_sched_lock);
-		ret = PTR_ERR(parent);
-		goto out_put_cgrp;
-	}
-	kobject_get(&parent->kobj);
-	raw_spin_unlock_irq(&scx_sched_lock);
-
-	/* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
-	sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
-	kobject_put(&parent->kobj);
-	if (IS_ERR(sch)) {
-		ret = PTR_ERR(sch);
-		goto out_unlock;
-	}
-
-	ret = scx_link_sched(sch);
-	if (ret)
-		goto err_disable;
-
-	if (sch->level >= SCX_SUB_MAX_DEPTH) {
-		scx_error(sch, "max nesting depth %d violated",
-			  SCX_SUB_MAX_DEPTH);
-		goto err_disable;
-	}
-
-	if (sch->ops.init) {
-		ret = SCX_CALL_OP_RET(sch, init, NULL);
-		if (ret) {
-			ret = scx_ops_sanitize_err(sch, "init", ret);
-			scx_error(sch, "ops.init() failed (%d)", ret);
-			goto err_disable;
-		}
-		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
-	}
-
-	ret = scx_arena_pool_init(sch);
-	if (ret)
-		goto err_disable;
-
-	ret = scx_set_cmask_scratch_alloc(sch);
-	if (ret)
-		goto err_disable;
-
-	if (scx_validate_ops(sch, ops))
-		goto err_disable;
-
-	struct scx_sub_attach_args sub_attach_args = {
-		.ops = &sch->ops,
-		.cgroup_path = sch->cgrp_path,
-	};
-
-	ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
-			      &sub_attach_args);
-	if (ret) {
-		ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
-		scx_error(sch, "parent rejected (%d)", ret);
-		goto err_disable;
-	}
-	sch->sub_attached = true;
-
-	scx_bypass(sch, true);
-
-	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
-		if (((void (**)(void))ops)[i])
-			set_bit(i, sch->has_op);
-
-	percpu_down_write(&scx_fork_rwsem);
-	scx_cgroup_lock();
-
-	/*
-	 * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
-	 * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
-	 */
-	set_cgroup_sched(sch_cgroup(sch), sch);
-	if (!(cgrp->self.flags & CSS_ONLINE)) {
-		scx_error(sch, "cgroup is not online");
-		goto err_unlock_and_disable;
-	}
-
-	/*
-	 * Initialize tasks for the new child $sch without exiting them for
-	 * $parent so that the tasks can always be reverted back to $parent
-	 * sched on child init failure.
-	 */
-	WARN_ON_ONCE(scx_enabling_sub_sched);
-	scx_enabling_sub_sched = sch;
-
-	scx_task_iter_start(&sti, sch->cgrp);
-	while ((p = scx_task_iter_next_locked(&sti))) {
-		struct rq *rq;
-		struct rq_flags rf;
-
-		/*
-		 * Task iteration may visit the same task twice when racing
-		 * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
-		 * finished __scx_init_task() and skip if set.
-		 *
-		 * A task may exit and get freed between __scx_init_task()
-		 * completion and scx_enable_task(). In such cases,
-		 * scx_disable_and_exit_task() must exit the task for both the
-		 * parent and child scheds.
-		 */
-		if (p->scx.flags & SCX_TASK_SUB_INIT)
-			continue;
-
-		/* @p is pinned by the iter; see scx_sub_disable() */
-		get_task_struct(p);
-
-		if (!assert_task_ready_or_enabled(p)) {
-			ret = -EINVAL;
-			goto abort;
-		}
-
-		scx_task_iter_unlock(&sti);
-
-		/*
-		 * As $p is still on $parent, it can't be transitioned to INIT.
-		 * Let's worry about task state later. Use __scx_init_task().
-		 */
-		ret = __scx_init_task(sch, p, false);
-		if (ret)
-			goto abort;
-
-		rq = task_rq_lock(p, &rf);
-
-		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
-			/*
-			 * sched_ext_dead() raced us between __scx_init_task()
-			 * and this rq lock and ran exit_task() on $parent (the
-			 * sched @p was on at that point), not on @sch. @sch's
-			 * just-completed init is owed an exit_task() and we
-			 * issue it here.
-			 */
-			scx_sub_init_cancel_task(sch, p);
-			task_rq_unlock(rq, p, &rf);
-			put_task_struct(p);
-			continue;
-		}
-
-		p->scx.flags |= SCX_TASK_SUB_INIT;
-		task_rq_unlock(rq, p, &rf);
-
-		put_task_struct(p);
-	}
-	scx_task_iter_stop(&sti);
-
-	/*
-	 * All tasks are prepped. Disable/exit tasks for $parent and enable for
-	 * the new @sch.
-	 */
-	scx_task_iter_start(&sti, sch->cgrp);
-	while ((p = scx_task_iter_next_locked(&sti))) {
-		/*
-		 * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
-		 * duplicate iterations.
-		 */
-		if (!(p->scx.flags & SCX_TASK_SUB_INIT))
-			continue;
-
-		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
-			/*
-			 * $p must be either READY or ENABLED. If ENABLED,
-			 * __scx_disabled_and_exit_task() first disables and
-			 * makes it READY. However, after exiting $p, it will
-			 * leave $p as READY.
-			 */
-			assert_task_ready_or_enabled(p);
-			__scx_disable_and_exit_task(parent, p);
-
-			/*
-			 * $p is now only initialized for @sch and READY, which
-			 * is what we want. Assign it to @sch and enable.
-			 */
-			scx_set_task_sched(p, sch);
-			scx_enable_task(sch, p);
-
-			p->scx.flags &= ~SCX_TASK_SUB_INIT;
-		}
-	}
-	scx_task_iter_stop(&sti);
-
-	scx_enabling_sub_sched = NULL;
-
-	scx_cgroup_unlock();
-	percpu_up_write(&scx_fork_rwsem);
-
-	scx_bypass(sch, false);
-
-	pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
-	kobject_uevent(&sch->kobj, KOBJ_ADD);
-	ret = 0;
-	goto out_unlock;
-
-out_put_cgrp:
-	cgroup_put(cgrp);
-out_unlock:
-	mutex_unlock(&scx_enable_mutex);
-	cmd->ret = ret;
-	return;
-
-abort:
-	put_task_struct(p);
-	scx_task_iter_stop(&sti);
-
-	/*
-	 * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
-	 * ran for @sch on them, so calling scx_disable_task() here would invoke
-	 * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
-	 * must stay set until SUB_INIT is cleared from every marked task -
-	 * scx_disable_and_exit_task() reads it when a task exits concurrently.
-	 */
-	scx_task_iter_start(&sti, sch->cgrp);
-	while ((p = scx_task_iter_next_locked(&sti))) {
-		if (p->scx.flags & SCX_TASK_SUB_INIT) {
-			scx_sub_init_cancel_task(sch, p);
-			p->scx.flags &= ~SCX_TASK_SUB_INIT;
-		}
-	}
-	scx_task_iter_stop(&sti);
-	scx_enabling_sub_sched = NULL;
-err_unlock_and_disable:
-	/* we'll soon enter disable path, keep bypass on */
-	scx_cgroup_unlock();
-	percpu_up_write(&scx_fork_rwsem);
-err_disable:
-	mutex_unlock(&scx_enable_mutex);
-	scx_flush_disable_work(sch);
-	cmd->ret = 0;
-}
-
-static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
-				      unsigned long action, void *data)
-{
-	struct cgroup *cgrp = data;
-	struct cgroup *parent = cgroup_parent(cgrp);
-
-	if (!cgroup_on_dfl(cgrp))
-		return NOTIFY_OK;
-
-	switch (action) {
-	case CGROUP_LIFETIME_ONLINE:
-		/* inherit ->scx_sched from $parent */
-		if (parent)
-			rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
-		break;
-	case CGROUP_LIFETIME_OFFLINE:
-		/* if there is a sched attached, shoot it down */
-		if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
-			scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
-				 SCX_ECODE_RSN_CGROUP_OFFLINE,
-				 "cgroup %llu going offline", cgroup_id(cgrp));
-		break;
-	}
-
-	return NOTIFY_OK;
-}
-
-static struct notifier_block scx_cgroup_lifetime_nb = {
-	.notifier_call = scx_cgroup_lifetime_notify,
-};
-
-static s32 __init scx_cgroup_lifetime_notifier_init(void)
-{
-	return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
-						&scx_cgroup_lifetime_nb);
-}
-core_initcall(scx_cgroup_lifetime_notifier_init);
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link)
 {
 	static struct kthread_worker *helper;
@@ -7837,20 +7081,6 @@ static int bpf_scx_init_member(const struct btf_type *t,
 	return 0;
 }
 
-#ifdef CONFIG_EXT_SUB_SCHED
-static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
-{
-	struct scx_sched *sch;
-
-	guard(rcu)();
-	sch = scx_prog_sched(prog->aux);
-	if (unlikely(!sch))
-		return;
-
-	scx_error(sch, "dispatch recursion detected");
-}
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 static int bpf_scx_check_member(const struct btf_type *t,
 				const struct btf_member *member,
 				const struct bpf_prog *prog)
@@ -9021,45 +8251,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter,
 			    p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ);
 }
 
-#ifdef CONFIG_EXT_SUB_SCHED
-/**
- * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
- * @cgroup_id: cgroup ID of the child scheduler to dispatch
- * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
- *
- * Allows a parent scheduler to trigger dispatching on one of its direct
- * child schedulers. The child scheduler runs its dispatch operation to
- * move tasks from dispatch queues to the local runqueue.
- *
- * Returns: true on success, false if cgroup_id is invalid, not a direct
- * child, or caller lacks dispatch permission.
- */
-__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
-{
-	struct rq *this_rq = this_rq();
-	struct scx_sched *parent, *child;
-
-	guard(rcu)();
-	parent = scx_prog_sched(aux);
-	if (unlikely(!parent))
-		return false;
-
-	child = scx_find_sub_sched(cgroup_id);
-
-	if (unlikely(!child))
-		return false;
-
-	if (unlikely(scx_parent(child) != parent)) {
-		scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
-			  cgroup_id);
-		return false;
-	}
-
-	return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
-				  true);
-}
-#endif	/* CONFIG_EXT_SUB_SCHED */
-
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(scx_kfunc_ids_dispatch)
diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h
index c3b97ea4ae79..f9fe7c6ebc4b 100644
--- a/kernel/sched/ext/internal.h
+++ b/kernel/sched/ext/internal.h
@@ -11,6 +11,34 @@
 #include "../sched.h"
 #include "types.h"
 
+#include <trace/events/sched_ext.h>
+
+/**
+ * scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of the event occurred
+ *
+ * This can be used when preemption is not disabled.
+ */
+#define scx_add_event(sch, name, cnt) do {					\
+	this_cpu_add((sch)->pcpu->event_stats.name, (cnt));			\
+	trace_sched_ext_event(#name, (cnt));					\
+} while(0)
+
+/**
+ * __scx_add_event - Increase an event counter for 'name' by 'cnt'
+ * @sch: scx_sched to account events for
+ * @name: an event name defined in struct scx_event_stats
+ * @cnt: the number of the event occurred
+ *
+ * This should be used only when preemption is disabled.
+ */
+#define __scx_add_event(sch, name, cnt) do {					\
+	__this_cpu_add((sch)->pcpu->event_stats.name, (cnt));			\
+	trace_sched_ext_event(#name, cnt);					\
+} while(0)
+
 #define SCX_OP_IDX(op)		(offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
 #define SCX_MOFF_IDX(moff)	((moff) / sizeof(void (*)(void)))
 
diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
new file mode 100644
index 000000000000..050420427273
--- /dev/null
+++ b/kernel/sched/ext/sub.c
@@ -0,0 +1,668 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * A sub-scheduler is an scx_sched attached to a cgroup subtree under another
+ * scx_sched. This file holds the sub-scheduler implementation: the scheduler
+ * tree walk, capability delegation, per-shard cap state and its sync, and the
+ * sub-scheduler enable/disable paths. The core dispatch/enqueue machinery it
+ * builds on lives in ext.c.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#include <linux/rhashtable.h>
+#include "internal.h"
+#include "cid.h"
+#include "arena.h"
+#include "sub.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+/**
+ * scx_next_descendant_pre - find the next descendant for pre-order walk
+ * @pos: the current position (%NULL to initiate traversal)
+ * @root: sched whose descendants to walk
+ *
+ * To be used by scx_for_each_descendant_pre(). Find the next descendant to
+ * visit for pre-order traversal of @root's descendants. @root is included in
+ * the iteration and the first node to be visited.
+ */
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root)
+{
+	struct scx_sched *next;
+
+	lockdep_assert(lockdep_is_held(&scx_enable_mutex) ||
+		       lockdep_is_held(&scx_sched_lock));
+
+	/* if first iteration, visit @root */
+	if (!pos)
+		return root;
+
+	/* visit the first child if exists */
+	next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling);
+	if (next)
+		return next;
+
+	/* no child, visit my or the closest ancestor's next sibling */
+	while (pos != root) {
+		if (!list_is_last(&pos->sibling, &scx_parent(pos)->children))
+			return list_next_entry(pos, sibling);
+		pos = scx_parent(pos);
+	}
+
+	return NULL;
+}
+
+static struct scx_sched *scx_find_sub_sched(u64 cgroup_id)
+{
+	return rhashtable_lookup(&scx_sched_hash, &cgroup_id,
+				 scx_sched_hash_params);
+}
+
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch)
+{
+	rcu_assign_pointer(p->scx.sched, sch);
+}
+
+struct cgroup *sch_cgroup(struct scx_sched *sch)
+{
+	return sch->cgrp;
+}
+
+/* for each descendant of @cgrp including self, set ->scx_sched to @sch */
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch)
+{
+	struct cgroup *pos;
+	struct cgroup_subsys_state *css;
+
+	cgroup_for_each_live_descendant_pre(pos, css, cgrp)
+		rcu_assign_pointer(pos->scx_sched, sch);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq);
+
+void drain_descendants(struct scx_sched *sch)
+{
+	/*
+	 * Child scheds that finished the critical part of disabling will take
+	 * themselves off @sch->children. Wait for it to drain. As propagation
+	 * is recursive, empty @sch->children means that all proper descendant
+	 * scheds reached unlinking stage.
+	 */
+	wait_event(scx_unlink_waitq, list_empty(&sch->children));
+}
+
+static void scx_fail_parent(struct scx_sched *sch,
+			    struct task_struct *failed, s32 fail_code)
+{
+	struct scx_sched *parent = scx_parent(sch);
+	struct scx_task_iter sti;
+	struct task_struct *p;
+
+	scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler",
+		  fail_code, failed->comm, failed->pid);
+
+	/*
+	 * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into
+	 * it. This may cause downstream failures on the BPF side but $parent is
+	 * dying anyway.
+	 */
+	scx_bypass(parent, true);
+
+	scx_task_iter_start(&sti, sch->cgrp);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		if (scx_task_on_sched(parent, p))
+			continue;
+
+		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+			scx_disable_and_exit_task(sch, p);
+			scx_set_task_sched(p, parent);
+		}
+	}
+	scx_task_iter_stop(&sti);
+}
+
+void scx_sub_disable(struct scx_sched *sch)
+{
+	struct scx_sched *parent = scx_parent(sch);
+	struct scx_task_iter sti;
+	struct task_struct *p;
+	int ret;
+
+	/*
+	 * Guarantee forward progress and wait for descendants to be disabled.
+	 * To limit disruptions, $parent is not bypassed. Tasks are fully
+	 * prepped and then inserted back into $parent.
+	 */
+	scx_bypass(sch, true);
+	drain_descendants(sch);
+
+	/*
+	 * Here, every runnable task is guaranteed to make forward progress and
+	 * we can safely use blocking synchronization constructs. Actually
+	 * disable ops.
+	 */
+	mutex_lock(&scx_enable_mutex);
+	percpu_down_write(&scx_fork_rwsem);
+	scx_cgroup_lock();
+
+	set_cgroup_sched(sch_cgroup(sch), parent);
+
+	scx_task_iter_start(&sti, sch->cgrp);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		struct rq *rq;
+		struct rq_flags rf;
+
+		/* filter out duplicate visits */
+		if (scx_task_on_sched(parent, p))
+			continue;
+
+		/*
+		 * By the time control reaches here, all descendant schedulers
+		 * should already have been disabled.
+		 */
+		WARN_ON_ONCE(!scx_task_on_sched(sch, p));
+
+		/*
+		 * @p is pinned by the iter: css_task_iter_next() takes a
+		 * reference and holds it until the next iter_next() call, so
+		 * @p->usage is guaranteed > 0.
+		 */
+		get_task_struct(p);
+
+		scx_task_iter_unlock(&sti);
+
+		/*
+		 * $p is READY or ENABLED on @sch. Initialize for $parent,
+		 * disable and exit from @sch, and then switch over to $parent.
+		 *
+		 * If a task fails to initialize for $parent, the only available
+		 * action is disabling $parent too. While this allows disabling
+		 * of a child sched to cause the parent scheduler to fail, the
+		 * failure can only originate from ops.init_task() of the
+		 * parent. A child can't directly affect the parent through its
+		 * own failures.
+		 */
+		ret = __scx_init_task(parent, p, false);
+		if (ret) {
+			scx_fail_parent(sch, p, ret);
+			put_task_struct(p);
+			break;
+		}
+
+		rq = task_rq_lock(p, &rf);
+
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+			/*
+			 * sched_ext_dead() raced us between __scx_init_task()
+			 * and this rq lock and ran exit_task() on @sch (the
+			 * sched @p was on at that point), not on $parent.
+			 * $parent's just-completed init is owed an exit_task()
+			 * and we issue it here.
+			 */
+			scx_sub_init_cancel_task(parent, p);
+			task_rq_unlock(rq, p, &rf);
+			put_task_struct(p);
+			continue;
+		}
+
+		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+			/*
+			 * $p is initialized for $parent and still attached to
+			 * @sch. Disable and exit for @sch, switch over to
+			 * $parent, override the state to READY to account for
+			 * $p having already been initialized, and then enable.
+			 */
+			scx_disable_and_exit_task(sch, p);
+			scx_set_task_state(p, SCX_TASK_INIT_BEGIN);
+			scx_set_task_state(p, SCX_TASK_INIT);
+			scx_set_task_sched(p, parent);
+			scx_set_task_state(p, SCX_TASK_READY);
+			scx_enable_task(parent, p);
+		}
+
+		task_rq_unlock(rq, p, &rf);
+		put_task_struct(p);
+	}
+	scx_task_iter_stop(&sti);
+
+	scx_disable_dump(sch);
+
+	scx_cgroup_unlock();
+	percpu_up_write(&scx_fork_rwsem);
+
+	/*
+	 * All tasks are moved off of @sch but there may still be on-going
+	 * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
+	 * the expedited version as ancestors may be waiting in bypass mode.
+	 * Also, tell the parent that there is no need to keep running bypass
+	 * DSQs for us.
+	 */
+	synchronize_rcu_expedited();
+	scx_disable_bypass_dsp(sch);
+
+	scx_unlink_sched(sch);
+
+	mutex_unlock(&scx_enable_mutex);
+
+	/*
+	 * @sch is now unlinked from the parent's children list. Notify and call
+	 * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called
+	 * after unlinking and releasing all locks. See scx_claim_exit().
+	 */
+	wake_up_all(&scx_unlink_waitq);
+
+	if (parent->ops.sub_detach && sch->sub_attached) {
+		struct scx_sub_detach_args sub_detach_args = {
+			.ops = &sch->ops,
+			.cgroup_path = sch->cgrp_path,
+		};
+		SCX_CALL_OP(parent, sub_detach, NULL,
+			    &sub_detach_args);
+	}
+
+	scx_log_sched_disable(sch);
+
+	if (sch->ops.exit)
+		SCX_CALL_OP(sch, exit, NULL, sch->exit_info);
+	if (sch->sub_kset)
+		kobject_del(&sch->sub_kset->kobj);
+	kobject_del(&sch->kobj);
+}
+
+/* verify that a scheduler can be attached to @cgrp and return the parent */
+static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
+{
+	struct scx_sched *parent = cgrp->scx_sched;
+	struct scx_sched *pos;
+
+	lockdep_assert_held(&scx_sched_lock);
+
+	/* can't attach twice to the same cgroup */
+	if (parent->cgrp == cgrp)
+		return ERR_PTR(-EBUSY);
+
+	/* does $parent allow sub-scheds? */
+	if (!parent->ops.sub_attach)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	/* can't insert between $parent and its exiting children */
+	list_for_each_entry(pos, &parent->children, sibling)
+		if (cgroup_is_descendant(pos->cgrp, cgrp))
+			return ERR_PTR(-EBUSY);
+
+	return parent;
+}
+
+static bool assert_task_ready_or_enabled(struct task_struct *p)
+{
+	u32 state = scx_get_task_state(p);
+
+	switch (state) {
+	case SCX_TASK_READY:
+	case SCX_TASK_ENABLED:
+		return true;
+	default:
+		WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched",
+			  state, p->comm, p->pid);
+		return false;
+	}
+}
+
+void scx_sub_enable_workfn(struct kthread_work *work)
+{
+	struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work);
+	struct sched_ext_ops *ops = cmd->ops;
+	struct cgroup *cgrp;
+	struct scx_sched *parent, *sch;
+	struct scx_task_iter sti;
+	struct task_struct *p;
+	s32 i, ret;
+
+	mutex_lock(&scx_enable_mutex);
+
+	if (!scx_enabled()) {
+		ret = -ENODEV;
+		goto out_unlock;
+	}
+
+	/* See scx_root_enable_workfn() for the @ops->priv check. */
+	if (rcu_access_pointer(ops->priv)) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	cgrp = cgroup_get_from_id(ops->sub_cgroup_id);
+	if (IS_ERR(cgrp)) {
+		ret = PTR_ERR(cgrp);
+		goto out_unlock;
+	}
+
+	raw_spin_lock_irq(&scx_sched_lock);
+	parent = find_parent_sched(cgrp);
+	if (IS_ERR(parent)) {
+		raw_spin_unlock_irq(&scx_sched_lock);
+		ret = PTR_ERR(parent);
+		goto out_put_cgrp;
+	}
+	kobject_get(&parent->kobj);
+	raw_spin_unlock_irq(&scx_sched_lock);
+
+	/* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */
+	sch = scx_alloc_and_add_sched(cmd, cgrp, parent);
+	kobject_put(&parent->kobj);
+	if (IS_ERR(sch)) {
+		ret = PTR_ERR(sch);
+		goto out_unlock;
+	}
+
+	ret = scx_link_sched(sch);
+	if (ret)
+		goto err_disable;
+
+	if (sch->level >= SCX_SUB_MAX_DEPTH) {
+		scx_error(sch, "max nesting depth %d violated",
+			  SCX_SUB_MAX_DEPTH);
+		goto err_disable;
+	}
+
+	if (sch->ops.init) {
+		ret = SCX_CALL_OP_RET(sch, init, NULL);
+		if (ret) {
+			ret = scx_ops_sanitize_err(sch, "init", ret);
+			scx_error(sch, "ops.init() failed (%d)", ret);
+			goto err_disable;
+		}
+		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
+	}
+
+	ret = scx_arena_pool_init(sch);
+	if (ret)
+		goto err_disable;
+
+	ret = scx_set_cmask_scratch_alloc(sch);
+	if (ret)
+		goto err_disable;
+
+	if (scx_validate_ops(sch, ops))
+		goto err_disable;
+
+	struct scx_sub_attach_args sub_attach_args = {
+		.ops = &sch->ops,
+		.cgroup_path = sch->cgrp_path,
+	};
+
+	ret = SCX_CALL_OP_RET(parent, sub_attach, NULL,
+			      &sub_attach_args);
+	if (ret) {
+		ret = scx_ops_sanitize_err(sch, "sub_attach", ret);
+		scx_error(sch, "parent rejected (%d)", ret);
+		goto err_disable;
+	}
+	sch->sub_attached = true;
+
+	scx_bypass(sch, true);
+
+	for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++)
+		if (((void (**)(void))ops)[i])
+			set_bit(i, sch->has_op);
+
+	percpu_down_write(&scx_fork_rwsem);
+	scx_cgroup_lock();
+
+	/*
+	 * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see
+	 * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down.
+	 */
+	set_cgroup_sched(sch_cgroup(sch), sch);
+	if (!(cgrp->self.flags & CSS_ONLINE)) {
+		scx_error(sch, "cgroup is not online");
+		goto err_unlock_and_disable;
+	}
+
+	/*
+	 * Initialize tasks for the new child $sch without exiting them for
+	 * $parent so that the tasks can always be reverted back to $parent
+	 * sched on child init failure.
+	 */
+	WARN_ON_ONCE(scx_enabling_sub_sched);
+	scx_enabling_sub_sched = sch;
+
+	scx_task_iter_start(&sti, sch->cgrp);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		struct rq *rq;
+		struct rq_flags rf;
+
+		/*
+		 * Task iteration may visit the same task twice when racing
+		 * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which
+		 * finished __scx_init_task() and skip if set.
+		 *
+		 * A task may exit and get freed between __scx_init_task()
+		 * completion and scx_enable_task(). In such cases,
+		 * scx_disable_and_exit_task() must exit the task for both the
+		 * parent and child scheds.
+		 */
+		if (p->scx.flags & SCX_TASK_SUB_INIT)
+			continue;
+
+		/* @p is pinned by the iter; see scx_sub_disable() */
+		get_task_struct(p);
+
+		if (!assert_task_ready_or_enabled(p)) {
+			ret = -EINVAL;
+			goto abort;
+		}
+
+		scx_task_iter_unlock(&sti);
+
+		/*
+		 * As $p is still on $parent, it can't be transitioned to INIT.
+		 * Let's worry about task state later. Use __scx_init_task().
+		 */
+		ret = __scx_init_task(sch, p, false);
+		if (ret)
+			goto abort;
+
+		rq = task_rq_lock(p, &rf);
+
+		if (scx_get_task_state(p) == SCX_TASK_DEAD) {
+			/*
+			 * sched_ext_dead() raced us between __scx_init_task()
+			 * and this rq lock and ran exit_task() on $parent (the
+			 * sched @p was on at that point), not on @sch. @sch's
+			 * just-completed init is owed an exit_task() and we
+			 * issue it here.
+			 */
+			scx_sub_init_cancel_task(sch, p);
+			task_rq_unlock(rq, p, &rf);
+			put_task_struct(p);
+			continue;
+		}
+
+		p->scx.flags |= SCX_TASK_SUB_INIT;
+		task_rq_unlock(rq, p, &rf);
+
+		put_task_struct(p);
+	}
+	scx_task_iter_stop(&sti);
+
+	/*
+	 * All tasks are prepped. Disable/exit tasks for $parent and enable for
+	 * the new @sch.
+	 */
+	scx_task_iter_start(&sti, sch->cgrp);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		/*
+		 * Use clearing of %SCX_TASK_SUB_INIT to detect and skip
+		 * duplicate iterations.
+		 */
+		if (!(p->scx.flags & SCX_TASK_SUB_INIT))
+			continue;
+
+		scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) {
+			/*
+			 * $p must be either READY or ENABLED. If ENABLED,
+			 * __scx_disabled_and_exit_task() first disables and
+			 * makes it READY. However, after exiting $p, it will
+			 * leave $p as READY.
+			 */
+			assert_task_ready_or_enabled(p);
+			__scx_disable_and_exit_task(parent, p);
+
+			/*
+			 * $p is now only initialized for @sch and READY, which
+			 * is what we want. Assign it to @sch and enable.
+			 */
+			scx_set_task_sched(p, sch);
+			scx_enable_task(sch, p);
+
+			p->scx.flags &= ~SCX_TASK_SUB_INIT;
+		}
+	}
+	scx_task_iter_stop(&sti);
+
+	scx_enabling_sub_sched = NULL;
+
+	scx_cgroup_unlock();
+	percpu_up_write(&scx_fork_rwsem);
+
+	scx_bypass(sch, false);
+
+	pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name);
+	kobject_uevent(&sch->kobj, KOBJ_ADD);
+	ret = 0;
+	goto out_unlock;
+
+out_put_cgrp:
+	cgroup_put(cgrp);
+out_unlock:
+	mutex_unlock(&scx_enable_mutex);
+	cmd->ret = ret;
+	return;
+
+abort:
+	put_task_struct(p);
+	scx_task_iter_stop(&sti);
+
+	/*
+	 * Undo __scx_init_task() for tasks we marked. scx_enable_task() never
+	 * ran for @sch on them, so calling scx_disable_task() here would invoke
+	 * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched
+	 * must stay set until SUB_INIT is cleared from every marked task -
+	 * scx_disable_and_exit_task() reads it when a task exits concurrently.
+	 */
+	scx_task_iter_start(&sti, sch->cgrp);
+	while ((p = scx_task_iter_next_locked(&sti))) {
+		if (p->scx.flags & SCX_TASK_SUB_INIT) {
+			scx_sub_init_cancel_task(sch, p);
+			p->scx.flags &= ~SCX_TASK_SUB_INIT;
+		}
+	}
+	scx_task_iter_stop(&sti);
+	scx_enabling_sub_sched = NULL;
+err_unlock_and_disable:
+	/* we'll soon enter disable path, keep bypass on */
+	scx_cgroup_unlock();
+	percpu_up_write(&scx_fork_rwsem);
+err_disable:
+	mutex_unlock(&scx_enable_mutex);
+	scx_flush_disable_work(sch);
+	cmd->ret = 0;
+}
+
+static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb,
+				      unsigned long action, void *data)
+{
+	struct cgroup *cgrp = data;
+	struct cgroup *parent = cgroup_parent(cgrp);
+
+	if (!cgroup_on_dfl(cgrp))
+		return NOTIFY_OK;
+
+	switch (action) {
+	case CGROUP_LIFETIME_ONLINE:
+		/* inherit ->scx_sched from $parent */
+		if (parent)
+			rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched);
+		break;
+	case CGROUP_LIFETIME_OFFLINE:
+		/* if there is a sched attached, shoot it down */
+		if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp)
+			scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN,
+				 SCX_ECODE_RSN_CGROUP_OFFLINE,
+				 "cgroup %llu going offline", cgroup_id(cgrp));
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block scx_cgroup_lifetime_nb = {
+	.notifier_call = scx_cgroup_lifetime_notify,
+};
+
+static s32 __init scx_cgroup_lifetime_notifier_init(void)
+{
+	return blocking_notifier_chain_register(&cgroup_lifetime_notifier,
+						&scx_cgroup_lifetime_nb);
+}
+core_initcall(scx_cgroup_lifetime_notifier_init);
+
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog)
+{
+	struct scx_sched *sch;
+
+	guard(rcu)();
+	sch = scx_prog_sched(prog->aux);
+	if (unlikely(!sch))
+		return;
+
+	scx_error(sch, "dispatch recursion detected");
+}
+
+__bpf_kfunc_start_defs();
+
+/**
+ * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler
+ * @cgroup_id: cgroup ID of the child scheduler to dispatch
+ * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs
+ *
+ * Allows a parent scheduler to trigger dispatching on one of its direct
+ * child schedulers. The child scheduler runs its dispatch operation to
+ * move tasks from dispatch queues to the local runqueue.
+ *
+ * Returns: true on success, false if cgroup_id is invalid, not a direct
+ * child, or caller lacks dispatch permission.
+ */
+__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux)
+{
+	struct rq *this_rq = this_rq();
+	struct scx_sched *parent, *child;
+
+	guard(rcu)();
+	parent = scx_prog_sched(aux);
+	if (unlikely(!parent))
+		return false;
+
+	child = scx_find_sub_sched(cgroup_id);
+
+	if (unlikely(!child))
+		return false;
+
+	if (unlikely(scx_parent(child) != parent)) {
+		scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu",
+			  cgroup_id);
+		return false;
+	}
+
+	return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev,
+				  true);
+}
+
+__bpf_kfunc_end_defs();
+
+#endif	/* CONFIG_EXT_SUB_SCHED */
diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h
new file mode 100644
index 000000000000..460a9fd196dc
--- /dev/null
+++ b/kernel/sched/ext/sub.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst
+ *
+ * Sub-scheduler hierarchy support.
+ *
+ * Copyright (c) 2026 Meta Platforms, Inc. and affiliates.
+ * Copyright (c) 2026 Tejun Heo <tj@kernel.org>
+ */
+#ifndef _KERNEL_SCHED_EXT_SUB_H
+#define _KERNEL_SCHED_EXT_SUB_H
+
+#include "internal.h"
+#include "cid.h"
+
+#ifdef CONFIG_EXT_SUB_SCHED
+
+struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root);
+void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch);
+struct cgroup *sch_cgroup(struct scx_sched *sch);
+void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch);
+void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog);
+void drain_descendants(struct scx_sched *sch);
+void scx_sub_disable(struct scx_sched *sch);
+void scx_sub_enable_workfn(struct kthread_work *work);
+bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux);
+
+#else	/* CONFIG_EXT_SUB_SCHED */
+
+static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; }
+static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {}
+static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; }
+static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {}
+static inline void drain_descendants(struct scx_sched *sch) { }
+static inline void scx_sub_disable(struct scx_sched *sch) { }
+
+#endif	/* CONFIG_EXT_SUB_SCHED */
+
+/**
+ * scx_for_each_descendant_pre - pre-order walk of a sched's descendants
+ * @pos: iteration cursor
+ * @root: sched to walk the descendants of
+ *
+ * Walk @root's descendants. @root is included in the iteration and the first
+ * node to be visited. Must be called with either scx_enable_mutex or
+ * scx_sched_lock held.
+ */
+#define scx_for_each_descendant_pre(pos, root)					\
+	for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos);		\
+	     (pos) = scx_next_descendant_pre((pos), (root)))
+
+/*
+ * One user of this function is scx_bpf_dispatch() which can be called
+ * recursively as sub-sched dispatches nest. Always inline to reduce stack usage
+ * from the call frame.
+ */
+static __always_inline bool
+scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
+		   struct task_struct *prev, bool nested)
+{
+	struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx;
+	int nr_loops = SCX_DSP_MAX_LOOPS;
+	s32 cpu = cpu_of(rq);
+	bool prev_on_sch = (prev->sched_class == &ext_sched_class) &&
+		scx_task_on_sched(sch, prev);
+
+	if (scx_consume_global_dsq(sch, rq))
+		return true;
+
+	if (scx_bypass_dsp_enabled(sch)) {
+		/* if @sch is bypassing, only the bypass DSQs are active */
+		if (scx_bypassing(sch, cpu))
+			return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+#ifdef CONFIG_EXT_SUB_SCHED
+		/*
+		 * If @sch isn't bypassing but its children are, @sch is
+		 * responsible for making forward progress for both its own
+		 * tasks that aren't bypassing and the bypassing descendants'
+		 * tasks. The following implements a simple built-in behavior -
+		 * let each CPU try to run the bypass DSQ every Nth time.
+		 *
+		 * Later, if necessary, we can add an ops flag to suppress the
+		 * auto-consumption and a kfunc to consume the bypass DSQ and,
+		 * so that the BPF scheduler can fully control scheduling of
+		 * bypassed tasks.
+		 */
+		struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+		if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
+		    scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) {
+			__scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
+			return true;
+		}
+#endif	/* CONFIG_EXT_SUB_SCHED */
+	}
+
+	if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
+		return false;
+
+	dspc->rq = rq;
+
+	/*
+	 * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq
+	 * lock, the local DSQ might still end up empty after a successful
+	 * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
+	 * produced some tasks, retry. The BPF scheduler may depend on this
+	 * looping behavior to simplify its implementation.
+	 */
+	do {
+		dspc->nr_tasks = 0;
+
+		if (nested) {
+			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+				    prev_on_sch ? prev : NULL);
+		} else {
+			/* stash @prev so that nested invocations can access it */
+			rq->scx.sub_dispatch_prev = prev;
+			SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu),
+				    prev_on_sch ? prev : NULL);
+			rq->scx.sub_dispatch_prev = NULL;
+		}
+
+		scx_flush_dispatch_buf(sch, rq);
+
+		if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) {
+			rq->scx.flags |= SCX_RQ_BAL_KEEP;
+			return true;
+		}
+		if (rq->scx.local_dsq.nr)
+			return true;
+		if (scx_consume_global_dsq(sch, rq))
+			return true;
+
+		/*
+		 * ops.dispatch() can trap us in this loop by repeatedly
+		 * dispatching ineligible tasks. Break out once in a while to
+		 * allow the watchdog to run. As IRQ can't be enabled in
+		 * balance(), we want to complete this scheduling cycle and then
+		 * start a new one. IOW, we want to call resched_curr() on the
+		 * next, most likely idle, task, not the current one. Use
+		 * __scx_bpf_kick_cpu() for deferred kicking.
+		 */
+		if (unlikely(!--nr_loops)) {
+			scx_kick_cpu(sch, cpu, 0);
+			break;
+		}
+	} while (dspc->nr_tasks);
+
+	/*
+	 * Prevent the CPU from going idle while bypassed descendants have tasks
+	 * queued. Without this fallback, bypassed tasks could stall if the host
+	 * scheduler's ops.dispatch() doesn't yield any tasks.
+	 */
+	if (scx_bypass_dsp_enabled(sch))
+		return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0);
+
+	return false;
+}
+
+#endif /* _KERNEL_SCHED_EXT_SUB_H */
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c
  2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
                   ` (3 preceding siblings ...)
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:44 ` Tejun Heo
  4 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 20:44 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Applied 1-4 to sched_ext/for-7.3.

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
  2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
@ 2026-07-01 20:55   ` sashiko-bot
  2026-07-01 21:56     ` Tejun Heo
  0 siblings, 1 reply; 8+ messages in thread
From: sashiko-bot @ 2026-07-01 20:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: sched-ext

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

Pre-existing issues:
- [High] `scx_sub_disable` violates RCU removal ordering, leading to Use-After-Free of sub-schedulers.
- [High] NULL pointer dereference in `find_parent_sched` when attaching a sub-scheduler to a newly created cgroup.
- [High] Missing `scx_error()` calls in `scx_sub_enable_workfn` error paths cause the sub-scheduler to leak and remain bypassed indefinitely.
--

commit 3673367fbf0f16668d479297e4d6ace4a6a2dbef
Author: Tejun Heo <tj@kernel.org>
Subject: sched_ext: Split sub-scheduler implementation into sub.c

This commit moves the sub-scheduler implementation from ext.c into a new
kernel/sched/ext/sub.c file. It also creates sub.h to hold prototypes and
inline functions like scx_dispatch_sched() for better modularity.

> diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c
> new file mode 100644
> index 0000000000000..0504204272732
> --- /dev/null
> +++ b/kernel/sched/ext/sub.c

[ ... ]

> +void scx_sub_disable(struct scx_sched *sch)
> +{

[ ... ]

> +	/*
> +	 * All tasks are moved off of @sch but there may still be on-going
> +	 * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use
> +	 * the expedited version as ancestors may be waiting in bypass mode.
> +	 * Also, tell the parent that there is no need to keep running bypass
> +	 * DSQs for us.
> +	 */
> +	synchronize_rcu_expedited();
> +	scx_disable_bypass_dsp(sch);
> +
> +	scx_unlink_sched(sch);

[Severity: High]
This is a pre-existing issue, but does this violate RCU removal ordering?

The sub-scheduler is removed from the scx_sched_hash via scx_unlink_sched()
after the RCU grace period completes. Because it remains in the hash during
and after the grace period, couldn't a concurrent BPF program on the parent
scheduler call scx_bpf_sub_dispatch() and find the sub-scheduler?

Should scx_unlink_sched() be called before synchronize_rcu_expedited() to
ensure new RCU readers cannot look it up?

[ ... ]

> +/* verify that a scheduler can be attached to @cgrp and return the parent */
> +static struct scx_sched *find_parent_sched(struct cgroup *cgrp)
> +{
> +	struct scx_sched *parent = cgrp->scx_sched;
> +	struct scx_sched *pos;
> +
> +	lockdep_assert_held(&scx_sched_lock);
> +
> +	/* can't attach twice to the same cgroup */
> +	if (parent->cgrp == cgrp)

[Severity: High]
This isn't a bug introduced by this patch, but is it possible for parent
to be NULL here?

If a BPF program attaches a sub-scheduler to a newly created cgroup that
hasn't completed its ONLINE phase yet, cgrp->scx_sched might be NULL. The
code dereferences it immediately. 

The explicit check for CSS_ONLINE happens later in scx_sub_enable_workfn().
Should we verify parent is not NULL before dereferencing it?

[ ... ]

> +void scx_sub_enable_workfn(struct kthread_work *work)
> +{

[ ... ]

> +	ret = scx_link_sched(sch);
> +	if (ret)
> +		goto err_disable;
> +
> +	if (sch->level >= SCX_SUB_MAX_DEPTH) {
> +		scx_error(sch, "max nesting depth %d violated",
> +			  SCX_SUB_MAX_DEPTH);
> +		goto err_disable;
> +	}
> +
> +	if (sch->ops.init) {
> +		ret = SCX_CALL_OP_RET(sch, init, NULL);
> +		if (ret) {
> +			ret = scx_ops_sanitize_err(sch, "init", ret);
> +			scx_error(sch, "ops.init() failed (%d)", ret);
> +			goto err_disable;
> +		}
> +		sch->exit_info->flags |= SCX_EFLAG_INITIALIZED;
> +	}
> +
> +	ret = scx_arena_pool_init(sch);
> +	if (ret)
> +		goto err_disable;
> +
> +	ret = scx_set_cmask_scratch_alloc(sch);
> +	if (ret)
> +		goto err_disable;
> +
> +	if (scx_validate_ops(sch, ops))
> +		goto err_disable;

[Severity: High]
This is a pre-existing issue, but do we need to call scx_error() in these
failure paths?

If scx_link_sched(), scx_arena_pool_init(), scx_set_cmask_scratch_alloc(),
or scx_validate_ops() fail, the code jumps to err_disable without calling
scx_error().

Without scx_error(), sch->exit_kind remains SCX_EXIT_NONE. When
scx_flush_disable_work(sch) is executed at err_disable, it will immediately
return, leaving the sub-scheduler partially initialized and permanently
bypassed, while returning 0 (success) to userspace.

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260701203438.2837392-1-tj@kernel.org?part=4

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c
  2026-07-01 20:55   ` sashiko-bot
@ 2026-07-01 21:56     ` Tejun Heo
  0 siblings, 0 replies; 8+ messages in thread
From: Tejun Heo @ 2026-07-01 21:56 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: sched-ext

Hello.

> The sub-scheduler is removed from the scx_sched_hash via scx_unlink_sched()
> after the RCU grace period completes. Because it remains in the hash during
> and after the grace period, couldn't a concurrent BPF program on the parent
> scheduler call scx_bpf_sub_dispatch() and find the sub-scheduler?

It can, but that's fine. The expedited synchronize is there to drain
in-flight ops after all tasks have been moved off, not to make the sched
unreachable. The sched is freed through another RCU grace period after
unlinking and scx_bpf_sub_dispatch() runs under RCU read lock, so the
racing lookup can't see the sched freed under it. Such a dispatch is also
inert as the child is already bypassed and has no tasks left.

> If a BPF program attaches a sub-scheduler to a newly created cgroup that
> hasn't completed its ONLINE phase yet, cgrp->scx_sched might be NULL. The
> code dereferences it immediately.

The cgroup can only come from cgroup_get_from_id() which fails on kernfs
nodes which haven't been activated, and cgroup_mkdir() activates the node
after CGROUP_LIFETIME_ONLINE has fired and ->scx_sched has been inherited
from the parent. The inherited pointer is non-NULL whenever the attach can
get this far. The root enable path sets ->scx_sched on all live cgroups
before __scx_enabled is turned on, under the same scx_enable_mutex that
scx_sub_enable_workfn() holds across the scx_enabled() check. The
CSS_ONLINE check afterwards guards against destruction, not incomplete
onlining.

> This is a pre-existing issue, but do we need to call scx_error() in these
> failure paths?

Will be dealt with separately.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-07-01 21:56 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-07-01 20:34 [PATCHSET v3 sched_ext/for-7.3] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 1/4] sched_ext: Prefix file-local ext.c helpers exposed by the sub.c split Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 2/4] sched_ext: Expose the ext.c internals used " Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 3/4] sched_ext: Inline small ext.c helpers shared across " Tejun Heo
2026-07-01 20:34 ` [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Tejun Heo
2026-07-01 20:55   ` sashiko-bot
2026-07-01 21:56     ` Tejun Heo
2026-07-01 20:44 ` [PATCHSET v3 sched_ext/for-7.3] " Tejun Heo

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.