From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6F23F347514; Wed, 1 Jul 2026 20:34:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782938086; cv=none; b=n6ZegaD0Vx1hgkSp8JsoGNn/Nzw+w+PC0Haf5z+SWLv0hzFsHYQnivh1KWeB/qBmjTWn6iTMEfS+QpDZUheLCx53OO4VxRiCc2np79aYOGNm4n10AZiWk7toZscmgX74ids0y1QnK8uouqqpt34AmLEl2xdVp4Evwp3hgq+lPzg= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1782938086; c=relaxed/simple; bh=YJBDxk0VanstxHb1YEz3erGz65j3dZPN+a0+zfqCUBI=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=EoZWMnrOpfv5nPVckI6NEVh7GV8rn0dTEX8iUvk7PSoSQSNkgmz7fu8ihj/j4FUVAzLvnYspAtlG6lzUZ3x3Bd7f8H+7/Ha5Vgk/MpCbu6zAW96LAcyYg8XUFI5WvHgtr/cLMLGiIeLyyGW6264jnWEI1hJfl80Z1VUGusv6ISg= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=dk9QwOnV; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="dk9QwOnV" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 148B81F000E9; Wed, 1 Jul 2026 20:34:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1782938083; bh=VDKTeZZUyHW5L4pdCZQ+KtUAp+cHhgQ1EYZD4K02PBA=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=dk9QwOnVKzyTW+ZQXfV4x+b2Nz9PXiDOrOL/MeuU8q0KTiSF2UHXTaZ3Q83rXn2M6 GZIBM9AiYgFhNYDFneW0TtFHjRL1eAUNJnCZw2M4wo8VIAICbDqPGus2j8VxMqng5f mZSU3LWBV5QvNSiRcPMiGW0HbwxX4O3K5f08eu8+Q8GnMlomQ32gITPSo6xC7pNEsB rUVzab4GJfE7NfWu56O9pFhyFs7aKSJ7B2NrZEO4oFPsiljNePBLLIBIucfXitZste d9HYwV0znM433J04njiQzyJvscDMr7GoAtkgqBownZgp4QRT+SDqm6bjTQPzKWO7mM dL2IchqtaYu1Q== From: Tejun Heo To: David Vernet , Andrea Righi , Changwoo Min Cc: sched-ext@lists.linux.dev, Emil Tsalapatis , linux-kernel@vger.kernel.org, Tejun Heo Subject: [PATCH v3 sched_ext/for-7.3 4/4] sched_ext: Split sub-scheduler implementation into sub.c Date: Wed, 1 Jul 2026 10:34:38 -1000 Message-ID: <20260701203438.2837392-5-tj@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260701203438.2837392-1-tj@kernel.org> References: <20260701203438.2837392-1-tj@kernel.org> Precedence: bulk X-Mailing-List: sched-ext@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit The sub-scheduler implementation has grown and will continue to expand. Move the sub-scheduler functions from ext.c into a new kernel/sched/ext/sub.c. sub.h holds the prototypes and the !CONFIG_EXT_SUB_SCHED no-op stubs. scx_dispatch_sched() is shared: balance_one() in ext.c and the scx_bpf_sub_dispatch() kfunc in sub.c both call it, and the latter re-enters it as sub-scheduler dispatch nests. It moves into sub.h as a static __always_inline so both callers keep it inlined and per-level stack stays bounded across the recursion. The event macros it uses move to internal.h. No functional change. Signed-off-by: Tejun Heo Reviewed-by: Andrea Righi --- v2: Fold the scx_dispatch_sched() sub.h promotion into this patch (was a separate later patch in v1) so the split is self-contained (Andrea). kernel/sched/build_policy.c | 2 + kernel/sched/ext/ext.c | 811 +----------------------------------- kernel/sched/ext/internal.h | 28 ++ kernel/sched/ext/sub.c | 668 +++++++++++++++++++++++++++++ kernel/sched/ext/sub.h | 161 +++++++ 5 files changed, 860 insertions(+), 810 deletions(-) create mode 100644 kernel/sched/ext/sub.c create mode 100644 kernel/sched/ext/sub.h diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c index d74b54f81992..01dc7bf89af8 100644 --- a/kernel/sched/build_policy.c +++ b/kernel/sched/build_policy.c @@ -66,10 +66,12 @@ # include "ext/cid.h" # include "ext/arena.h" # include "ext/idle.h" +# include "ext/sub.h" # include "ext/ext.c" # include "ext/cid.c" # include "ext/arena.c" # include "ext/idle.c" +# include "ext/sub.c" #endif #include "syscalls.c" diff --git a/kernel/sched/ext/ext.c b/kernel/sched/ext/ext.c index d1ef79c1038d..1a0ec985da77 100644 --- a/kernel/sched/ext/ext.c +++ b/kernel/sched/ext/ext.c @@ -19,6 +19,7 @@ #include "cid.h" #include "arena.h" #include "idle.h" +#include "sub.h" DEFINE_RAW_SPINLOCK(scx_sched_lock); @@ -271,58 +272,6 @@ static bool u32_before(u32 a, u32 b) return (s32)(a - b) < 0; } -#ifdef CONFIG_EXT_SUB_SCHED -/** - * scx_next_descendant_pre - find the next descendant for pre-order walk - * @pos: the current position (%NULL to initiate traversal) - * @root: sched whose descendants to walk - * - * To be used by scx_for_each_descendant_pre(). Find the next descendant to - * visit for pre-order traversal of @root's descendants. @root is included in - * the iteration and the first node to be visited. - */ -static struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, - struct scx_sched *root) -{ - struct scx_sched *next; - - lockdep_assert(lockdep_is_held(&scx_enable_mutex) || - lockdep_is_held(&scx_sched_lock)); - - /* if first iteration, visit @root */ - if (!pos) - return root; - - /* visit the first child if exists */ - next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling); - if (next) - return next; - - /* no child, visit my or the closest ancestor's next sibling */ - while (pos != root) { - if (!list_is_last(&pos->sibling, &scx_parent(pos)->children)) - return list_next_entry(pos, sibling); - pos = scx_parent(pos); - } - - return NULL; -} - -static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) -{ - return rhashtable_lookup(&scx_sched_hash, &cgroup_id, - scx_sched_hash_params); -} - -static void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) -{ - rcu_assign_pointer(p->scx.sched, sch); -} -#else /* CONFIG_EXT_SUB_SCHED */ -static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; } -static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {} -#endif /* CONFIG_EXT_SUB_SCHED */ - /** * scx_is_descendant - Test whether sched is a descendant * @sch: sched to test @@ -337,19 +286,6 @@ static bool scx_is_descendant(struct scx_sched *sch, struct scx_sched *ancestor) return sch->ancestors[ancestor->level] == ancestor; } -/** - * scx_for_each_descendant_pre - pre-order walk of a sched's descendants - * @pos: iteration cursor - * @root: sched to walk the descendants of - * - * Walk @root's descendants. @root is included in the iteration and the first - * node to be visited. Must be called with either scx_enable_mutex or - * scx_sched_lock held. - */ -#define scx_for_each_descendant_pre(pos, root) \ - for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \ - (pos) = scx_next_descendant_pre((pos), (root))) - static struct scx_dispatch_q *find_global_dsq(struct scx_sched *sch, s32 cpu) { return &sch->pnode[cpu_to_node(cpu)]->global_dsq; @@ -935,32 +871,6 @@ struct task_struct *scx_task_iter_next_locked(struct scx_task_iter *iter) return NULL; } -/** - * scx_add_event - Increase an event counter for 'name' by 'cnt' - * @sch: scx_sched to account events for - * @name: an event name defined in struct scx_event_stats - * @cnt: the number of the event occurred - * - * This can be used when preemption is not disabled. - */ -#define scx_add_event(sch, name, cnt) do { \ - this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \ - trace_sched_ext_event(#name, (cnt)); \ -} while(0) - -/** - * __scx_add_event - Increase an event counter for 'name' by 'cnt' - * @sch: scx_sched to account events for - * @name: an event name defined in struct scx_event_stats - * @cnt: the number of the event occurred - * - * This should be used only when preemption is disabled. - */ -#define __scx_add_event(sch, name, cnt) do { \ - __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \ - trace_sched_ext_event(#name, cnt); \ -} while(0) - /** * scx_dump_event - Dump an event 'kind' in 'events' to 's' * @s: output seq_buf @@ -2681,115 +2591,6 @@ static inline void maybe_queue_balance_callback(struct rq *rq) rq->scx.flags &= ~SCX_RQ_BAL_CB_PENDING; } -/* - * One user of this function is scx_bpf_dispatch() which can be called - * recursively as sub-sched dispatches nest. Always inline to reduce stack usage - * from the call frame. - */ -static __always_inline bool -scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, - struct task_struct *prev, bool nested) -{ - struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; - int nr_loops = SCX_DSP_MAX_LOOPS; - s32 cpu = cpu_of(rq); - bool prev_on_sch = (prev->sched_class == &ext_sched_class) && - scx_task_on_sched(sch, prev); - - if (scx_consume_global_dsq(sch, rq)) - return true; - - if (scx_bypass_dsp_enabled(sch)) { - /* if @sch is bypassing, only the bypass DSQs are active */ - if (scx_bypassing(sch, cpu)) - return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0); - -#ifdef CONFIG_EXT_SUB_SCHED - /* - * If @sch isn't bypassing but its children are, @sch is - * responsible for making forward progress for both its own - * tasks that aren't bypassing and the bypassing descendants' - * tasks. The following implements a simple built-in behavior - - * let each CPU try to run the bypass DSQ every Nth time. - * - * Later, if necessary, we can add an ops flag to suppress the - * auto-consumption and a kfunc to consume the bypass DSQ and, - * so that the BPF scheduler can fully control scheduling of - * bypassed tasks. - */ - struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); - - if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) && - scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) { - __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1); - return true; - } -#endif /* CONFIG_EXT_SUB_SCHED */ - } - - if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) - return false; - - dspc->rq = rq; - - /* - * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq - * lock, the local DSQ might still end up empty after a successful - * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() - * produced some tasks, retry. The BPF scheduler may depend on this - * looping behavior to simplify its implementation. - */ - do { - dspc->nr_tasks = 0; - - if (nested) { - SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu), - prev_on_sch ? prev : NULL); - } else { - /* stash @prev so that nested invocations can access it */ - rq->scx.sub_dispatch_prev = prev; - SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu), - prev_on_sch ? prev : NULL); - rq->scx.sub_dispatch_prev = NULL; - } - - scx_flush_dispatch_buf(sch, rq); - - if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) { - rq->scx.flags |= SCX_RQ_BAL_KEEP; - return true; - } - if (rq->scx.local_dsq.nr) - return true; - if (scx_consume_global_dsq(sch, rq)) - return true; - - /* - * ops.dispatch() can trap us in this loop by repeatedly - * dispatching ineligible tasks. Break out once in a while to - * allow the watchdog to run. As IRQ can't be enabled in - * balance(), we want to complete this scheduling cycle and then - * start a new one. IOW, we want to call resched_curr() on the - * next, most likely idle, task, not the current one. Use - * __scx_bpf_kick_cpu() for deferred kicking. - */ - if (unlikely(!--nr_loops)) { - scx_kick_cpu(sch, cpu, 0); - break; - } - } while (dspc->nr_tasks); - - /* - * Prevent the CPU from going idle while bypassed descendants have tasks - * queued. Without this fallback, bypassed tasks could stall if the host - * scheduler's ops.dispatch() doesn't yield any tasks. - */ - if (scx_bypass_dsp_enabled(sch)) - return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0); - - return false; -} - static int balance_one(struct rq *rq, struct task_struct *prev) { struct scx_sched *sch = scx_root; @@ -4469,26 +4270,6 @@ static inline void scx_cgroup_lock(void) {} static inline void scx_cgroup_unlock(void) {} #endif /* CONFIG_EXT_GROUP_SCHED || CONFIG_EXT_SUB_SCHED */ -#ifdef CONFIG_EXT_SUB_SCHED -static struct cgroup *sch_cgroup(struct scx_sched *sch) -{ - return sch->cgrp; -} - -/* for each descendant of @cgrp including self, set ->scx_sched to @sch */ -static void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) -{ - struct cgroup *pos; - struct cgroup_subsys_state *css; - - cgroup_for_each_live_descendant_pre(pos, css, cgrp) - rcu_assign_pointer(pos->scx_sched, sch); -} -#else /* CONFIG_EXT_SUB_SCHED */ -static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } -static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} -#endif /* CONFIG_EXT_SUB_SCHED */ - /* * Omitted operations: * @@ -5765,202 +5546,6 @@ void scx_log_sched_disable(struct scx_sched *sch) } } -#ifdef CONFIG_EXT_SUB_SCHED -static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq); - -static void drain_descendants(struct scx_sched *sch) -{ - /* - * Child scheds that finished the critical part of disabling will take - * themselves off @sch->children. Wait for it to drain. As propagation - * is recursive, empty @sch->children means that all proper descendant - * scheds reached unlinking stage. - */ - wait_event(scx_unlink_waitq, list_empty(&sch->children)); -} - -static void scx_fail_parent(struct scx_sched *sch, - struct task_struct *failed, s32 fail_code) -{ - struct scx_sched *parent = scx_parent(sch); - struct scx_task_iter sti; - struct task_struct *p; - - scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler", - fail_code, failed->comm, failed->pid); - - /* - * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into - * it. This may cause downstream failures on the BPF side but $parent is - * dying anyway. - */ - scx_bypass(parent, true); - - scx_task_iter_start(&sti, sch->cgrp); - while ((p = scx_task_iter_next_locked(&sti))) { - if (scx_task_on_sched(parent, p)) - continue; - - scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { - scx_disable_and_exit_task(sch, p); - scx_set_task_sched(p, parent); - } - } - scx_task_iter_stop(&sti); -} - -static void scx_sub_disable(struct scx_sched *sch) -{ - struct scx_sched *parent = scx_parent(sch); - struct scx_task_iter sti; - struct task_struct *p; - int ret; - - /* - * Guarantee forward progress and wait for descendants to be disabled. - * To limit disruptions, $parent is not bypassed. Tasks are fully - * prepped and then inserted back into $parent. - */ - scx_bypass(sch, true); - drain_descendants(sch); - - /* - * Here, every runnable task is guaranteed to make forward progress and - * we can safely use blocking synchronization constructs. Actually - * disable ops. - */ - mutex_lock(&scx_enable_mutex); - percpu_down_write(&scx_fork_rwsem); - scx_cgroup_lock(); - - set_cgroup_sched(sch_cgroup(sch), parent); - - scx_task_iter_start(&sti, sch->cgrp); - while ((p = scx_task_iter_next_locked(&sti))) { - struct rq *rq; - struct rq_flags rf; - - /* filter out duplicate visits */ - if (scx_task_on_sched(parent, p)) - continue; - - /* - * By the time control reaches here, all descendant schedulers - * should already have been disabled. - */ - WARN_ON_ONCE(!scx_task_on_sched(sch, p)); - - /* - * @p is pinned by the iter: css_task_iter_next() takes a - * reference and holds it until the next iter_next() call, so - * @p->usage is guaranteed > 0. - */ - get_task_struct(p); - - scx_task_iter_unlock(&sti); - - /* - * $p is READY or ENABLED on @sch. Initialize for $parent, - * disable and exit from @sch, and then switch over to $parent. - * - * If a task fails to initialize for $parent, the only available - * action is disabling $parent too. While this allows disabling - * of a child sched to cause the parent scheduler to fail, the - * failure can only originate from ops.init_task() of the - * parent. A child can't directly affect the parent through its - * own failures. - */ - ret = __scx_init_task(parent, p, false); - if (ret) { - scx_fail_parent(sch, p, ret); - put_task_struct(p); - break; - } - - rq = task_rq_lock(p, &rf); - - if (scx_get_task_state(p) == SCX_TASK_DEAD) { - /* - * sched_ext_dead() raced us between __scx_init_task() - * and this rq lock and ran exit_task() on @sch (the - * sched @p was on at that point), not on $parent. - * $parent's just-completed init is owed an exit_task() - * and we issue it here. - */ - scx_sub_init_cancel_task(parent, p); - task_rq_unlock(rq, p, &rf); - put_task_struct(p); - continue; - } - - scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { - /* - * $p is initialized for $parent and still attached to - * @sch. Disable and exit for @sch, switch over to - * $parent, override the state to READY to account for - * $p having already been initialized, and then enable. - */ - scx_disable_and_exit_task(sch, p); - scx_set_task_state(p, SCX_TASK_INIT_BEGIN); - scx_set_task_state(p, SCX_TASK_INIT); - scx_set_task_sched(p, parent); - scx_set_task_state(p, SCX_TASK_READY); - scx_enable_task(parent, p); - } - - task_rq_unlock(rq, p, &rf); - put_task_struct(p); - } - scx_task_iter_stop(&sti); - - scx_disable_dump(sch); - - scx_cgroup_unlock(); - percpu_up_write(&scx_fork_rwsem); - - /* - * All tasks are moved off of @sch but there may still be on-going - * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use - * the expedited version as ancestors may be waiting in bypass mode. - * Also, tell the parent that there is no need to keep running bypass - * DSQs for us. - */ - synchronize_rcu_expedited(); - scx_disable_bypass_dsp(sch); - - scx_unlink_sched(sch); - - mutex_unlock(&scx_enable_mutex); - - /* - * @sch is now unlinked from the parent's children list. Notify and call - * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called - * after unlinking and releasing all locks. See scx_claim_exit(). - */ - wake_up_all(&scx_unlink_waitq); - - if (parent->ops.sub_detach && sch->sub_attached) { - struct scx_sub_detach_args sub_detach_args = { - .ops = &sch->ops, - .cgroup_path = sch->cgrp_path, - }; - SCX_CALL_OP(parent, sub_detach, NULL, - &sub_detach_args); - } - - scx_log_sched_disable(sch); - - if (sch->ops.exit) - SCX_CALL_OP(sch, exit, NULL, sch->exit_info); - if (sch->sub_kset) - kobject_del(&sch->sub_kset->kobj); - kobject_del(&sch->kobj); -} -#else /* CONFIG_EXT_SUB_SCHED */ -static inline void drain_descendants(struct scx_sched *sch) { } -static inline void scx_sub_disable(struct scx_sched *sch) { } -#endif /* CONFIG_EXT_SUB_SCHED */ - static void scx_root_disable(struct scx_sched *sch) { struct scx_task_iter sti; @@ -7350,347 +6935,6 @@ static void scx_root_enable_workfn(struct kthread_work *work) cmd->ret = 0; } -#ifdef CONFIG_EXT_SUB_SCHED -/* verify that a scheduler can be attached to @cgrp and return the parent */ -static struct scx_sched *find_parent_sched(struct cgroup *cgrp) -{ - struct scx_sched *parent = cgrp->scx_sched; - struct scx_sched *pos; - - lockdep_assert_held(&scx_sched_lock); - - /* can't attach twice to the same cgroup */ - if (parent->cgrp == cgrp) - return ERR_PTR(-EBUSY); - - /* does $parent allow sub-scheds? */ - if (!parent->ops.sub_attach) - return ERR_PTR(-EOPNOTSUPP); - - /* can't insert between $parent and its exiting children */ - list_for_each_entry(pos, &parent->children, sibling) - if (cgroup_is_descendant(pos->cgrp, cgrp)) - return ERR_PTR(-EBUSY); - - return parent; -} - -static bool assert_task_ready_or_enabled(struct task_struct *p) -{ - u32 state = scx_get_task_state(p); - - switch (state) { - case SCX_TASK_READY: - case SCX_TASK_ENABLED: - return true; - default: - WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched", - state, p->comm, p->pid); - return false; - } -} - -static void scx_sub_enable_workfn(struct kthread_work *work) -{ - struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); - struct sched_ext_ops *ops = cmd->ops; - struct cgroup *cgrp; - struct scx_sched *parent, *sch; - struct scx_task_iter sti; - struct task_struct *p; - s32 i, ret; - - mutex_lock(&scx_enable_mutex); - - if (!scx_enabled()) { - ret = -ENODEV; - goto out_unlock; - } - - /* See scx_root_enable_workfn() for the @ops->priv check. */ - if (rcu_access_pointer(ops->priv)) { - ret = -EBUSY; - goto out_unlock; - } - - cgrp = cgroup_get_from_id(ops->sub_cgroup_id); - if (IS_ERR(cgrp)) { - ret = PTR_ERR(cgrp); - goto out_unlock; - } - - raw_spin_lock_irq(&scx_sched_lock); - parent = find_parent_sched(cgrp); - if (IS_ERR(parent)) { - raw_spin_unlock_irq(&scx_sched_lock); - ret = PTR_ERR(parent); - goto out_put_cgrp; - } - kobject_get(&parent->kobj); - raw_spin_unlock_irq(&scx_sched_lock); - - /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */ - sch = scx_alloc_and_add_sched(cmd, cgrp, parent); - kobject_put(&parent->kobj); - if (IS_ERR(sch)) { - ret = PTR_ERR(sch); - goto out_unlock; - } - - ret = scx_link_sched(sch); - if (ret) - goto err_disable; - - if (sch->level >= SCX_SUB_MAX_DEPTH) { - scx_error(sch, "max nesting depth %d violated", - SCX_SUB_MAX_DEPTH); - goto err_disable; - } - - if (sch->ops.init) { - ret = SCX_CALL_OP_RET(sch, init, NULL); - if (ret) { - ret = scx_ops_sanitize_err(sch, "init", ret); - scx_error(sch, "ops.init() failed (%d)", ret); - goto err_disable; - } - sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; - } - - ret = scx_arena_pool_init(sch); - if (ret) - goto err_disable; - - ret = scx_set_cmask_scratch_alloc(sch); - if (ret) - goto err_disable; - - if (scx_validate_ops(sch, ops)) - goto err_disable; - - struct scx_sub_attach_args sub_attach_args = { - .ops = &sch->ops, - .cgroup_path = sch->cgrp_path, - }; - - ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, - &sub_attach_args); - if (ret) { - ret = scx_ops_sanitize_err(sch, "sub_attach", ret); - scx_error(sch, "parent rejected (%d)", ret); - goto err_disable; - } - sch->sub_attached = true; - - scx_bypass(sch, true); - - for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) - if (((void (**)(void))ops)[i]) - set_bit(i, sch->has_op); - - percpu_down_write(&scx_fork_rwsem); - scx_cgroup_lock(); - - /* - * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see - * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down. - */ - set_cgroup_sched(sch_cgroup(sch), sch); - if (!(cgrp->self.flags & CSS_ONLINE)) { - scx_error(sch, "cgroup is not online"); - goto err_unlock_and_disable; - } - - /* - * Initialize tasks for the new child $sch without exiting them for - * $parent so that the tasks can always be reverted back to $parent - * sched on child init failure. - */ - WARN_ON_ONCE(scx_enabling_sub_sched); - scx_enabling_sub_sched = sch; - - scx_task_iter_start(&sti, sch->cgrp); - while ((p = scx_task_iter_next_locked(&sti))) { - struct rq *rq; - struct rq_flags rf; - - /* - * Task iteration may visit the same task twice when racing - * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which - * finished __scx_init_task() and skip if set. - * - * A task may exit and get freed between __scx_init_task() - * completion and scx_enable_task(). In such cases, - * scx_disable_and_exit_task() must exit the task for both the - * parent and child scheds. - */ - if (p->scx.flags & SCX_TASK_SUB_INIT) - continue; - - /* @p is pinned by the iter; see scx_sub_disable() */ - get_task_struct(p); - - if (!assert_task_ready_or_enabled(p)) { - ret = -EINVAL; - goto abort; - } - - scx_task_iter_unlock(&sti); - - /* - * As $p is still on $parent, it can't be transitioned to INIT. - * Let's worry about task state later. Use __scx_init_task(). - */ - ret = __scx_init_task(sch, p, false); - if (ret) - goto abort; - - rq = task_rq_lock(p, &rf); - - if (scx_get_task_state(p) == SCX_TASK_DEAD) { - /* - * sched_ext_dead() raced us between __scx_init_task() - * and this rq lock and ran exit_task() on $parent (the - * sched @p was on at that point), not on @sch. @sch's - * just-completed init is owed an exit_task() and we - * issue it here. - */ - scx_sub_init_cancel_task(sch, p); - task_rq_unlock(rq, p, &rf); - put_task_struct(p); - continue; - } - - p->scx.flags |= SCX_TASK_SUB_INIT; - task_rq_unlock(rq, p, &rf); - - put_task_struct(p); - } - scx_task_iter_stop(&sti); - - /* - * All tasks are prepped. Disable/exit tasks for $parent and enable for - * the new @sch. - */ - scx_task_iter_start(&sti, sch->cgrp); - while ((p = scx_task_iter_next_locked(&sti))) { - /* - * Use clearing of %SCX_TASK_SUB_INIT to detect and skip - * duplicate iterations. - */ - if (!(p->scx.flags & SCX_TASK_SUB_INIT)) - continue; - - scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { - /* - * $p must be either READY or ENABLED. If ENABLED, - * __scx_disabled_and_exit_task() first disables and - * makes it READY. However, after exiting $p, it will - * leave $p as READY. - */ - assert_task_ready_or_enabled(p); - __scx_disable_and_exit_task(parent, p); - - /* - * $p is now only initialized for @sch and READY, which - * is what we want. Assign it to @sch and enable. - */ - scx_set_task_sched(p, sch); - scx_enable_task(sch, p); - - p->scx.flags &= ~SCX_TASK_SUB_INIT; - } - } - scx_task_iter_stop(&sti); - - scx_enabling_sub_sched = NULL; - - scx_cgroup_unlock(); - percpu_up_write(&scx_fork_rwsem); - - scx_bypass(sch, false); - - pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name); - kobject_uevent(&sch->kobj, KOBJ_ADD); - ret = 0; - goto out_unlock; - -out_put_cgrp: - cgroup_put(cgrp); -out_unlock: - mutex_unlock(&scx_enable_mutex); - cmd->ret = ret; - return; - -abort: - put_task_struct(p); - scx_task_iter_stop(&sti); - - /* - * Undo __scx_init_task() for tasks we marked. scx_enable_task() never - * ran for @sch on them, so calling scx_disable_task() here would invoke - * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched - * must stay set until SUB_INIT is cleared from every marked task - - * scx_disable_and_exit_task() reads it when a task exits concurrently. - */ - scx_task_iter_start(&sti, sch->cgrp); - while ((p = scx_task_iter_next_locked(&sti))) { - if (p->scx.flags & SCX_TASK_SUB_INIT) { - scx_sub_init_cancel_task(sch, p); - p->scx.flags &= ~SCX_TASK_SUB_INIT; - } - } - scx_task_iter_stop(&sti); - scx_enabling_sub_sched = NULL; -err_unlock_and_disable: - /* we'll soon enter disable path, keep bypass on */ - scx_cgroup_unlock(); - percpu_up_write(&scx_fork_rwsem); -err_disable: - mutex_unlock(&scx_enable_mutex); - scx_flush_disable_work(sch); - cmd->ret = 0; -} - -static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb, - unsigned long action, void *data) -{ - struct cgroup *cgrp = data; - struct cgroup *parent = cgroup_parent(cgrp); - - if (!cgroup_on_dfl(cgrp)) - return NOTIFY_OK; - - switch (action) { - case CGROUP_LIFETIME_ONLINE: - /* inherit ->scx_sched from $parent */ - if (parent) - rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched); - break; - case CGROUP_LIFETIME_OFFLINE: - /* if there is a sched attached, shoot it down */ - if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp) - scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN, - SCX_ECODE_RSN_CGROUP_OFFLINE, - "cgroup %llu going offline", cgroup_id(cgrp)); - break; - } - - return NOTIFY_OK; -} - -static struct notifier_block scx_cgroup_lifetime_nb = { - .notifier_call = scx_cgroup_lifetime_notify, -}; - -static s32 __init scx_cgroup_lifetime_notifier_init(void) -{ - return blocking_notifier_chain_register(&cgroup_lifetime_notifier, - &scx_cgroup_lifetime_nb); -} -core_initcall(scx_cgroup_lifetime_notifier_init); -#endif /* CONFIG_EXT_SUB_SCHED */ - static s32 scx_enable(struct scx_enable_cmd *cmd, struct bpf_link *link) { static struct kthread_worker *helper; @@ -7837,20 +7081,6 @@ static int bpf_scx_init_member(const struct btf_type *t, return 0; } -#ifdef CONFIG_EXT_SUB_SCHED -static void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog) -{ - struct scx_sched *sch; - - guard(rcu)(); - sch = scx_prog_sched(prog->aux); - if (unlikely(!sch)) - return; - - scx_error(sch, "dispatch recursion detected"); -} -#endif /* CONFIG_EXT_SUB_SCHED */ - static int bpf_scx_check_member(const struct btf_type *t, const struct btf_member *member, const struct bpf_prog *prog) @@ -9021,45 +8251,6 @@ __bpf_kfunc bool scx_bpf_dsq_move_vtime(struct bpf_iter_scx_dsq *it__iter, p, dsq_id, enq_flags | SCX_ENQ_DSQ_PRIQ); } -#ifdef CONFIG_EXT_SUB_SCHED -/** - * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler - * @cgroup_id: cgroup ID of the child scheduler to dispatch - * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs - * - * Allows a parent scheduler to trigger dispatching on one of its direct - * child schedulers. The child scheduler runs its dispatch operation to - * move tasks from dispatch queues to the local runqueue. - * - * Returns: true on success, false if cgroup_id is invalid, not a direct - * child, or caller lacks dispatch permission. - */ -__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux) -{ - struct rq *this_rq = this_rq(); - struct scx_sched *parent, *child; - - guard(rcu)(); - parent = scx_prog_sched(aux); - if (unlikely(!parent)) - return false; - - child = scx_find_sub_sched(cgroup_id); - - if (unlikely(!child)) - return false; - - if (unlikely(scx_parent(child) != parent)) { - scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu", - cgroup_id); - return false; - } - - return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev, - true); -} -#endif /* CONFIG_EXT_SUB_SCHED */ - __bpf_kfunc_end_defs(); BTF_KFUNCS_START(scx_kfunc_ids_dispatch) diff --git a/kernel/sched/ext/internal.h b/kernel/sched/ext/internal.h index c3b97ea4ae79..f9fe7c6ebc4b 100644 --- a/kernel/sched/ext/internal.h +++ b/kernel/sched/ext/internal.h @@ -11,6 +11,34 @@ #include "../sched.h" #include "types.h" +#include + +/** + * scx_add_event - Increase an event counter for 'name' by 'cnt' + * @sch: scx_sched to account events for + * @name: an event name defined in struct scx_event_stats + * @cnt: the number of the event occurred + * + * This can be used when preemption is not disabled. + */ +#define scx_add_event(sch, name, cnt) do { \ + this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \ + trace_sched_ext_event(#name, (cnt)); \ +} while(0) + +/** + * __scx_add_event - Increase an event counter for 'name' by 'cnt' + * @sch: scx_sched to account events for + * @name: an event name defined in struct scx_event_stats + * @cnt: the number of the event occurred + * + * This should be used only when preemption is disabled. + */ +#define __scx_add_event(sch, name, cnt) do { \ + __this_cpu_add((sch)->pcpu->event_stats.name, (cnt)); \ + trace_sched_ext_event(#name, cnt); \ +} while(0) + #define SCX_OP_IDX(op) (offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void))) #define SCX_MOFF_IDX(moff) ((moff) / sizeof(void (*)(void))) diff --git a/kernel/sched/ext/sub.c b/kernel/sched/ext/sub.c new file mode 100644 index 000000000000..050420427273 --- /dev/null +++ b/kernel/sched/ext/sub.c @@ -0,0 +1,668 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst + * + * Sub-scheduler hierarchy support. + * + * A sub-scheduler is an scx_sched attached to a cgroup subtree under another + * scx_sched. This file holds the sub-scheduler implementation: the scheduler + * tree walk, capability delegation, per-shard cap state and its sync, and the + * sub-scheduler enable/disable paths. The core dispatch/enqueue machinery it + * builds on lives in ext.c. + * + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2026 Tejun Heo + */ +#include +#include "internal.h" +#include "cid.h" +#include "arena.h" +#include "sub.h" + +#ifdef CONFIG_EXT_SUB_SCHED + +/** + * scx_next_descendant_pre - find the next descendant for pre-order walk + * @pos: the current position (%NULL to initiate traversal) + * @root: sched whose descendants to walk + * + * To be used by scx_for_each_descendant_pre(). Find the next descendant to + * visit for pre-order traversal of @root's descendants. @root is included in + * the iteration and the first node to be visited. + */ +struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) +{ + struct scx_sched *next; + + lockdep_assert(lockdep_is_held(&scx_enable_mutex) || + lockdep_is_held(&scx_sched_lock)); + + /* if first iteration, visit @root */ + if (!pos) + return root; + + /* visit the first child if exists */ + next = list_first_entry_or_null(&pos->children, struct scx_sched, sibling); + if (next) + return next; + + /* no child, visit my or the closest ancestor's next sibling */ + while (pos != root) { + if (!list_is_last(&pos->sibling, &scx_parent(pos)->children)) + return list_next_entry(pos, sibling); + pos = scx_parent(pos); + } + + return NULL; +} + +static struct scx_sched *scx_find_sub_sched(u64 cgroup_id) +{ + return rhashtable_lookup(&scx_sched_hash, &cgroup_id, + scx_sched_hash_params); +} + +void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) +{ + rcu_assign_pointer(p->scx.sched, sch); +} + +struct cgroup *sch_cgroup(struct scx_sched *sch) +{ + return sch->cgrp; +} + +/* for each descendant of @cgrp including self, set ->scx_sched to @sch */ +void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) +{ + struct cgroup *pos; + struct cgroup_subsys_state *css; + + cgroup_for_each_live_descendant_pre(pos, css, cgrp) + rcu_assign_pointer(pos->scx_sched, sch); +} + +static DECLARE_WAIT_QUEUE_HEAD(scx_unlink_waitq); + +void drain_descendants(struct scx_sched *sch) +{ + /* + * Child scheds that finished the critical part of disabling will take + * themselves off @sch->children. Wait for it to drain. As propagation + * is recursive, empty @sch->children means that all proper descendant + * scheds reached unlinking stage. + */ + wait_event(scx_unlink_waitq, list_empty(&sch->children)); +} + +static void scx_fail_parent(struct scx_sched *sch, + struct task_struct *failed, s32 fail_code) +{ + struct scx_sched *parent = scx_parent(sch); + struct scx_task_iter sti; + struct task_struct *p; + + scx_error(parent, "ops.init_task() failed (%d) for %s[%d] while disabling a sub-scheduler", + fail_code, failed->comm, failed->pid); + + /* + * Once $parent is bypassed, it's safe to put SCX_TASK_NONE tasks into + * it. This may cause downstream failures on the BPF side but $parent is + * dying anyway. + */ + scx_bypass(parent, true); + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + if (scx_task_on_sched(parent, p)) + continue; + + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + scx_disable_and_exit_task(sch, p); + scx_set_task_sched(p, parent); + } + } + scx_task_iter_stop(&sti); +} + +void scx_sub_disable(struct scx_sched *sch) +{ + struct scx_sched *parent = scx_parent(sch); + struct scx_task_iter sti; + struct task_struct *p; + int ret; + + /* + * Guarantee forward progress and wait for descendants to be disabled. + * To limit disruptions, $parent is not bypassed. Tasks are fully + * prepped and then inserted back into $parent. + */ + scx_bypass(sch, true); + drain_descendants(sch); + + /* + * Here, every runnable task is guaranteed to make forward progress and + * we can safely use blocking synchronization constructs. Actually + * disable ops. + */ + mutex_lock(&scx_enable_mutex); + percpu_down_write(&scx_fork_rwsem); + scx_cgroup_lock(); + + set_cgroup_sched(sch_cgroup(sch), parent); + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + struct rq *rq; + struct rq_flags rf; + + /* filter out duplicate visits */ + if (scx_task_on_sched(parent, p)) + continue; + + /* + * By the time control reaches here, all descendant schedulers + * should already have been disabled. + */ + WARN_ON_ONCE(!scx_task_on_sched(sch, p)); + + /* + * @p is pinned by the iter: css_task_iter_next() takes a + * reference and holds it until the next iter_next() call, so + * @p->usage is guaranteed > 0. + */ + get_task_struct(p); + + scx_task_iter_unlock(&sti); + + /* + * $p is READY or ENABLED on @sch. Initialize for $parent, + * disable and exit from @sch, and then switch over to $parent. + * + * If a task fails to initialize for $parent, the only available + * action is disabling $parent too. While this allows disabling + * of a child sched to cause the parent scheduler to fail, the + * failure can only originate from ops.init_task() of the + * parent. A child can't directly affect the parent through its + * own failures. + */ + ret = __scx_init_task(parent, p, false); + if (ret) { + scx_fail_parent(sch, p, ret); + put_task_struct(p); + break; + } + + rq = task_rq_lock(p, &rf); + + if (scx_get_task_state(p) == SCX_TASK_DEAD) { + /* + * sched_ext_dead() raced us between __scx_init_task() + * and this rq lock and ran exit_task() on @sch (the + * sched @p was on at that point), not on $parent. + * $parent's just-completed init is owed an exit_task() + * and we issue it here. + */ + scx_sub_init_cancel_task(parent, p); + task_rq_unlock(rq, p, &rf); + put_task_struct(p); + continue; + } + + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + /* + * $p is initialized for $parent and still attached to + * @sch. Disable and exit for @sch, switch over to + * $parent, override the state to READY to account for + * $p having already been initialized, and then enable. + */ + scx_disable_and_exit_task(sch, p); + scx_set_task_state(p, SCX_TASK_INIT_BEGIN); + scx_set_task_state(p, SCX_TASK_INIT); + scx_set_task_sched(p, parent); + scx_set_task_state(p, SCX_TASK_READY); + scx_enable_task(parent, p); + } + + task_rq_unlock(rq, p, &rf); + put_task_struct(p); + } + scx_task_iter_stop(&sti); + + scx_disable_dump(sch); + + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); + + /* + * All tasks are moved off of @sch but there may still be on-going + * operations (e.g. ops.select_cpu()). Drain them by flushing RCU. Use + * the expedited version as ancestors may be waiting in bypass mode. + * Also, tell the parent that there is no need to keep running bypass + * DSQs for us. + */ + synchronize_rcu_expedited(); + scx_disable_bypass_dsp(sch); + + scx_unlink_sched(sch); + + mutex_unlock(&scx_enable_mutex); + + /* + * @sch is now unlinked from the parent's children list. Notify and call + * ops.sub_detach/exit(). Note that ops.sub_detach/exit() must be called + * after unlinking and releasing all locks. See scx_claim_exit(). + */ + wake_up_all(&scx_unlink_waitq); + + if (parent->ops.sub_detach && sch->sub_attached) { + struct scx_sub_detach_args sub_detach_args = { + .ops = &sch->ops, + .cgroup_path = sch->cgrp_path, + }; + SCX_CALL_OP(parent, sub_detach, NULL, + &sub_detach_args); + } + + scx_log_sched_disable(sch); + + if (sch->ops.exit) + SCX_CALL_OP(sch, exit, NULL, sch->exit_info); + if (sch->sub_kset) + kobject_del(&sch->sub_kset->kobj); + kobject_del(&sch->kobj); +} + +/* verify that a scheduler can be attached to @cgrp and return the parent */ +static struct scx_sched *find_parent_sched(struct cgroup *cgrp) +{ + struct scx_sched *parent = cgrp->scx_sched; + struct scx_sched *pos; + + lockdep_assert_held(&scx_sched_lock); + + /* can't attach twice to the same cgroup */ + if (parent->cgrp == cgrp) + return ERR_PTR(-EBUSY); + + /* does $parent allow sub-scheds? */ + if (!parent->ops.sub_attach) + return ERR_PTR(-EOPNOTSUPP); + + /* can't insert between $parent and its exiting children */ + list_for_each_entry(pos, &parent->children, sibling) + if (cgroup_is_descendant(pos->cgrp, cgrp)) + return ERR_PTR(-EBUSY); + + return parent; +} + +static bool assert_task_ready_or_enabled(struct task_struct *p) +{ + u32 state = scx_get_task_state(p); + + switch (state) { + case SCX_TASK_READY: + case SCX_TASK_ENABLED: + return true; + default: + WARN_ONCE(true, "sched_ext: Invalid task state %d for %s[%d] during enabling sub sched", + state, p->comm, p->pid); + return false; + } +} + +void scx_sub_enable_workfn(struct kthread_work *work) +{ + struct scx_enable_cmd *cmd = container_of(work, struct scx_enable_cmd, work); + struct sched_ext_ops *ops = cmd->ops; + struct cgroup *cgrp; + struct scx_sched *parent, *sch; + struct scx_task_iter sti; + struct task_struct *p; + s32 i, ret; + + mutex_lock(&scx_enable_mutex); + + if (!scx_enabled()) { + ret = -ENODEV; + goto out_unlock; + } + + /* See scx_root_enable_workfn() for the @ops->priv check. */ + if (rcu_access_pointer(ops->priv)) { + ret = -EBUSY; + goto out_unlock; + } + + cgrp = cgroup_get_from_id(ops->sub_cgroup_id); + if (IS_ERR(cgrp)) { + ret = PTR_ERR(cgrp); + goto out_unlock; + } + + raw_spin_lock_irq(&scx_sched_lock); + parent = find_parent_sched(cgrp); + if (IS_ERR(parent)) { + raw_spin_unlock_irq(&scx_sched_lock); + ret = PTR_ERR(parent); + goto out_put_cgrp; + } + kobject_get(&parent->kobj); + raw_spin_unlock_irq(&scx_sched_lock); + + /* scx_alloc_and_add_sched() consumes @cgrp whether it succeeds or not */ + sch = scx_alloc_and_add_sched(cmd, cgrp, parent); + kobject_put(&parent->kobj); + if (IS_ERR(sch)) { + ret = PTR_ERR(sch); + goto out_unlock; + } + + ret = scx_link_sched(sch); + if (ret) + goto err_disable; + + if (sch->level >= SCX_SUB_MAX_DEPTH) { + scx_error(sch, "max nesting depth %d violated", + SCX_SUB_MAX_DEPTH); + goto err_disable; + } + + if (sch->ops.init) { + ret = SCX_CALL_OP_RET(sch, init, NULL); + if (ret) { + ret = scx_ops_sanitize_err(sch, "init", ret); + scx_error(sch, "ops.init() failed (%d)", ret); + goto err_disable; + } + sch->exit_info->flags |= SCX_EFLAG_INITIALIZED; + } + + ret = scx_arena_pool_init(sch); + if (ret) + goto err_disable; + + ret = scx_set_cmask_scratch_alloc(sch); + if (ret) + goto err_disable; + + if (scx_validate_ops(sch, ops)) + goto err_disable; + + struct scx_sub_attach_args sub_attach_args = { + .ops = &sch->ops, + .cgroup_path = sch->cgrp_path, + }; + + ret = SCX_CALL_OP_RET(parent, sub_attach, NULL, + &sub_attach_args); + if (ret) { + ret = scx_ops_sanitize_err(sch, "sub_attach", ret); + scx_error(sch, "parent rejected (%d)", ret); + goto err_disable; + } + sch->sub_attached = true; + + scx_bypass(sch, true); + + for (i = SCX_OPI_BEGIN; i < SCX_OPI_END; i++) + if (((void (**)(void))ops)[i]) + set_bit(i, sch->has_op); + + percpu_down_write(&scx_fork_rwsem); + scx_cgroup_lock(); + + /* + * Set cgroup->scx_sched's and check CSS_ONLINE. Either we see + * !CSS_ONLINE or scx_cgroup_lifetime_notify() sees and shoots us down. + */ + set_cgroup_sched(sch_cgroup(sch), sch); + if (!(cgrp->self.flags & CSS_ONLINE)) { + scx_error(sch, "cgroup is not online"); + goto err_unlock_and_disable; + } + + /* + * Initialize tasks for the new child $sch without exiting them for + * $parent so that the tasks can always be reverted back to $parent + * sched on child init failure. + */ + WARN_ON_ONCE(scx_enabling_sub_sched); + scx_enabling_sub_sched = sch; + + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + struct rq *rq; + struct rq_flags rf; + + /* + * Task iteration may visit the same task twice when racing + * against exiting. Use %SCX_TASK_SUB_INIT to mark tasks which + * finished __scx_init_task() and skip if set. + * + * A task may exit and get freed between __scx_init_task() + * completion and scx_enable_task(). In such cases, + * scx_disable_and_exit_task() must exit the task for both the + * parent and child scheds. + */ + if (p->scx.flags & SCX_TASK_SUB_INIT) + continue; + + /* @p is pinned by the iter; see scx_sub_disable() */ + get_task_struct(p); + + if (!assert_task_ready_or_enabled(p)) { + ret = -EINVAL; + goto abort; + } + + scx_task_iter_unlock(&sti); + + /* + * As $p is still on $parent, it can't be transitioned to INIT. + * Let's worry about task state later. Use __scx_init_task(). + */ + ret = __scx_init_task(sch, p, false); + if (ret) + goto abort; + + rq = task_rq_lock(p, &rf); + + if (scx_get_task_state(p) == SCX_TASK_DEAD) { + /* + * sched_ext_dead() raced us between __scx_init_task() + * and this rq lock and ran exit_task() on $parent (the + * sched @p was on at that point), not on @sch. @sch's + * just-completed init is owed an exit_task() and we + * issue it here. + */ + scx_sub_init_cancel_task(sch, p); + task_rq_unlock(rq, p, &rf); + put_task_struct(p); + continue; + } + + p->scx.flags |= SCX_TASK_SUB_INIT; + task_rq_unlock(rq, p, &rf); + + put_task_struct(p); + } + scx_task_iter_stop(&sti); + + /* + * All tasks are prepped. Disable/exit tasks for $parent and enable for + * the new @sch. + */ + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + /* + * Use clearing of %SCX_TASK_SUB_INIT to detect and skip + * duplicate iterations. + */ + if (!(p->scx.flags & SCX_TASK_SUB_INIT)) + continue; + + scoped_guard (sched_change, p, DEQUEUE_SAVE | DEQUEUE_MOVE) { + /* + * $p must be either READY or ENABLED. If ENABLED, + * __scx_disabled_and_exit_task() first disables and + * makes it READY. However, after exiting $p, it will + * leave $p as READY. + */ + assert_task_ready_or_enabled(p); + __scx_disable_and_exit_task(parent, p); + + /* + * $p is now only initialized for @sch and READY, which + * is what we want. Assign it to @sch and enable. + */ + scx_set_task_sched(p, sch); + scx_enable_task(sch, p); + + p->scx.flags &= ~SCX_TASK_SUB_INIT; + } + } + scx_task_iter_stop(&sti); + + scx_enabling_sub_sched = NULL; + + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); + + scx_bypass(sch, false); + + pr_info("sched_ext: BPF sub-scheduler \"%s\" enabled\n", sch->ops.name); + kobject_uevent(&sch->kobj, KOBJ_ADD); + ret = 0; + goto out_unlock; + +out_put_cgrp: + cgroup_put(cgrp); +out_unlock: + mutex_unlock(&scx_enable_mutex); + cmd->ret = ret; + return; + +abort: + put_task_struct(p); + scx_task_iter_stop(&sti); + + /* + * Undo __scx_init_task() for tasks we marked. scx_enable_task() never + * ran for @sch on them, so calling scx_disable_task() here would invoke + * ops.disable() without a matching ops.enable(). scx_enabling_sub_sched + * must stay set until SUB_INIT is cleared from every marked task - + * scx_disable_and_exit_task() reads it when a task exits concurrently. + */ + scx_task_iter_start(&sti, sch->cgrp); + while ((p = scx_task_iter_next_locked(&sti))) { + if (p->scx.flags & SCX_TASK_SUB_INIT) { + scx_sub_init_cancel_task(sch, p); + p->scx.flags &= ~SCX_TASK_SUB_INIT; + } + } + scx_task_iter_stop(&sti); + scx_enabling_sub_sched = NULL; +err_unlock_and_disable: + /* we'll soon enter disable path, keep bypass on */ + scx_cgroup_unlock(); + percpu_up_write(&scx_fork_rwsem); +err_disable: + mutex_unlock(&scx_enable_mutex); + scx_flush_disable_work(sch); + cmd->ret = 0; +} + +static s32 scx_cgroup_lifetime_notify(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct cgroup *cgrp = data; + struct cgroup *parent = cgroup_parent(cgrp); + + if (!cgroup_on_dfl(cgrp)) + return NOTIFY_OK; + + switch (action) { + case CGROUP_LIFETIME_ONLINE: + /* inherit ->scx_sched from $parent */ + if (parent) + rcu_assign_pointer(cgrp->scx_sched, parent->scx_sched); + break; + case CGROUP_LIFETIME_OFFLINE: + /* if there is a sched attached, shoot it down */ + if (cgrp->scx_sched && cgrp->scx_sched->cgrp == cgrp) + scx_exit(cgrp->scx_sched, SCX_EXIT_UNREG_KERN, + SCX_ECODE_RSN_CGROUP_OFFLINE, + "cgroup %llu going offline", cgroup_id(cgrp)); + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block scx_cgroup_lifetime_nb = { + .notifier_call = scx_cgroup_lifetime_notify, +}; + +static s32 __init scx_cgroup_lifetime_notifier_init(void) +{ + return blocking_notifier_chain_register(&cgroup_lifetime_notifier, + &scx_cgroup_lifetime_nb); +} +core_initcall(scx_cgroup_lifetime_notifier_init); + +void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog) +{ + struct scx_sched *sch; + + guard(rcu)(); + sch = scx_prog_sched(prog->aux); + if (unlikely(!sch)) + return; + + scx_error(sch, "dispatch recursion detected"); +} + +__bpf_kfunc_start_defs(); + +/** + * scx_bpf_sub_dispatch - Trigger dispatching on a child scheduler + * @cgroup_id: cgroup ID of the child scheduler to dispatch + * @aux: implicit BPF argument to access bpf_prog_aux hidden from BPF progs + * + * Allows a parent scheduler to trigger dispatching on one of its direct + * child schedulers. The child scheduler runs its dispatch operation to + * move tasks from dispatch queues to the local runqueue. + * + * Returns: true on success, false if cgroup_id is invalid, not a direct + * child, or caller lacks dispatch permission. + */ +__bpf_kfunc bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux) +{ + struct rq *this_rq = this_rq(); + struct scx_sched *parent, *child; + + guard(rcu)(); + parent = scx_prog_sched(aux); + if (unlikely(!parent)) + return false; + + child = scx_find_sub_sched(cgroup_id); + + if (unlikely(!child)) + return false; + + if (unlikely(scx_parent(child) != parent)) { + scx_error(parent, "trying to dispatch a distant sub-sched on cgroup %llu", + cgroup_id); + return false; + } + + return scx_dispatch_sched(child, this_rq, this_rq->scx.sub_dispatch_prev, + true); +} + +__bpf_kfunc_end_defs(); + +#endif /* CONFIG_EXT_SUB_SCHED */ diff --git a/kernel/sched/ext/sub.h b/kernel/sched/ext/sub.h new file mode 100644 index 000000000000..460a9fd196dc --- /dev/null +++ b/kernel/sched/ext/sub.h @@ -0,0 +1,161 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * BPF extensible scheduler class: Documentation/scheduler/sched-ext.rst + * + * Sub-scheduler hierarchy support. + * + * Copyright (c) 2026 Meta Platforms, Inc. and affiliates. + * Copyright (c) 2026 Tejun Heo + */ +#ifndef _KERNEL_SCHED_EXT_SUB_H +#define _KERNEL_SCHED_EXT_SUB_H + +#include "internal.h" +#include "cid.h" + +#ifdef CONFIG_EXT_SUB_SCHED + +struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root); +void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch); +struct cgroup *sch_cgroup(struct scx_sched *sch); +void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch); +void scx_pstack_recursion_on_dispatch(struct bpf_prog *prog); +void drain_descendants(struct scx_sched *sch); +void scx_sub_disable(struct scx_sched *sch); +void scx_sub_enable_workfn(struct kthread_work *work); +bool scx_bpf_sub_dispatch(u64 cgroup_id, const struct bpf_prog_aux *aux); + +#else /* CONFIG_EXT_SUB_SCHED */ + +static inline struct scx_sched *scx_next_descendant_pre(struct scx_sched *pos, struct scx_sched *root) { return pos ? NULL : root; } +static inline void scx_set_task_sched(struct task_struct *p, struct scx_sched *sch) {} +static inline struct cgroup *sch_cgroup(struct scx_sched *sch) { return NULL; } +static inline void set_cgroup_sched(struct cgroup *cgrp, struct scx_sched *sch) {} +static inline void drain_descendants(struct scx_sched *sch) { } +static inline void scx_sub_disable(struct scx_sched *sch) { } + +#endif /* CONFIG_EXT_SUB_SCHED */ + +/** + * scx_for_each_descendant_pre - pre-order walk of a sched's descendants + * @pos: iteration cursor + * @root: sched to walk the descendants of + * + * Walk @root's descendants. @root is included in the iteration and the first + * node to be visited. Must be called with either scx_enable_mutex or + * scx_sched_lock held. + */ +#define scx_for_each_descendant_pre(pos, root) \ + for ((pos) = scx_next_descendant_pre(NULL, (root)); (pos); \ + (pos) = scx_next_descendant_pre((pos), (root))) + +/* + * One user of this function is scx_bpf_dispatch() which can be called + * recursively as sub-sched dispatches nest. Always inline to reduce stack usage + * from the call frame. + */ +static __always_inline bool +scx_dispatch_sched(struct scx_sched *sch, struct rq *rq, + struct task_struct *prev, bool nested) +{ + struct scx_dsp_ctx *dspc = &this_cpu_ptr(sch->pcpu)->dsp_ctx; + int nr_loops = SCX_DSP_MAX_LOOPS; + s32 cpu = cpu_of(rq); + bool prev_on_sch = (prev->sched_class == &ext_sched_class) && + scx_task_on_sched(sch, prev); + + if (scx_consume_global_dsq(sch, rq)) + return true; + + if (scx_bypass_dsp_enabled(sch)) { + /* if @sch is bypassing, only the bypass DSQs are active */ + if (scx_bypassing(sch, cpu)) + return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0); + +#ifdef CONFIG_EXT_SUB_SCHED + /* + * If @sch isn't bypassing but its children are, @sch is + * responsible for making forward progress for both its own + * tasks that aren't bypassing and the bypassing descendants' + * tasks. The following implements a simple built-in behavior - + * let each CPU try to run the bypass DSQ every Nth time. + * + * Later, if necessary, we can add an ops flag to suppress the + * auto-consumption and a kfunc to consume the bypass DSQ and, + * so that the BPF scheduler can fully control scheduling of + * bypassed tasks. + */ + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu); + + if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) && + scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0)) { + __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1); + return true; + } +#endif /* CONFIG_EXT_SUB_SCHED */ + } + + if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq)) + return false; + + dspc->rq = rq; + + /* + * The dispatch loop. Because scx_flush_dispatch_buf() may drop the rq + * lock, the local DSQ might still end up empty after a successful + * ops.dispatch(). If the local DSQ is empty even after ops.dispatch() + * produced some tasks, retry. The BPF scheduler may depend on this + * looping behavior to simplify its implementation. + */ + do { + dspc->nr_tasks = 0; + + if (nested) { + SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu), + prev_on_sch ? prev : NULL); + } else { + /* stash @prev so that nested invocations can access it */ + rq->scx.sub_dispatch_prev = prev; + SCX_CALL_OP(sch, dispatch, rq, scx_cpu_arg(cpu), + prev_on_sch ? prev : NULL); + rq->scx.sub_dispatch_prev = NULL; + } + + scx_flush_dispatch_buf(sch, rq); + + if ((prev->scx.flags & SCX_TASK_QUEUED) && prev->scx.slice) { + rq->scx.flags |= SCX_RQ_BAL_KEEP; + return true; + } + if (rq->scx.local_dsq.nr) + return true; + if (scx_consume_global_dsq(sch, rq)) + return true; + + /* + * ops.dispatch() can trap us in this loop by repeatedly + * dispatching ineligible tasks. Break out once in a while to + * allow the watchdog to run. As IRQ can't be enabled in + * balance(), we want to complete this scheduling cycle and then + * start a new one. IOW, we want to call resched_curr() on the + * next, most likely idle, task, not the current one. Use + * __scx_bpf_kick_cpu() for deferred kicking. + */ + if (unlikely(!--nr_loops)) { + scx_kick_cpu(sch, cpu, 0); + break; + } + } while (dspc->nr_tasks); + + /* + * Prevent the CPU from going idle while bypassed descendants have tasks + * queued. Without this fallback, bypassed tasks could stall if the host + * scheduler's ops.dispatch() doesn't yield any tasks. + */ + if (scx_bypass_dsp_enabled(sch)) + return scx_consume_dispatch_q(sch, rq, scx_bypass_dsq(sch, cpu), 0); + + return false; +} + +#endif /* _KERNEL_SCHED_EXT_SUB_H */ -- 2.54.0