From: Tejun Heo <tj@kernel.org>
To: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev
Cc: void@manifault.com, andrea.righi@linux.dev, changwoo@igalia.com,
emil@etsalapatis.com, Tejun Heo <tj@kernel.org>
Subject: [PATCH 23/34] sched_ext: Implement hierarchical bypass mode
Date: Wed, 21 Jan 2026 13:11:29 -1000 [thread overview]
Message-ID: <20260121231140.832332-24-tj@kernel.org> (raw)
In-Reply-To: <20260121231140.832332-1-tj@kernel.org>
When a sub-scheduler enters bypass mode, its tasks must be scheduled by an
ancestor to guarantee forward progress. Tasks from bypassing descendants are
queued in the bypass DSQs of the nearest non-bypassing ancestor, or the root
scheduler if all ancestors are bypassing. This requires coordination between
bypassing schedulers and their hosts.
Add bypass_enq_target_dsq() to find the correct bypass DSQ by walking up the
hierarchy until reaching a non-bypassing ancestor. When a sub-scheduler starts
bypassing, all its runnable tasks are re-enqueued after scx_bypassing() is set,
ensuring proper migration to ancestor bypass DSQs.
Update scx_dispatch_sched() to handle hosting bypassed descendants. When a
scheduler is not bypassing but has bypassing descendants, it must schedule both
its own tasks and bypassed descendant tasks. A simple policy is implemented
where every Nth dispatch attempt (SCX_BYPASS_HOST_NTH=2) consumes from the
bypass DSQ. A fallback consumption is also added at the end of dispatch to
ensure bypassed tasks make progress even when normal scheduling is idle.
Update enable_bypass_dsp() and disable_bypass_dsp() to increment
bypass_dsp_enable_depth on both the bypassing scheduler and its parent host,
ensuring both can detect that bypass dispatch is active through
bypass_dsp_enabled().
Add SCX_EV_SUB_BYPASS_DISPATCH event counter to track scheduling of bypassed
descendant tasks.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
kernel/sched/ext.c | 96 ++++++++++++++++++++++++++++++++++---
kernel/sched/ext_internal.h | 11 +++++
2 files changed, 100 insertions(+), 7 deletions(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 084e346b0f0e..6087083e8b70 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -356,6 +356,27 @@ static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu)
return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
}
+static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu)
+{
+#ifdef CONFIG_EXT_SUB_SCHED
+ /*
+ * If @sch is a sub-sched which is bypassing, its tasks should go into
+ * the bypass DSQs of the nearest ancestor which is not bypassing. The
+ * not-bypassing ancestor is responsible for scheduling all tasks from
+ * bypassing sub-trees. If all ancestors including root are bypassing,
+ * @p should go to the root's bypass DSQs.
+ *
+ * Whenever a sched starts bypassing, all runnable tasks in its subtree
+ * are re-enqueued after scx_bypassing() is turned on, guaranteeing that
+ * all tasks are transferred to the right DSQs.
+ */
+ while (scx_parent(sch) && scx_bypassing(sch, cpu))
+ sch = scx_parent(sch);
+#endif /* CONFIG_EXT_SUB_SCHED */
+
+ return bypass_dsq(sch, cpu);
+}
+
/**
* bypass_dsp_enabled - Check if bypass dispatch path is enabled
* @sch: scheduler to check
@@ -1585,7 +1606,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
dsq = find_global_dsq(sch, p);
goto enqueue;
bypass:
- dsq = bypass_dsq(sch, task_cpu(p));
+ dsq = bypass_enq_target_dsq(sch, task_cpu(p));
goto enqueue;
enqueue:
@@ -2326,8 +2347,31 @@ static bool scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
if (consume_global_dsq(sch, rq))
return true;
- if (bypass_dsp_enabled(sch) && scx_bypassing(sch, cpu))
- return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
+ if (bypass_dsp_enabled(sch)) {
+ struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
+
+ /* if @sch is bypassing, only the bypass DSQs are active */
+ if (scx_bypassing(sch, cpu))
+ return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
+
+ /*
+ * If @sch isn't bypassing but its children are, @sch is
+ * responsible for making forward progress for both its own
+ * tasks that aren't bypassing and the bypassing descendants'
+ * tasks. The following implements a simple built-in behavior -
+ * let each CPU try to run the bypass DSQ every Nth time.
+ *
+ * Later, if necessary, we can add an ops flag to suppress the
+ * auto-consumption and a kfunc to consume the bypass DSQ and,
+ * so that the BPF scheduler can fully control scheduling of
+ * bypassed tasks.
+ */
+ if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
+ consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu))) {
+ __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
+ return true;
+ }
+ }
if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
return false;
@@ -2373,6 +2417,14 @@ static bool scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
}
} while (dspc->nr_tasks);
+ /*
+ * Prevent the CPU from going idle while bypassed descendants have tasks
+ * queued. Without this fallback, bypassed tasks could stall if the host
+ * scheduler's ops.dispatch() doesn't yield any tasks.
+ */
+ if (bypass_dsp_enabled(sch))
+ return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
+
return false;
}
@@ -3892,6 +3944,7 @@ static ssize_t scx_attr_events_show(struct kobject *kobj,
at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DISPATCH);
at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_ACTIVATE);
at += scx_attr_event_show(buf, at, &events, SCX_EV_INSERT_NOT_OWNED);
+ at += scx_attr_event_show(buf, at, &events, SCX_EV_SUB_BYPASS_DISPATCH);
return at;
}
SCX_ATTR(events);
@@ -4252,6 +4305,7 @@ static bool dec_bypass_depth(struct scx_sched *sch)
{
lockdep_assert_held(&scx_bypass_lock);
+
WARN_ON_ONCE(sch->bypass_depth < 1);
WRITE_ONCE(sch->bypass_depth, sch->bypass_depth - 1);
if (sch->bypass_depth != 0)
@@ -4265,6 +4319,7 @@ static bool dec_bypass_depth(struct scx_sched *sch)
static void enable_bypass_dsp(struct scx_sched *sch)
{
+ struct scx_sched *host = scx_parent(sch) ?: sch;
u32 intv_us = READ_ONCE(scx_bypass_lb_intv_us);
s32 ret;
@@ -4276,14 +4331,35 @@ static void enable_bypass_dsp(struct scx_sched *sch)
return;
/*
- * The LB timer will stop running if bypass_arm_depth is 0. Increment
- * before starting the LB timer.
+ * When a sub-sched bypasses, its tasks are queued on the bypass DSQs of
+ * the nearest non-bypassing ancestor or root. As enable_bypass_dsp() is
+ * called iff @sch is not already bypassed due to an ancestor bypassing,
+ * we can assume that the parent is not bypassing and thus will be the
+ * host of the bypass DSQs.
+ *
+ * While the situation may change in the future, the following
+ * guarantees that the nearest non-bypassing ancestor or root has bypass
+ * dispatch enabled while a descendant is bypassing, which is all that's
+ * required.
+ *
+ * bypass_dsp_enabled() test is used to detemrine whether to enter the
+ * bypass dispatch handling path from both bypassing and hosting scheds.
+ * Bump enable depth on both @sch and bypass dispatch host.
*/
ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
WARN_ON_ONCE(ret <= 0);
- if (intv_us && !timer_pending(&sch->bypass_lb_timer))
- mod_timer(&sch->bypass_lb_timer,
+ if (host != sch) {
+ ret = atomic_inc_return(&host->bypass_dsp_enable_depth);
+ WARN_ON_ONCE(ret <= 0);
+ }
+
+ /*
+ * The LB timer will stop running if bypass dispatch is disabled. Start
+ * after enabling bypass dispatch.
+ */
+ if (intv_us && !timer_pending(&host->bypass_lb_timer))
+ mod_timer(&host->bypass_lb_timer,
jiffies + usecs_to_jiffies(intv_us));
}
@@ -4297,6 +4373,11 @@ static void disable_bypass_dsp(struct scx_sched *sch)
ret = atomic_dec_return(&sch->bypass_dsp_enable_depth);
WARN_ON_ONCE(ret < 0);
+
+ if (scx_parent(sch)) {
+ ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth);
+ WARN_ON_ONCE(ret < 0);
+ }
}
/**
@@ -5061,6 +5142,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
scx_dump_event(s, &events, SCX_EV_BYPASS_DISPATCH);
scx_dump_event(s, &events, SCX_EV_BYPASS_ACTIVATE);
scx_dump_event(s, &events, SCX_EV_INSERT_NOT_OWNED);
+ scx_dump_event(s, &events, SCX_EV_SUB_BYPASS_DISPATCH);
if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
memcpy(ei->dump + dump_len - sizeof(trunc_marker),
diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
index 26b7ab28de44..db2065ec94ee 100644
--- a/kernel/sched/ext_internal.h
+++ b/kernel/sched/ext_internal.h
@@ -24,6 +24,8 @@ enum scx_consts {
*/
SCX_TASK_ITER_BATCH = 32,
+ SCX_BYPASS_HOST_NTH = 2,
+
SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC,
SCX_BYPASS_LB_DONOR_PCT = 125,
SCX_BYPASS_LB_MIN_DELTA_DIV = 4,
@@ -923,6 +925,12 @@ struct scx_event_stats {
* scheduler.
*/
s64 SCX_EV_INSERT_NOT_OWNED;
+
+ /*
+ * The number of times tasks from bypassing descendants are scheduled
+ * from sub_bypass_dsq's.
+ */
+ s64 SCX_EV_SUB_BYPASS_DISPATCH;
};
enum scx_sched_pcpu_flags {
@@ -940,6 +948,9 @@ struct scx_sched_pcpu {
struct scx_event_stats event_stats;
struct scx_dispatch_q bypass_dsq;
+#ifdef CONFIG_EXT_SUB_SCHED
+ u32 bypass_host_seq;
+#endif
};
struct scx_sched {
--
2.52.0
next prev parent reply other threads:[~2026-01-21 23:12 UTC|newest]
Thread overview: 40+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-01-21 23:11 [PATCHSET v1 sched_ext/for-6.20] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-01-21 23:11 ` [PATCH 01/34] sched_ext: Implement cgroup subtree iteration for scx_task_iter Tejun Heo
2026-01-21 23:11 ` [PATCH 02/34] sched_ext: Add @kargs to scx_fork() Tejun Heo
2026-01-21 23:11 ` [PATCH 03/34] sched/core: Swap the order between sched_post_fork() and cgroup_post_fork() Tejun Heo
2026-01-21 23:11 ` [PATCH 04/34] cgroup: Expose some cgroup helpers Tejun Heo
2026-01-21 23:11 ` [PATCH 05/34] sched_ext: Update p->scx.disallow warning in scx_init_task() Tejun Heo
2026-01-21 23:11 ` [PATCH 06/34] sched_ext: Reorganize enable/disable path for multi-scheduler support Tejun Heo
2026-01-21 23:11 ` [PATCH 07/34] sched_ext: Introduce cgroup sub-sched support Tejun Heo
2026-01-21 23:11 ` [PATCH 08/34] sched_ext: Introduce scx_task_sched[_rcu]() Tejun Heo
2026-01-21 23:11 ` [PATCH 09/34] sched_ext: Introduce scx_prog_sched() Tejun Heo
2026-01-21 23:11 ` [PATCH 10/34] sched_ext: Enforce scheduling authority in dispatch and select_cpu operations Tejun Heo
2026-01-21 23:11 ` [PATCH 11/34] sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime Tejun Heo
2026-01-21 23:11 ` [PATCH 12/34] sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler Tejun Heo
2026-01-21 23:11 ` [PATCH 13/34] sched_ext: Refactor task init/exit helpers Tejun Heo
2026-01-21 23:11 ` [PATCH 14/34] sched_ext: Make scx_prio_less() handle multiple schedulers Tejun Heo
2026-01-21 23:11 ` [PATCH 15/34] sched_ext: Move default slice to per-scheduler field Tejun Heo
2026-01-21 23:11 ` [PATCH 16/34] sched_ext: Move aborting flag " Tejun Heo
2026-01-21 23:11 ` [PATCH 17/34] sched_ext: Move bypass_dsq into scx_sched_pcpu Tejun Heo
2026-01-21 23:11 ` [PATCH 18/34] sched_ext: Move bypass state into scx_sched Tejun Heo
2026-01-21 23:11 ` [PATCH 19/34] sched_ext: Prepare bypass mode for hierarchical operation Tejun Heo
2026-01-21 23:11 ` [PATCH 20/34] sched_ext: Factor out scx_dispatch_sched() Tejun Heo
2026-01-21 23:11 ` [PATCH 21/34] sched_ext: When calling ops.dispatch() @prev must be on the same scx_sched Tejun Heo
2026-01-21 23:11 ` [PATCH 22/34] sched_ext: Separate bypass dispatch enabling from bypass depth tracking Tejun Heo
2026-01-21 23:11 ` Tejun Heo [this message]
2026-01-21 23:11 ` [PATCH 24/34] sched_ext: Dispatch from all scx_sched instances Tejun Heo
2026-01-21 23:11 ` [PATCH 25/34] sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_sched Tejun Heo
2026-01-21 23:11 ` [PATCH 26/34] sched_ext: Make watchdog sub-sched aware Tejun Heo
2026-01-21 23:11 ` [PATCH 27/34] sched_ext: Convert scx_dump_state() spinlock to raw spinlock Tejun Heo
2026-01-21 23:11 ` [PATCH 28/34] sched_ext: Support dumping multiple schedulers and add scheduler identification Tejun Heo
2026-01-21 23:11 ` [PATCH 29/34] sched_ext: Implement cgroup sub-sched enabling and disabling Tejun Heo
2026-01-21 23:11 ` [PATCH 30/34] sched_ext: Add scx_sched back pointer to scx_sched_pcpu Tejun Heo
2026-01-21 23:11 ` [PATCH 31/34] sched_ext: Make scx_bpf_reenqueue_local() sub-sched aware Tejun Heo
2026-01-21 23:11 ` [PATCH 32/34] sched_ext: Factor out scx_link_sched() and scx_unlink_sched() Tejun Heo
2026-01-21 23:11 ` [PATCH 33/34] sched_ext: Add rhashtable lookup for sub-schedulers Tejun Heo
2026-01-21 23:11 ` [PATCH 34/34] sched_ext: Add basic building blocks for nested sub-scheduler dispatching Tejun Heo
-- strict thread matches above, loose matches on Subject: below --
2026-02-25 5:00 [PATCHSET v2 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-02-25 5:00 ` [PATCH 23/34] sched_ext: Implement hierarchical bypass mode Tejun Heo
2026-02-25 5:01 [PATCHSET v2 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-02-25 5:01 ` [PATCH 23/34] sched_ext: Implement hierarchical bypass mode Tejun Heo
2026-03-04 22:00 [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-03-04 22:01 ` [PATCH 23/34] sched_ext: Implement hierarchical bypass mode Tejun Heo
2026-03-06 7:03 ` Andrea Righi
2026-03-06 7:23 ` Andrea Righi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260121231140.832332-24-tj@kernel.org \
--to=tj@kernel.org \
--cc=andrea.righi@linux.dev \
--cc=changwoo@igalia.com \
--cc=emil@etsalapatis.com \
--cc=linux-kernel@vger.kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=void@manifault.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox