From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
void@manifault.com, changwoo@igalia.com, emil@etsalapatis.com
Subject: Re: [PATCH 23/34] sched_ext: Implement hierarchical bypass mode
Date: Fri, 6 Mar 2026 08:23:49 +0100
Message-ID: <aaqBBQPy0QYnWJbn@gpd4>
In-Reply-To: <20260304220119.4095551-24-tj@kernel.org>

On Wed, Mar 04, 2026 at 12:01:08PM -1000, Tejun Heo wrote:
> When a sub-scheduler enters bypass mode, its tasks must be scheduled by an
> ancestor to guarantee forward progress. Tasks from bypassing descendants are
> queued in the bypass DSQs of the nearest non-bypassing ancestor, or the root
> scheduler if all ancestors are bypassing. This requires coordination between
> bypassing schedulers and their hosts.
>
> Add bypass_enq_target_dsq() to find the correct bypass DSQ by walking up the
> hierarchy until reaching a non-bypassing ancestor. When a sub-scheduler starts
> bypassing, all its runnable tasks are re-enqueued after scx_bypassing() is set,
> ensuring proper migration to ancestor bypass DSQs.
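For readers following along, here is a minimal userspace sketch of the ancestor walk described above. It is not the kernel code: struct model_sched and model_bypass_host() are hypothetical names, and the per-CPU argument of scx_bypassing() is collapsed into a single flag for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scx_sched; not the kernel definition. */
struct model_sched {
	struct model_sched *parent;	/* NULL for the root scheduler */
	bool bypassing;			/* models scx_bypassing(sch, cpu) */
};

/*
 * Walk up while the current scheduler is bypassing and still has a parent.
 * The walk stops at the nearest non-bypassing ancestor, or at the root if
 * every scheduler on the path is bypassing.
 */
static struct model_sched *model_bypass_host(struct model_sched *sch)
{
	assert(sch != NULL);

	while (sch->parent && sch->bypassing)
		sch = sch->parent;
	return sch;
}
```

With a root -> mid -> leaf chain in which both mid and leaf are bypassing, model_bypass_host(&leaf) returns the root; if only leaf is bypassing, it returns mid.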
>
> Update scx_dispatch_sched() to handle hosting bypassed descendants. When a
> scheduler is not bypassing but has bypassing descendants, it must schedule both
> its own tasks and bypassed descendant tasks. A simple policy is implemented
> where every Nth dispatch attempt (SCX_BYPASS_HOST_NTH=2) consumes from the
> bypass DSQ. A fallback consumption is also added at the end of dispatch to
> ensure bypassed tasks make progress even when normal scheduling is idle.
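The every-Nth policy can be modelled in userspace as below. This is only a sketch: struct model_pcpu, MODEL_BYPASS_HOST_NTH and model_try_bypass_first() are hypothetical names that mirror bypass_host_seq and SCX_BYPASS_HOST_NTH from the patch, and the actual DSQ consumption is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors SCX_BYPASS_HOST_NTH = 2 from the patch. */
#define MODEL_BYPASS_HOST_NTH 2

/* Hypothetical per-CPU state; models scx_sched_pcpu::bypass_host_seq. */
struct model_pcpu {
	uint32_t bypass_host_seq;
};

/*
 * Returns true when this dispatch attempt should try the bypass DSQ first.
 * The counter is post-incremented on every call, whether or not the bypass
 * DSQ is selected, so attempts 0, N, 2N, ... are the ones that qualify.
 */
static bool model_try_bypass_first(struct model_pcpu *pcpu)
{
	assert(pcpu != NULL);

	return !(pcpu->bypass_host_seq++ % MODEL_BYPASS_HOST_NTH);
}
```

Starting from a zeroed counter with N = 2, successive calls return true, false, true, and so on, so every other dispatch attempt on a CPU tries the bypass DSQ before the host's own dispatch path.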
>
> Update enable_bypass_dsp() and disable_bypass_dsp() to increment
> bypass_dsp_enable_depth on both the bypassing scheduler and its parent host,
> ensuring both can detect that bypass dispatch is active through
> bypass_dsp_enabled().
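A rough userspace model of the paired depth accounting, using C11 atomics in place of the kernel's atomic_t. The names struct depth_sched and the model_*() helpers are hypothetical, the early-return checks of the real enable/disable functions are omitted, and treating "enabled" as "depth > 0" is an assumption about bypass_dsp_enabled(), which this patch does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scx_sched; not the kernel definition. */
struct depth_sched {
	struct depth_sched *parent;		/* NULL for the root scheduler */
	atomic_int bypass_dsp_enable_depth;
};

/* Bump the depth on @sch and, if @sch has a parent, on the parent host too. */
static void model_enable_bypass_dsp(struct depth_sched *sch)
{
	struct depth_sched *host = sch->parent ? sch->parent : sch;

	atomic_fetch_add(&sch->bypass_dsp_enable_depth, 1);
	if (host != sch)
		atomic_fetch_add(&host->bypass_dsp_enable_depth, 1);
}

/* Undo model_enable_bypass_dsp(); the depths must never go negative. */
static void model_disable_bypass_dsp(struct depth_sched *sch)
{
	int prev = atomic_fetch_sub(&sch->bypass_dsp_enable_depth, 1);

	assert(prev > 0);
	if (sch->parent) {
		prev = atomic_fetch_sub(&sch->parent->bypass_dsp_enable_depth, 1);
		assert(prev > 0);
	}
}

/* Assumed semantics: bypass dispatch is enabled while the depth is non-zero. */
static bool model_bypass_dsp_enabled(struct depth_sched *sch)
{
	return atomic_load(&sch->bypass_dsp_enable_depth) > 0;
}
```

Enabling on a child makes the check true on both the child and its parent, and the matching disable returns both depths to zero.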
>
> Add SCX_EV_SUB_BYPASS_DISPATCH event counter to track scheduling of bypassed
> descendant tasks.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> kernel/sched/ext.c | 97 ++++++++++++++++++++++++++++++++++---
> kernel/sched/ext_internal.h | 11 +++++
> 2 files changed, 101 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 6b07d97b0af6..2a19df67a66c 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -357,6 +357,27 @@ static struct scx_dispatch_q *bypass_dsq(struct scx_sched *sch, s32 cpu)
> return &per_cpu_ptr(sch->pcpu, cpu)->bypass_dsq;
> }
>
> +static struct scx_dispatch_q *bypass_enq_target_dsq(struct scx_sched *sch, s32 cpu)
> +{
> +#ifdef CONFIG_EXT_SUB_SCHED
> + /*
> + * If @sch is a sub-sched which is bypassing, its tasks should go into
> + * the bypass DSQs of the nearest ancestor which is not bypassing. The
> + * not-bypassing ancestor is responsible for scheduling all tasks from
> + * bypassing sub-trees. If all ancestors including root are bypassing,
> + * @p should go to the root's bypass DSQs.

Another nit: there is no @p in scope here, so maybe we should say "all tasks"
instead, for clarity.

Thanks,
-Andrea

> + *
> + * Whenever a sched starts bypassing, all runnable tasks in its subtree
> + * are re-enqueued after scx_bypassing() is turned on, guaranteeing that
> + * all tasks are transferred to the right DSQs.
> + */
> + while (scx_parent(sch) && scx_bypassing(sch, cpu))
> + sch = scx_parent(sch);
> +#endif /* CONFIG_EXT_SUB_SCHED */
> +
> + return bypass_dsq(sch, cpu);
> +}
> +
> /**
> * bypass_dsp_enabled - Check if bypass dispatch path is enabled
> * @sch: scheduler to check
> @@ -1650,7 +1671,7 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> dsq = find_global_dsq(sch, p);
> goto enqueue;
> bypass:
> - dsq = bypass_dsq(sch, task_cpu(p));
> + dsq = bypass_enq_target_dsq(sch, task_cpu(p));
> goto enqueue;
>
> enqueue:
> @@ -2420,8 +2441,33 @@ static bool scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
> if (consume_global_dsq(sch, rq))
> return true;
>
> - if (bypass_dsp_enabled(sch) && scx_bypassing(sch, cpu))
> - return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
> + if (bypass_dsp_enabled(sch)) {
> + /* if @sch is bypassing, only the bypass DSQs are active */
> + if (scx_bypassing(sch, cpu))
> + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
> +
> +#ifdef CONFIG_EXT_SUB_SCHED
> + /*
> + * If @sch isn't bypassing but its children are, @sch is
> + * responsible for making forward progress for both its own
> + * tasks that aren't bypassing and the bypassing descendants'
> + * tasks. The following implements a simple built-in behavior -
> + * let each CPU try to run the bypass DSQ every Nth time.
> + *
> + * Later, if necessary, we can add an ops flag to suppress the
> + * auto-consumption and a kfunc to consume the bypass DSQ, so
> + * that the BPF scheduler can fully control scheduling of
> + * bypassed tasks.
> + */
> + struct scx_sched_pcpu *pcpu = per_cpu_ptr(sch->pcpu, cpu);
> +
> + if (!(pcpu->bypass_host_seq++ % SCX_BYPASS_HOST_NTH) &&
> + consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu))) {
> + __scx_add_event(sch, SCX_EV_SUB_BYPASS_DISPATCH, 1);
> + return true;
> + }
> +#endif /* CONFIG_EXT_SUB_SCHED */
> + }
>
> if (unlikely(!SCX_HAS_OP(sch, dispatch)) || !scx_rq_online(rq))
> return false;
> @@ -2467,6 +2513,14 @@ static bool scx_dispatch_sched(struct scx_sched *sch, struct rq *rq,
> }
> } while (dspc->nr_tasks);
>
> + /*
> + * Prevent the CPU from going idle while bypassed descendants have tasks
> + * queued. Without this fallback, bypassed tasks could stall if the host
> + * scheduler's ops.dispatch() doesn't yield any tasks.
> + */
> + if (bypass_dsp_enabled(sch))
> + return consume_dispatch_q(sch, rq, bypass_dsq(sch, cpu));
> +
> return false;
> }
>
> @@ -4085,6 +4139,7 @@ static ssize_t scx_attr_events_show(struct kobject *kobj,
> at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_DISPATCH);
> at += scx_attr_event_show(buf, at, &events, SCX_EV_BYPASS_ACTIVATE);
> at += scx_attr_event_show(buf, at, &events, SCX_EV_INSERT_NOT_OWNED);
> + at += scx_attr_event_show(buf, at, &events, SCX_EV_SUB_BYPASS_DISPATCH);
> return at;
> }
> SCX_ATTR(events);
> @@ -4460,6 +4515,7 @@ static bool dec_bypass_depth(struct scx_sched *sch)
>
> static void enable_bypass_dsp(struct scx_sched *sch)
> {
> + struct scx_sched *host = scx_parent(sch) ?: sch;
> u32 intv_us = READ_ONCE(scx_bypass_lb_intv_us);
> s32 ret;
>
> @@ -4471,14 +4527,35 @@ static void enable_bypass_dsp(struct scx_sched *sch)
> return;
>
> /*
> - * The LB timer will stop running if bypass_arm_depth is 0. Increment
> - * before starting the LB timer.
> + * When a sub-sched bypasses, its tasks are queued on the bypass DSQs of
> + * the nearest non-bypassing ancestor or root. As enable_bypass_dsp() is
> + * called iff @sch is not already bypassed due to an ancestor bypassing,
> + * we can assume that the parent is not bypassing and thus will be the
> + * host of the bypass DSQs.
> + *
> + * While the situation may change in the future, the following
> + * guarantees that the nearest non-bypassing ancestor or root has bypass
> + * dispatch enabled while a descendant is bypassing, which is all that's
> + * required.
> + *
> + * bypass_dsp_enabled() test is used to determine whether to enter the
> + * bypass dispatch handling path from both bypassing and hosting scheds.
> + * Bump enable depth on both @sch and bypass dispatch host.
> */
> ret = atomic_inc_return(&sch->bypass_dsp_enable_depth);
> WARN_ON_ONCE(ret <= 0);
>
> - if (intv_us && !timer_pending(&sch->bypass_lb_timer))
> - mod_timer(&sch->bypass_lb_timer,
> + if (host != sch) {
> + ret = atomic_inc_return(&host->bypass_dsp_enable_depth);
> + WARN_ON_ONCE(ret <= 0);
> + }
> +
> + /*
> + * The LB timer will stop running if bypass dispatch is disabled. Start
> + * after enabling bypass dispatch.
> + */
> + if (intv_us && !timer_pending(&host->bypass_lb_timer))
> + mod_timer(&host->bypass_lb_timer,
> jiffies + usecs_to_jiffies(intv_us));
> }
>
> @@ -4492,6 +4569,11 @@ static void disable_bypass_dsp(struct scx_sched *sch)
>
> ret = atomic_dec_return(&sch->bypass_dsp_enable_depth);
> WARN_ON_ONCE(ret < 0);
> +
> + if (scx_parent(sch)) {
> + ret = atomic_dec_return(&scx_parent(sch)->bypass_dsp_enable_depth);
> + WARN_ON_ONCE(ret < 0);
> + }
> }
>
> /**
> @@ -5266,6 +5348,7 @@ static void scx_dump_state(struct scx_exit_info *ei, size_t dump_len)
> scx_dump_event(s, &events, SCX_EV_BYPASS_DISPATCH);
> scx_dump_event(s, &events, SCX_EV_BYPASS_ACTIVATE);
> scx_dump_event(s, &events, SCX_EV_INSERT_NOT_OWNED);
> + scx_dump_event(s, &events, SCX_EV_SUB_BYPASS_DISPATCH);
>
> if (seq_buf_has_overflowed(&s) && dump_len >= sizeof(trunc_marker))
> memcpy(ei->dump + dump_len - sizeof(trunc_marker),
> diff --git a/kernel/sched/ext_internal.h b/kernel/sched/ext_internal.h
> index fd2671340019..79d44d396152 100644
> --- a/kernel/sched/ext_internal.h
> +++ b/kernel/sched/ext_internal.h
> @@ -24,6 +24,8 @@ enum scx_consts {
> */
> SCX_TASK_ITER_BATCH = 32,
>
> + SCX_BYPASS_HOST_NTH = 2,
> +
> SCX_BYPASS_LB_DFL_INTV_US = 500 * USEC_PER_MSEC,
> SCX_BYPASS_LB_DONOR_PCT = 125,
> SCX_BYPASS_LB_MIN_DELTA_DIV = 4,
> @@ -923,6 +925,12 @@ struct scx_event_stats {
> * scheduler.
> */
> s64 SCX_EV_INSERT_NOT_OWNED;
> +
> + /*
> + * The number of times tasks from bypassing descendants are scheduled
> + * from sub_bypass_dsq's.
> + */
> + s64 SCX_EV_SUB_BYPASS_DISPATCH;
> };
>
> enum scx_sched_pcpu_flags {
> @@ -940,6 +948,9 @@ struct scx_sched_pcpu {
> struct scx_event_stats event_stats;
>
> struct scx_dispatch_q bypass_dsq;
> +#ifdef CONFIG_EXT_SUB_SCHED
> + u32 bypass_host_seq;
> +#endif
> };
>
> struct scx_sched {
> --
> 2.53.0
>