From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
void@manifault.com, changwoo@igalia.com, emil@etsalapatis.com
Subject: Re: [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support
Date: Fri, 6 Mar 2026 08:29:14 +0100
Message-ID: <aaqCSg8fcR4L74u1@gpd4>
In-Reply-To: <20260304220119.4095551-1-tj@kernel.org>

Hi Tejun,

On Wed, Mar 04, 2026 at 12:00:45PM -1000, Tejun Heo wrote:
> This patchset has been around for a while. I'm planning to apply this soon
> and resolve remaining issues incrementally.
>
> This patchset implements cgroup sub-scheduler support for sched_ext, enabling
> multiple scheduler instances to be attached to the cgroup hierarchy. This is a
> partial implementation focusing on the dispatch path - select_cpu and enqueue
> paths will be updated in subsequent patchsets. While incomplete, the dispatch
> path changes are sufficient to demonstrate and exercise the core sub-scheduler
> structures.
>
> Motivation
> ==========
>
> Applications often have domain-specific knowledge that generic schedulers cannot
> possess. Database systems understand query priorities and lock holder
> criticality. Virtual machine monitors can coordinate with guest schedulers and
> handle vCPU placement intelligently. Game engines know rendering deadlines and
> which threads are latency-critical.
>
> On multi-tenant systems where multiple such workloads coexist, implementing
> application-customized scheduling is difficult. Hard partitioning with cpuset
> lacks the dynamism needed - users often don't care about specific CPU
> assignments and want optimizations enabled by sharing a larger machine:
> opportunistic over-commit, improving latency-critical workload characteristics
> while maintaining bandwidth fairness, and packing similar workloads on the same
> L3 caches for efficiency.
>
> Sub-scheduler support addresses this by allowing schedulers to be attached to
> the cgroup hierarchy. Each application domain runs its own BPF scheduler
> tailored to its needs, while a parent scheduler dynamically controls CPU
> allocation to children without static partitioning.
>
> Structure
> =========
>
> Schedulers attach to cgroup nodes forming a hierarchy up to SCX_SUB_MAX_DEPTH
> (4) levels deep. Each scheduler instance maintains its own state including
> default time slice, watchdog, and bypass mode. Tasks belong to exactly one
> scheduler - the one attached to their cgroup or the nearest ancestor with a
> scheduler attached.
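
The nearest-ancestor rule quoted above can be sketched as a small userspace model. This is only an illustration: `model_sched`, `model_cgroup` and `model_cgroup_sched()` are hypothetical names invented here, not the structures or helpers from the patchset.

```c
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct cgroup or scx_sched. */
struct model_sched {
	const char *name;
};

struct model_cgroup {
	struct model_cgroup *parent;	/* NULL for the root cgroup */
	struct model_sched *sched;	/* NULL if no scheduler is attached here */
};

/*
 * Return the scheduler that owns tasks in @cgrp: the one attached to @cgrp
 * itself or, failing that, to the nearest ancestor that has one.
 */
static struct model_sched *model_cgroup_sched(const struct model_cgroup *cgrp)
{
	for (; cgrp; cgrp = cgrp->parent)
		if (cgrp->sched)
			return cgrp->sched;
	return NULL;	/* no scheduler anywhere on the path to the root */
}
```

In this model, a cgroup with no scheduler of its own resolves to whichever ancestor scheduler is closest, so attaching a scheduler to a subtree root covers every descendant that does not attach its own.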
>
> A parent scheduler is responsible for allocating CPU time to its children. When
> a parent's ops.dispatch() is invoked, it can call scx_bpf_sub_dispatch() to
> trigger dispatch on a child scheduler, allowing the parent to control when and
> how much CPU time each child receives. Currently only the dispatch path supports
> this - ops.select_cpu() and ops.enqueue() always operate on the task's own
> scheduler. Full support for these paths will follow in subsequent patchsets.
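
As a rough illustration of the parent-driven dispatch described above, here is a userspace model in which a parent offers the CPU to its children in round-robin order. Everything in it is an assumption made for the sketch: `model_sub_dispatch()` merely stands in for scx_bpf_sub_dispatch(), whose real signature is not shown in this message, and the round-robin policy is just one possible parent policy.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4

/* Hypothetical model of a child scheduler with a queue of runnable tasks. */
struct model_child {
	int nr_queued;		/* tasks waiting in this child's queues */
	int nr_dispatched;	/* tasks the child has handed to the CPU */
};

struct model_parent {
	struct model_child *children[MODEL_MAX_CHILDREN];
	size_t nr_children;
	size_t next;		/* round-robin cursor */
};

/*
 * Stand-in for scx_bpf_sub_dispatch(): ask @child to dispatch one task.
 * Returns true if the child had something to run.
 */
static bool model_sub_dispatch(struct model_child *child)
{
	if (child->nr_queued == 0)
		return false;
	child->nr_queued--;
	child->nr_dispatched++;
	return true;
}

/*
 * Model of a parent's dispatch callback: offer the CPU to children in
 * round-robin order and stop at the first one that dispatches a task.
 */
static bool model_parent_dispatch(struct model_parent *parent)
{
	for (size_t i = 0; i < parent->nr_children; i++) {
		size_t idx = (parent->next + i) % parent->nr_children;

		if (model_sub_dispatch(parent->children[idx])) {
			parent->next = (idx + 1) % parent->nr_children;
			return true;
		}
	}
	return false;	/* no child had runnable work */
}
```

The point of the model is that the parent, not the child, decides which child gets the next opportunity to run; a real parent could weight children or cap their share instead of rotating evenly.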
>
> Kfuncs use the new KF_IMPLICIT_ARGS BPF feature to identify their calling
> scheduler - the kernel passes bpf_prog_aux implicitly, from which scx_prog_sched()
> finds the associated scx_sched. This enables authority enforcement ensuring
> schedulers can only manipulate their own tasks, preventing cross-scheduler
> interference.
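
The ownership check described above can be modelled in userspace roughly as follows. This is a sketch under stated assumptions: `model_prog_aux` stands in for bpf_prog_aux, `model_prog_sched()` for scx_prog_sched(), and the `-EPERM` return is an arbitrary choice for the model; how the kernel actually reports a violation is not stated in this message.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; not the kernel's definitions. */
struct model_sched {
	int id;
};

/* Stand-in for bpf_prog_aux: each BPF program is tied to one scheduler. */
struct model_prog_aux {
	struct model_sched *sched;
};

struct model_task {
	struct model_sched *sched;	/* scheduler that owns this task */
	uint64_t slice;
};

/* Stand-in for scx_prog_sched(): map the calling program to its scheduler. */
static struct model_sched *model_prog_sched(const struct model_prog_aux *aux)
{
	return aux ? aux->sched : NULL;
}

/*
 * Model of a kfunc that updates a task's slice. @aux plays the role of the
 * implicitly passed argument; the caller may only touch tasks it owns.
 */
static int model_task_set_slice(struct model_task *p, uint64_t slice,
				const struct model_prog_aux *aux)
{
	struct model_sched *sch = model_prog_sched(aux);

	if (!sch || p->sched != sch)
		return -EPERM;	/* cross-scheduler access is rejected */
	p->slice = slice;
	return 0;
}
```

Because the caller's identity comes from an argument the BPF program does not supply itself, a scheduler in this model cannot claim to be another scheduler when modifying a task.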
>
> Bypass mode, used for error recovery and orderly shutdown, propagates
> hierarchically - when a scheduler enters bypass, its descendants follow. This
> ensures forward progress even when nested schedulers malfunction. The dump
> infrastructure supports multiple schedulers, identifying which scheduler each
> task and DSQ belongs to for debugging.

I've reviewed and conducted some basic testing with this. Apart from a
few minor nits, I haven't noticed any bugs or performance regressions, even
when using scx_bpf_task_set_slice/dsq_vtime(), which is really good! I'll keep
running more tests, but for now everything looks good to me. Good job!

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

>
> Patches
> =======
>
> 0001-0004: Preparatory changes exposing cgroup helpers, adding cgroup subtree
> iteration for sched_ext, passing kernel_clone_args to scx_fork(), and reordering
> sched_post_fork() after cgroup_post_fork().
>
> 0005-0006: Reorganize enable/disable paths in preparation for multiple scheduler
> instances.
>
> 0007-0009: Core sub-scheduler infrastructure introducing scx_sched structure,
> cgroup attachment, scx_task_sched() for task-to-scheduler mapping, and
> scx_prog_sched() for BPF program-to-scheduler association.
>
> 0010-0012: Authority enforcement ensuring schedulers can only manipulate their
> own tasks in dispatch, DSQ operations, and task state updates.
>
> 0013-0014: Refactor task init/exit helpers and update scx_prio_less() to handle
> tasks from different schedulers.
>
> 0015-0018: Migrate global state to per-scheduler fields: default slice, aborting
> flag, bypass DSQ, and bypass state.
>
> 0019-0023: Implement hierarchical bypass mode where bypass state propagates from
> parent to descendants, with proper separation of bypass dispatch enabling.
>
> 0024-0028: Multi-scheduler dispatch and diagnostics - dispatching from all
> scheduler instances, per-scheduler dispatch context, watchdog awareness, and
> multi-scheduler dump support.
>
> 0029: Implement sub-scheduler enabling and disabling with proper task migration
> between parent and child schedulers.
>
> 0030-0034: Building blocks for nested dispatching including scx_sched back
> pointers, reenqueue awareness, scheduler linking helpers, rhashtable lookup, and
> scx_bpf_sub_dispatch() kfunc.
>
> v3:
> - Adapt to for-7.0-fixes change that punts enable path to kthread to avoid
>   starvation. Keep scx_enable() as unified entry dispatching to
>   scx_root_enable_workfn() or scx_sub_enable_workfn() (#6, #7, #29).
>
> - Fix build with various config combinations (Andrea):
>   - !CONFIG_CGROUP: add root_cgroup()/sch_cgroup() accessors with stubs
>     (#7, #29, #31).
>   - !CONFIG_EXT_SUB_SCHED: add null define for scx_enabling_sub_sched,
>     guard unguarded references, use scx_task_on_sched() helper (#21, #23,
>     #29).
>   - !CONFIG_EXT_GROUP_SCHED: remove unused tg variable (#13).
>
> - Note scx_is_descendant() usage by later patch to address bisect concern
>   (#7) (Andrea).
>
> v2: http://lkml.kernel.org/r/20260225050109.1070059-1-tj@kernel.org
> v1: http://lkml.kernel.org/r/20260121231140.832332-1-tj@kernel.org
>
> Based on sched_ext/for-7.1 (0e953de88b92). The scx_claim_exit() preempt
> fix which was a separate prerequisite for v2 has been merged into for-7.1.
>
> Git tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-sched-v3
>
>  include/linux/cgroup-defs.h              |    4 +
>  include/linux/cgroup.h                   |   65 +-
>  include/linux/sched/ext.h                |   11 +
>  init/Kconfig                             |    4 +
>  kernel/cgroup/cgroup-internal.h          |    6 -
>  kernel/cgroup/cgroup.c                   |   55 -
>  kernel/fork.c                            |    6 +-
>  kernel/sched/core.c                      |    2 +-
>  kernel/sched/ext.c                       | 2388 +++++++++++++++++++++++-------
>  kernel/sched/ext.h                       |    4 +-
>  kernel/sched/ext_idle.c                  |  104 +-
>  kernel/sched/ext_internal.h              |  248 +++-
>  kernel/sched/sched.h                     |    7 +-
>  tools/sched_ext/include/scx/common.bpf.h |    1 +
>  tools/sched_ext/include/scx/compat.h     |   10 +
>  tools/sched_ext/scx_qmap.bpf.c           |   44 +-
>  tools/sched_ext/scx_qmap.c               |   13 +-
>  17 files changed, 2321 insertions(+), 651 deletions(-)
>
> --
> tejun

Thread overview: 47+ messages
2026-03-04 22:00 [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-03-04 22:00 ` [PATCH 01/34] sched_ext: Implement cgroup subtree iteration for scx_task_iter Tejun Heo
2026-03-04 22:00 ` [PATCH 02/34] sched_ext: Add @kargs to scx_fork() Tejun Heo
2026-03-04 22:00 ` [PATCH 03/34] sched/core: Swap the order between sched_post_fork() and cgroup_post_fork() Tejun Heo
2026-03-06 4:17 ` Tejun Heo
2026-03-06 8:44 ` Peter Zijlstra
2026-03-04 22:00 ` [PATCH 04/34] cgroup: Expose some cgroup helpers Tejun Heo
2026-03-06 4:18 ` Tejun Heo
2026-03-04 22:00 ` [PATCH 05/34] sched_ext: Update p->scx.disallow warning in scx_init_task() Tejun Heo
2026-03-04 22:00 ` [PATCH 06/34] sched_ext: Reorganize enable/disable path for multi-scheduler support Tejun Heo
2026-03-04 22:00 ` [PATCH 07/34] sched_ext: Introduce cgroup sub-sched support Tejun Heo
2026-03-04 22:00 ` [PATCH 08/34] sched_ext: Introduce scx_task_sched[_rcu]() Tejun Heo
2026-03-04 22:00 ` [PATCH 09/34] sched_ext: Introduce scx_prog_sched() Tejun Heo
2026-03-04 22:00 ` [PATCH 10/34] sched_ext: Enforce scheduling authority in dispatch and select_cpu operations Tejun Heo
2026-03-04 22:00 ` [PATCH 11/34] sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime Tejun Heo
2026-03-04 22:00 ` [PATCH 12/34] sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler Tejun Heo
2026-03-04 22:00 ` [PATCH 13/34] sched_ext: Refactor task init/exit helpers Tejun Heo
2026-03-04 22:00 ` [PATCH 14/34] sched_ext: Make scx_prio_less() handle multiple schedulers Tejun Heo
2026-03-04 22:01 ` [PATCH 15/34] sched_ext: Move default slice to per-scheduler field Tejun Heo
2026-03-04 22:01 ` [PATCH 16/34] sched_ext: Move aborting flag " Tejun Heo
2026-03-04 22:01 ` [PATCH 17/34] sched_ext: Move bypass_dsq into scx_sched_pcpu Tejun Heo
2026-03-04 22:01 ` [PATCH 18/34] sched_ext: Move bypass state into scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 19/34] sched_ext: Prepare bypass mode for hierarchical operation Tejun Heo
2026-03-04 22:01 ` [PATCH 20/34] sched_ext: Factor out scx_dispatch_sched() Tejun Heo
2026-03-04 22:01 ` [PATCH 21/34] sched_ext: When calling ops.dispatch() @prev must be on the same scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 22/34] sched_ext: Separate bypass dispatch enabling from bypass depth tracking Tejun Heo
2026-03-04 22:01 ` [PATCH 23/34] sched_ext: Implement hierarchical bypass mode Tejun Heo
2026-03-06 7:03 ` Andrea Righi
2026-03-06 7:23 ` Andrea Righi
2026-03-06 17:39 ` [PATCH v2 " Tejun Heo
2026-03-04 22:01 ` [PATCH 24/34] sched_ext: Dispatch from all scx_sched instances Tejun Heo
2026-03-04 22:01 ` [PATCH 25/34] sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 26/34] sched_ext: Make watchdog sub-sched aware Tejun Heo
2026-03-04 22:01 ` [PATCH 27/34] sched_ext: Convert scx_dump_state() spinlock to raw spinlock Tejun Heo
2026-03-04 22:01 ` [PATCH 28/34] sched_ext: Support dumping multiple schedulers and add scheduler identification Tejun Heo
2026-03-04 22:01 ` [PATCH 29/34] sched_ext: Implement cgroup sub-sched enabling and disabling Tejun Heo
2026-03-06 9:41 ` Cheng-Yang Chou
2026-03-06 17:39 ` [PATCH v2 " Tejun Heo
2026-03-04 22:01 ` [PATCH 30/34] sched_ext: Add scx_sched back pointer to scx_sched_pcpu Tejun Heo
2026-03-04 22:01 ` [PATCH 31/34] sched_ext: Make scx_bpf_reenqueue_local() sub-sched aware Tejun Heo
2026-03-04 22:01 ` [PATCH 32/34] sched_ext: Factor out scx_link_sched() and scx_unlink_sched() Tejun Heo
2026-03-04 22:01 ` [PATCH 33/34] sched_ext: Add rhashtable lookup for sub-schedulers Tejun Heo
2026-03-04 22:01 ` [PATCH 34/34] sched_ext: Add basic building blocks for nested sub-scheduler dispatching Tejun Heo
2026-03-06 4:09 ` [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-03-06 4:17 ` Tejun Heo
2026-03-06 7:29 ` Andrea Righi [this message]
2026-03-06 18:14 ` Tejun Heo