public inbox for linux-kernel@vger.kernel.org
From: Tejun Heo <tj@kernel.org>
To: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev
Cc: void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
	emil@etsalapatis.com, Tejun Heo <tj@kernel.org>
Subject: [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support
Date: Wed,  4 Mar 2026 12:00:45 -1000
Message-ID: <20260304220119.4095551-1-tj@kernel.org>

This patchset has been around for a while. I'm planning to apply this soon
and resolve remaining issues incrementally.

This patchset implements cgroup sub-scheduler support for sched_ext, enabling
multiple scheduler instances to be attached to the cgroup hierarchy. This is a
partial implementation focusing on the dispatch path - select_cpu and enqueue
paths will be updated in subsequent patchsets. While incomplete, the dispatch
path changes are sufficient to demonstrate and exercise the core sub-scheduler
structures.

Motivation
==========

Applications often have domain-specific knowledge that generic schedulers cannot
possess. Database systems understand query priorities and lock holder
criticality. Virtual machine monitors can coordinate with guest schedulers and
handle vCPU placement intelligently. Game engines know rendering deadlines and
which threads are latency-critical.

On multi-tenant systems where multiple such workloads coexist, implementing
application-customized scheduling is difficult. Hard partitioning with cpuset
lacks the dynamism needed - users often don't care about specific CPU
assignments and want optimizations enabled by sharing a larger machine:
opportunistic over-commit, improving latency-critical workload characteristics
while maintaining bandwidth fairness, and packing similar workloads on the same
L3 caches for efficiency.

Sub-scheduler support addresses this by allowing schedulers to be attached to
the cgroup hierarchy. Each application domain runs its own BPF scheduler
tailored to its needs, while a parent scheduler dynamically controls CPU
allocation to children without static partitioning.

Structure
=========

Schedulers attach to cgroup nodes forming a hierarchy up to SCX_SUB_MAX_DEPTH
(4) levels deep. Each scheduler instance maintains its own state including
default time slice, watchdog, and bypass mode. Tasks belong to exactly one
scheduler - the one attached to their cgroup or the nearest ancestor with a
scheduler attached.
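
The nearest-ancestor rule can be sketched as a walk up the cgroup tree. This is
a minimal userspace model, not the kernel implementation: struct model_cgroup,
struct model_sched and model_cgroup_sched() are hypothetical names invented for
illustration, and the depth limit is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel's structures. */
struct model_sched {
	const char *name;
};

struct model_cgroup {
	struct model_cgroup *parent;	/* NULL for the root cgroup */
	struct model_sched *sched;	/* NULL if no scheduler is attached here */
};

/*
 * Return the scheduler that owns tasks in @cgrp: the one attached to @cgrp
 * itself or, failing that, the one attached to its nearest ancestor.
 * Returns NULL only if nothing is attached anywhere up to the root.
 */
static struct model_sched *model_cgroup_sched(struct model_cgroup *cgrp)
{
	assert(cgrp != NULL);

	for (; cgrp; cgrp = cgrp->parent)
		if (cgrp->sched)
			return cgrp->sched;
	return NULL;
}
```

In this model, a cgroup without its own scheduler resolves to whatever its
closest scheduled ancestor uses, so attaching a scheduler to an intermediate
cgroup takes over the whole subtree below it except where a deeper scheduler
is attached.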

A parent scheduler is responsible for allocating CPU time to its children. When
a parent's ops.dispatch() is invoked, it can call scx_bpf_sub_dispatch() to
trigger dispatch on a child scheduler, allowing the parent to control when and
how much CPU time each child receives. Currently only the dispatch path supports
this - ops.select_cpu() and ops.enqueue() always operate on the task's own
scheduler. Full support for these paths will follow in subsequent patchsets.
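
As a rough illustration of a parent handing out CPU time, the sketch below
models a parent that offers each dispatch opportunity to its children in
round-robin order. It is a userspace model under stated assumptions:
model_sub_dispatch() is a stand-in for triggering a child's dispatch and does
not reflect the actual signature or semantics of scx_bpf_sub_dispatch(), and
the round-robin policy is just one policy a parent could choose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4	/* arbitrary bound for this model */

/* Hypothetical child scheduler state. */
struct model_child {
	int nr_queued;		/* runnable tasks waiting in this child */
	int nr_dispatched;	/* tasks this child has handed to the CPU */
};

/* Hypothetical parent scheduler state. */
struct model_parent {
	struct model_child *children[MODEL_MAX_CHILDREN];
	size_t nr_children;
	size_t next;		/* round-robin cursor */
};

/* Stand-in for invoking a child's dispatch: true if it produced a task. */
static bool model_sub_dispatch(struct model_child *child)
{
	if (!child->nr_queued)
		return false;
	child->nr_queued--;
	child->nr_dispatched++;
	return true;
}

/*
 * Parent's dispatch: offer the CPU to each child in turn, starting with the
 * one after the child served last. Returns false if no child had work.
 */
static bool model_parent_dispatch(struct model_parent *p)
{
	assert(p->nr_children <= MODEL_MAX_CHILDREN);

	for (size_t i = 0; i < p->nr_children; i++) {
		size_t idx = (p->next + i) % p->nr_children;

		if (model_sub_dispatch(p->children[idx])) {
			p->next = (idx + 1) % p->nr_children;
			return true;
		}
	}
	return false;
}
```

Because the parent decides which child to ask and in what order, it can
replace the round-robin cursor with weights, budgets or deadlines without the
children needing to know.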

Kfuncs use the new KF_IMPLICIT_ARGS BPF feature to identify their calling
scheduler - the kernel passes bpf_prog_aux implicitly, from which scx_prog_sched()
finds the associated scx_sched. This enables authority enforcement ensuring
schedulers can only manipulate their own tasks, preventing cross-scheduler
interference.
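
The ownership check can be pictured with the following userspace model. All
names here (struct model_prog_aux, model_prog_sched(), model_task_set_slice())
are hypothetical, and returning -EPERM is an assumption made for the sketch;
how the kernel reports a violation is not described in this cover letter.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model types; these are not the kernel's structures. */
struct model_sched {
	int id;
};

/* Stand-in for the per-program context the kernel passes implicitly. */
struct model_prog_aux {
	struct model_sched *sched;	/* scheduler the calling program belongs to */
};

struct model_task {
	struct model_sched *sched;	/* scheduler that owns this task */
	unsigned long long slice;
};

/* Map a calling program to its scheduler. */
static struct model_sched *model_prog_sched(const struct model_prog_aux *aux)
{
	return aux->sched;
}

/*
 * Model of a kfunc that updates task state. The caller never names its own
 * scheduler; it is derived from @aux, so a program cannot claim to be
 * another scheduler. Returns 0 on success, -EPERM if the caller does not
 * own @p (the error code is an assumption of this model).
 */
static int model_task_set_slice(const struct model_prog_aux *aux,
				struct model_task *p, unsigned long long slice)
{
	assert(aux != NULL && p != NULL);

	if (model_prog_sched(aux) != p->sched)
		return -EPERM;
	p->slice = slice;
	return 0;
}
```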

Bypass mode, used for error recovery and orderly shutdown, propagates
hierarchically - when a scheduler enters bypass, its descendants follow. This
ensures forward progress even when nested schedulers malfunction. The dump
infrastructure supports multiple schedulers, identifying which scheduler each
task and DSQ belongs to for debugging.
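
Hierarchical propagation of bypass can be modeled with a per-scheduler depth
counter that is adjusted across a whole subtree. This is a sketch under
assumptions: struct model_sched, model_bypass() and model_bypassing() are
hypothetical names, and the kernel's locking and actual bookkeeping are not
represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_CHILDREN 4	/* arbitrary bound for this model */

/* Hypothetical scheduler node in the hierarchy. */
struct model_sched {
	struct model_sched *children[MODEL_MAX_CHILDREN];
	size_t nr_children;
	int bypass_depth;	/* > 0 while this scheduler is bypassed */
};

/*
 * Enter (@enable == true) or leave bypass on @sch and every descendant.
 * A counter rather than a flag lets a child stay bypassed while any of its
 * ancestors, or the child itself, still has an outstanding bypass request.
 */
static void model_bypass(struct model_sched *sch, bool enable)
{
	sch->bypass_depth += enable ? 1 : -1;
	assert(sch->bypass_depth >= 0);	/* catch unbalanced disables */

	for (size_t i = 0; i < sch->nr_children; i++)
		model_bypass(sch->children[i], enable);
}

static bool model_bypassing(const struct model_sched *sch)
{
	return sch->bypass_depth > 0;
}
```

In this model, bypassing a mid-level scheduler leaves its parent and siblings
untouched while forcing everything below it into bypass.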

Patches
=======

0001-0004: Preparatory changes exposing cgroup helpers, adding cgroup subtree
iteration for sched_ext, passing kernel_clone_args to scx_fork(), and reordering
sched_post_fork() after cgroup_post_fork().

0005-0006: Update the p->scx.disallow warning in scx_init_task() and reorganize
the enable/disable paths in preparation for multiple scheduler instances.

0007-0009: Core sub-scheduler infrastructure introducing scx_sched structure,
cgroup attachment, scx_task_sched() for task-to-scheduler mapping, and
scx_prog_sched() for BPF program-to-scheduler association.

0010-0012: Authority enforcement ensuring schedulers can only manipulate their
own tasks in dispatch, DSQ operations, and task state updates.

0013-0014: Refactor task init/exit helpers and update scx_prio_less() to handle
tasks from different schedulers.

0015-0018: Migrate global state to per-scheduler fields: default slice, aborting
flag, bypass DSQ, and bypass state.

0019-0023: Implement hierarchical bypass mode where bypass state propagates from
parent to descendants, with proper separation of bypass dispatch enabling.

0024-0028: Multi-scheduler dispatch and diagnostics - dispatching from all
scheduler instances, per-scheduler dispatch context, watchdog awareness, and
multi-scheduler dump support.

0029: Implement sub-scheduler enabling and disabling with proper task migration
between parent and child schedulers.

0030-0034: Building blocks for nested dispatching, including scx_sched back
pointers, reenqueue awareness, scheduler linking helpers, rhashtable lookup, and
the scx_bpf_sub_dispatch() kfunc.

v3:
- Adapt to for-7.0-fixes change that punts enable path to kthread to avoid
  starvation. Keep scx_enable() as unified entry dispatching to
  scx_root_enable_workfn() or scx_sub_enable_workfn() (#6, #7, #29).

- Fix build with various config combinations (Andrea):
  - !CONFIG_CGROUP: add root_cgroup()/sch_cgroup() accessors with stubs
    (#7, #29, #31).
  - !CONFIG_EXT_SUB_SCHED: add null define for scx_enabling_sub_sched,
    guard unguarded references, use scx_task_on_sched() helper (#21, #23,
    #29).
  - !CONFIG_EXT_GROUP_SCHED: remove unused tg variable (#13).

- Note scx_is_descendant() usage by later patch to address bisect concern
  (#7) (Andrea).

v2: http://lkml.kernel.org/r/20260225050109.1070059-1-tj@kernel.org
v1: http://lkml.kernel.org/r/20260121231140.832332-1-tj@kernel.org

Based on sched_ext/for-7.1 (0e953de88b92). The scx_claim_exit() preempt fix,
which was a separate prerequisite for v2, has been merged into for-7.1.

Git tree:
  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-sched-v3

 include/linux/cgroup-defs.h              |    4 +
 include/linux/cgroup.h                   |   65 +-
 include/linux/sched/ext.h                |   11 +
 init/Kconfig                             |    4 +
 kernel/cgroup/cgroup-internal.h          |    6 -
 kernel/cgroup/cgroup.c                   |   55 -
 kernel/fork.c                            |    6 +-
 kernel/sched/core.c                      |    2 +-
 kernel/sched/ext.c                       | 2388 +++++++++++++++++++++++-------
 kernel/sched/ext.h                       |    4 +-
 kernel/sched/ext_idle.c                  |  104 +-
 kernel/sched/ext_internal.h              |  248 +++-
 kernel/sched/sched.h                     |    7 +-
 tools/sched_ext/include/scx/common.bpf.h |    1 +
 tools/sched_ext/include/scx/compat.h     |   10 +
 tools/sched_ext/scx_qmap.bpf.c           |   44 +-
 tools/sched_ext/scx_qmap.c               |   13 +-
 17 files changed, 2321 insertions(+), 651 deletions(-)

--
tejun


Thread overview: 47+ messages
2026-03-04 22:00 Tejun Heo [this message]
2026-03-04 22:00 ` [PATCH 01/34] sched_ext: Implement cgroup subtree iteration for scx_task_iter Tejun Heo
2026-03-04 22:00 ` [PATCH 02/34] sched_ext: Add @kargs to scx_fork() Tejun Heo
2026-03-04 22:00 ` [PATCH 03/34] sched/core: Swap the order between sched_post_fork() and cgroup_post_fork() Tejun Heo
2026-03-06  4:17   ` Tejun Heo
2026-03-06  8:44     ` Peter Zijlstra
2026-03-04 22:00 ` [PATCH 04/34] cgroup: Expose some cgroup helpers Tejun Heo
2026-03-06  4:18   ` Tejun Heo
2026-03-04 22:00 ` [PATCH 05/34] sched_ext: Update p->scx.disallow warning in scx_init_task() Tejun Heo
2026-03-04 22:00 ` [PATCH 06/34] sched_ext: Reorganize enable/disable path for multi-scheduler support Tejun Heo
2026-03-04 22:00 ` [PATCH 07/34] sched_ext: Introduce cgroup sub-sched support Tejun Heo
2026-03-04 22:00 ` [PATCH 08/34] sched_ext: Introduce scx_task_sched[_rcu]() Tejun Heo
2026-03-04 22:00 ` [PATCH 09/34] sched_ext: Introduce scx_prog_sched() Tejun Heo
2026-03-04 22:00 ` [PATCH 10/34] sched_ext: Enforce scheduling authority in dispatch and select_cpu operations Tejun Heo
2026-03-04 22:00 ` [PATCH 11/34] sched_ext: Enforce scheduler ownership when updating slice and dsq_vtime Tejun Heo
2026-03-04 22:00 ` [PATCH 12/34] sched_ext: scx_dsq_move() should validate the task belongs to the right scheduler Tejun Heo
2026-03-04 22:00 ` [PATCH 13/34] sched_ext: Refactor task init/exit helpers Tejun Heo
2026-03-04 22:00 ` [PATCH 14/34] sched_ext: Make scx_prio_less() handle multiple schedulers Tejun Heo
2026-03-04 22:01 ` [PATCH 15/34] sched_ext: Move default slice to per-scheduler field Tejun Heo
2026-03-04 22:01 ` [PATCH 16/34] sched_ext: Move aborting flag " Tejun Heo
2026-03-04 22:01 ` [PATCH 17/34] sched_ext: Move bypass_dsq into scx_sched_pcpu Tejun Heo
2026-03-04 22:01 ` [PATCH 18/34] sched_ext: Move bypass state into scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 19/34] sched_ext: Prepare bypass mode for hierarchical operation Tejun Heo
2026-03-04 22:01 ` [PATCH 20/34] sched_ext: Factor out scx_dispatch_sched() Tejun Heo
2026-03-04 22:01 ` [PATCH 21/34] sched_ext: When calling ops.dispatch() @prev must be on the same scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 22/34] sched_ext: Separate bypass dispatch enabling from bypass depth tracking Tejun Heo
2026-03-04 22:01 ` [PATCH 23/34] sched_ext: Implement hierarchical bypass mode Tejun Heo
2026-03-06  7:03   ` Andrea Righi
2026-03-06  7:23   ` Andrea Righi
2026-03-06 17:39   ` [PATCH v2 " Tejun Heo
2026-03-04 22:01 ` [PATCH 24/34] sched_ext: Dispatch from all scx_sched instances Tejun Heo
2026-03-04 22:01 ` [PATCH 25/34] sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_sched Tejun Heo
2026-03-04 22:01 ` [PATCH 26/34] sched_ext: Make watchdog sub-sched aware Tejun Heo
2026-03-04 22:01 ` [PATCH 27/34] sched_ext: Convert scx_dump_state() spinlock to raw spinlock Tejun Heo
2026-03-04 22:01 ` [PATCH 28/34] sched_ext: Support dumping multiple schedulers and add scheduler identification Tejun Heo
2026-03-04 22:01 ` [PATCH 29/34] sched_ext: Implement cgroup sub-sched enabling and disabling Tejun Heo
2026-03-06  9:41   ` Cheng-Yang Chou
2026-03-06 17:39   ` [PATCH v2 " Tejun Heo
2026-03-04 22:01 ` [PATCH 30/34] sched_ext: Add scx_sched back pointer to scx_sched_pcpu Tejun Heo
2026-03-04 22:01 ` [PATCH 31/34] sched_ext: Make scx_bpf_reenqueue_local() sub-sched aware Tejun Heo
2026-03-04 22:01 ` [PATCH 32/34] sched_ext: Factor out scx_link_sched() and scx_unlink_sched() Tejun Heo
2026-03-04 22:01 ` [PATCH 33/34] sched_ext: Add rhashtable lookup for sub-schedulers Tejun Heo
2026-03-04 22:01 ` [PATCH 34/34] sched_ext: Add basic building blocks for nested sub-scheduler dispatching Tejun Heo
2026-03-06  4:09 ` [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support Tejun Heo
2026-03-06  4:17 ` Tejun Heo
2026-03-06  7:29 ` Andrea Righi
2026-03-06 18:14 ` Tejun Heo
