From: Andrea Righi <arighi@nvidia.com>
To: Tejun Heo <tj@kernel.org>
Cc: linux-kernel@vger.kernel.org, sched-ext@lists.linux.dev,
	void@manifault.com, changwoo@igalia.com, emil@etsalapatis.com
Subject: Re: [PATCHSET v3 sched_ext/for-7.1] sched_ext: Implement cgroup sub-scheduler support
Date: Fri, 6 Mar 2026 08:29:14 +0100
Message-ID: <aaqCSg8fcR4L74u1@gpd4>
In-Reply-To: <20260304220119.4095551-1-tj@kernel.org>

Hi Tejun,

On Wed, Mar 04, 2026 at 12:00:45PM -1000, Tejun Heo wrote:
> This patchset has been around for a while. I'm planning to apply this soon
> and resolve remaining issues incrementally.
> 
> This patchset implements cgroup sub-scheduler support for sched_ext, enabling
> multiple scheduler instances to be attached to the cgroup hierarchy. This is a
> partial implementation focusing on the dispatch path - select_cpu and enqueue
> paths will be updated in subsequent patchsets. While incomplete, the dispatch
> path changes are sufficient to demonstrate and exercise the core sub-scheduler
> structures.
> 
> Motivation
> ==========
> 
> Applications often have domain-specific knowledge that generic schedulers cannot
> possess. Database systems understand query priorities and lock holder
> criticality. Virtual machine monitors can coordinate with guest schedulers and
> handle vCPU placement intelligently. Game engines know rendering deadlines and
> which threads are latency-critical.
> 
> On multi-tenant systems where multiple such workloads coexist, implementing
> application-customized scheduling is difficult. Hard partitioning with cpuset
> lacks the dynamism needed - users often don't care about specific CPU
> assignments and want optimizations enabled by sharing a larger machine:
> opportunistic over-commit, improving latency-critical workload characteristics
> while maintaining bandwidth fairness, and packing similar workloads on the same
> L3 caches for efficiency.
> 
> Sub-scheduler support addresses this by allowing schedulers to be attached to
> the cgroup hierarchy. Each application domain runs its own BPF scheduler
> tailored to its needs, while a parent scheduler dynamically controls CPU
> allocation to children without static partitioning.
> 
> Structure
> =========
> 
> Schedulers attach to cgroup nodes forming a hierarchy up to SCX_SUB_MAX_DEPTH
> (4) levels deep. Each scheduler instance maintains its own state including
> default time slice, watchdog, and bypass mode. Tasks belong to exactly one
> scheduler - the one attached to their cgroup or the nearest ancestor with a
> scheduler attached.
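
A minimal sketch of the "nearest ancestor with a scheduler attached" rule
quoted above. The toy_* types and the function name are hypothetical
stand-ins for illustration only; they are not the kernel's struct cgroup,
struct scx_sched, or scx_task_sched().

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's data structures. */
struct toy_sched {
	const char *name;
};

struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root cgroup */
	struct toy_sched *sched;	/* NULL if no scheduler is attached here */
};

/*
 * Walk from @cgrp towards the root and return the first scheduler found.
 * A task in @cgrp would belong to the returned scheduler under the rule
 * described in the cover letter.
 */
static struct toy_sched *toy_nearest_sched(struct toy_cgroup *cgrp)
{
	for (; cgrp; cgrp = cgrp->parent)
		if (cgrp->sched)
			return cgrp->sched;
	return NULL;	/* no scheduler attached anywhere on the path */
}
```

With a scheduler on the root and another on a "db" cgroup, a task in a child
of "db" resolves to the "db" scheduler, while a task in an unrelated sibling
cgroup falls back to the root scheduler.
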
> 
> A parent scheduler is responsible for allocating CPU time to its children. When
> a parent's ops.dispatch() is invoked, it can call scx_bpf_sub_dispatch() to
> trigger dispatch on a child scheduler, allowing the parent to control when and
> how much CPU time each child receives. Currently only the dispatch path supports
> this - ops.select_cpu() and ops.enqueue() always operate on the task's own
> scheduler. Full support for these paths will follow in subsequent patchsets.
> 
> Kfuncs use the new KF_IMPLICIT_ARGS BPF feature to identify their calling
> scheduler - the kernel passes bpf_prog_aux implicitly, from which scx_prog_sched()
> finds the associated scx_sched. This enables authority enforcement ensuring
> schedulers can only manipulate their own tasks, preventing cross-scheduler
> interference.
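
A rough illustration of the authority check described above, assuming the
calling scheduler has already been identified. All names here are
hypothetical, and returning -EPERM is an assumption made for the sketch; the
cover letter does not say how the kernel reports a violation.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the kernel's scx_sched or task_struct. */
struct toy_auth_sched {
	int id;
};

struct toy_task {
	const struct toy_auth_sched *sched;	/* scheduler that owns the task */
	uint64_t slice;
};

/*
 * Update @p's slice only if @caller is the scheduler that owns @p.
 * The -EPERM return value is an assumption for illustration.
 */
static int toy_task_set_slice(const struct toy_auth_sched *caller,
			      struct toy_task *p, uint64_t slice)
{
	if (p->sched != caller)
		return -EPERM;	/* another scheduler's task: leave it untouched */
	p->slice = slice;
	return 0;
}
```

The point of the sketch is only that the ownership comparison happens before
any task state is modified, so a rejected call leaves the task unchanged.
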
> 
> Bypass mode, used for error recovery and orderly shutdown, propagates
> hierarchically - when a scheduler enters bypass, its descendants follow. This
> ensures forward progress even when nested schedulers malfunction. The dump
> infrastructure supports multiple schedulers, identifying which scheduler each
> task and DSQ belongs to for debugging.
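
A small model of the hierarchical bypass propagation described above: a
scheduler is treated as bypassing if it or any of its ancestors has requested
bypass. The struct and function names are hypothetical, and walking the
ancestor chain on every query is a simplification chosen for the sketch, not
a description of how the patchset tracks bypass state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in, not the kernel's scx_sched. */
struct toy_bp_sched {
	struct toy_bp_sched *parent;	/* NULL for the root scheduler */
	bool bypass_requested;		/* bypass was requested on this node */
};

/*
 * Return true if @sch should run in bypass mode, i.e. bypass was requested
 * on @sch itself or on any scheduler above it in the hierarchy.
 */
static bool toy_sched_bypassing(const struct toy_bp_sched *sch)
{
	for (; sch; sch = sch->parent)
		if (sch->bypass_requested)
			return true;
	return false;
}
```

Requesting bypass on a mid-level scheduler makes its children report bypass
as well, while its parent and siblings are unaffected.
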

I've reviewed this and done some basic testing. Apart from the few minor
nits, I haven't noticed any bugs or performance regressions, even when
using scx_bpf_task_set_slice/dsq_vtime(), which is really good! I'll keep
running more tests, but for now everything looks good to me. Good job!

Reviewed-by: Andrea Righi <arighi@nvidia.com>

Thanks,
-Andrea

> 
> Patches
> =======
> 
> 0001-0004: Preparatory changes exposing cgroup helpers, adding cgroup subtree
> iteration for sched_ext, passing kernel_clone_args to scx_fork(), and reordering
> sched_post_fork() after cgroup_post_fork().
> 
> 0005-0006: Reorganize enable/disable paths in preparation for multiple scheduler
> instances.
> 
> 0007-0009: Core sub-scheduler infrastructure introducing scx_sched structure,
> cgroup attachment, scx_task_sched() for task-to-scheduler mapping, and
> scx_prog_sched() for BPF program-to-scheduler association.
> 
> 0010-0012: Authority enforcement ensuring schedulers can only manipulate their
> own tasks in dispatch, DSQ operations, and task state updates.
> 
> 0013-0014: Refactor task init/exit helpers and update scx_prio_less() to handle
> tasks from different schedulers.
> 
> 0015-0018: Migrate global state to per-scheduler fields: default slice, aborting
> flag, bypass DSQ, and bypass state.
> 
> 0019-0023: Implement hierarchical bypass mode where bypass state propagates from
> parent to descendants, with proper separation of bypass dispatch enabling.
> 
> 0024-0028: Multi-scheduler dispatch and diagnostics - dispatching from all
> scheduler instances, per-scheduler dispatch context, watchdog awareness, and
> multi-scheduler dump support.
> 
> 0029: Implement sub-scheduler enabling and disabling with proper task migration
> between parent and child schedulers.
> 
> 0030-0034: Building blocks for nested dispatching including scx_sched back
> pointers, reenqueue awareness, scheduler linking helpers, rhashtable lookup, and
> scx_bpf_sub_dispatch() kfunc.
> 
> v3:
> - Adapt to for-7.0-fixes change that punts enable path to kthread to avoid
>   starvation. Keep scx_enable() as unified entry dispatching to
>   scx_root_enable_workfn() or scx_sub_enable_workfn() (#6, #7, #29).
> 
> - Fix build with various config combinations (Andrea):
>   - !CONFIG_CGROUP: add root_cgroup()/sch_cgroup() accessors with stubs
>     (#7, #29, #31).
>   - !CONFIG_EXT_SUB_SCHED: add null define for scx_enabling_sub_sched,
>     guard unguarded references, use scx_task_on_sched() helper (#21, #23,
>     #29).
>   - !CONFIG_EXT_GROUP_SCHED: remove unused tg variable (#13).
> 
> - Note scx_is_descendant() usage by later patch to address bisect concern
>   (#7) (Andrea).
> 
> v2: http://lkml.kernel.org/r/20260225050109.1070059-1-tj@kernel.org
> v1: http://lkml.kernel.org/r/20260121231140.832332-1-tj@kernel.org
> 
> Based on sched_ext/for-7.1 (0e953de88b92). The scx_claim_exit() preempt
> fix which was a separate prerequisite for v2 has been merged into for-7.1.
> 
> Git tree:
>   git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-sched-v3
> 
>  include/linux/cgroup-defs.h              |    4 +
>  include/linux/cgroup.h                   |   65 +-
>  include/linux/sched/ext.h                |   11 +
>  init/Kconfig                             |    4 +
>  kernel/cgroup/cgroup-internal.h          |    6 -
>  kernel/cgroup/cgroup.c                   |   55 -
>  kernel/fork.c                            |    6 +-
>  kernel/sched/core.c                      |    2 +-
>  kernel/sched/ext.c                       | 2388 +++++++++++++++++++++++-------
>  kernel/sched/ext.h                       |    4 +-
>  kernel/sched/ext_idle.c                  |  104 +-
>  kernel/sched/ext_internal.h              |  248 +++-
>  kernel/sched/sched.h                     |    7 +-
>  tools/sched_ext/include/scx/common.bpf.h |    1 +
>  tools/sched_ext/include/scx/compat.h     |   10 +
>  tools/sched_ext/scx_qmap.bpf.c           |   44 +-
>  tools/sched_ext/scx_qmap.c               |   13 +-
>  17 files changed, 2321 insertions(+), 651 deletions(-)
> 
> --
> tejun
