From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>
Cc: sched-ext@lists.linux.dev, Emil Tsalapatis <emil@etsalapatis.com>,
linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>
Subject: [PATCHSET v2 sched_ext/for-7.1] sched_ext: Implement SCX_ENQ_IMMED
Date: Fri, 13 Mar 2026 01:31:08 -1000
Message-ID: <20260313113114.1591010-1-tj@kernel.org>

Hello,

Currently, BPF schedulers that want to ensure tasks don't linger on local
DSQs behind other tasks, or on CPUs taken by higher-priority scheduling
classes, must resort to hooking the sched_switch tracepoint or implementing
the now-deprecated ops.cpu_acquire/release(). Both approaches are cumbersome
and incomplete - sched_switch doesn't handle cases where a local DSQ ends up
with multiple tasks queued, which can be difficult to control perfectly.
cpu_release() is even more limited, missing cases like a higher-priority
task waking up while an idle CPU is waking up to run an SCX task. Neither can
atomically determine whether a CPU is truly available at the moment of
dispatch.

SCX_ENQ_IMMED replaces these with a single dispatch flag that provides a
kernel-enforced guarantee: a task dispatched with IMMED either gets on the
CPU immediately, or gets reenqueued back to the BPF scheduler. It will never
linger on a local DSQ behind other tasks or be silently put back after
preemption. This gives BPF schedulers comprehensive latency control directly
in the dispatch path.
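
To make the dispatch-time guarantee concrete, here is a toy userspace
model in C. It is not kernel code: toy_cpu, toy_cpu_is_open() and
toy_dispatch() are hypothetical names, and "available" is simplified to
three conditions (nothing else running, no higher-class task pending, empty
local DSQ) rather than the actual rq_is_open() logic in the patchset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model of the IMMED dispatch guarantee. Every name here is
 * hypothetical; none of it mirrors kernel/sched/ext.c. */
enum toy_dispatch_result {
	TOY_RAN,          /* task got the CPU right away */
	TOY_QUEUED_LOCAL, /* task is waiting on the local DSQ */
	TOY_REENQUEUED,   /* task was handed back to the BPF scheduler */
};

struct toy_cpu {
	bool running_other;        /* some task already owns the CPU */
	bool higher_class_pending; /* e.g. an RT task is about to take the CPU */
	int nr_local_queued;       /* tasks already waiting on the local DSQ */
};

/* Simplified availability check: the dispatched task would run next with
 * nothing ahead of it. */
static bool toy_cpu_is_open(const struct toy_cpu *cpu)
{
	return !cpu->running_other && !cpu->higher_class_pending &&
	       cpu->nr_local_queued == 0;
}

static enum toy_dispatch_result toy_dispatch(struct toy_cpu *cpu, bool immed)
{
	assert(cpu->nr_local_queued >= 0);

	if (toy_cpu_is_open(cpu)) {
		cpu->running_other = true; /* the task is now on the CPU */
		return TOY_RAN;
	}
	if (immed)
		return TOY_REENQUEUED; /* IMMED never lingers behind others */

	cpu->nr_local_queued++; /* a plain dispatch may wait in line */
	return TOY_QUEUED_LOCAL;
}
```

In the model, a second IMMED dispatch to a CPU that is already busy comes
back as TOY_REENQUEUED, while a plain dispatch to the same CPU piles onto
the local DSQ. The model ignores locking and the timing races that a
kernel-side check made at dispatch time exists to close.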

The protection is persistent - it survives SAVE/RESTORE cycles, slice
extensions and higher-priority class preemptions. If an IMMED task is
preempted while running, it gets reenqueued through ops.enqueue() with
SCX_TASK_REENQ_PREEMPTED instead of being silently put back on the local DSQ.

This also enables opportunistic CPU sharing across sub-schedulers. Without
IMMED, a sub-scheduler can stuff the local DSQ of a shared CPU, making it
difficult for others to use. With IMMED, tasks only stay on a CPU when they
can actually run, keeping CPUs available for other schedulers.

Patches 1-2 are prep refactoring. Patch 3 implements SCX_ENQ_IMMED. Patches
4-5 plumb enq_flags through the consume and move_to_local paths so IMMED
works on those paths too. Patch 6 adds SCX_OPS_ALWAYS_ENQ_IMMED.

v2: - Split prep patches out of the main IMMED patch (#1, #2).
    - Rewrite is_curr_done() as rq_is_open() using rq->next_class and
      implement wakeup_preempt_scx() for complete higher-class preemption
      coverage (#3).
    - Track IMMED persistently in p->scx.flags and reenqueue
      preempted-while-running tasks through ops.enqueue() (#3).
    - Drop "disallow setting slice to zero" patch - no longer needed with
      the rq_is_open() approach.
    - Plumb enq_flags through consume and move_to_local paths (#4, #5).
    - Cover scx_bpf_dsq_move_to_local() in OPS_ALWAYS_IMMED (#6).
    - Remove obsolete sched_switch tracepoint and cpu_release handlers
      from scx_qmap, add IMMED stress test (#6) (Andrea Righi).

v1: https://lore.kernel.org/r/20260307002817.1298341-1-tj@kernel.org

Based on sched_ext/for-7.1 (bd377af09701).

 0001-sched_ext-Split-task_should_reenq-into-local-and-use.patch
 0002-sched_ext-Add-scx_vet_enq_flags-and-plumb-dsq_id-int.patch
 0003-sched_ext-Implement-SCX_ENQ_IMMED.patch
 0004-sched_ext-Plumb-enq_flags-through-the-consume-path.patch
 0005-sched_ext-Add-enq_flags-to-scx_bpf_dsq_move_to_local.patch
 0006-sched_ext-Add-SCX_OPS_ALWAYS_ENQ_IMMED-ops-flag.patch

Git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-enq-immed-v2

 include/linux/sched/ext.h                |   5 +
 kernel/sched/ext.c                       | 350 +++++++++++++++++++++++++++----
 kernel/sched/ext_internal.h              |  56 ++++-
 kernel/sched/sched.h                     |   2 +
 tools/sched_ext/include/scx/compat.bpf.h |  20 +-
 tools/sched_ext/include/scx/compat.h     |   1 +
 tools/sched_ext/scx_central.bpf.c        |   4 +-
 tools/sched_ext/scx_cpu0.bpf.c           |   2 +-
 tools/sched_ext/scx_flatcg.bpf.c         |   6 +-
 tools/sched_ext/scx_qmap.bpf.c           |  70 +++----
 tools/sched_ext/scx_qmap.c               |  13 +-
 tools/sched_ext/scx_sdt.bpf.c            |   2 +-
 tools/sched_ext/scx_simple.bpf.c         |   2 +-
 13 files changed, 435 insertions(+), 98 deletions(-)

--
tejun