From: David Carlier <devnexen@gmail.com>
To: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>
Cc: linux-kernel@vger.kernel.org, David Carlier <devnexen@gmail.com>
Subject: [PATCH] sched_ext: add unlikely() hints in do_enqueue_task() hot path
Date: Thu, 26 Feb 2026 17:50:15 +0000
Message-ID: <20260226175019.40449-1-devnexen@gmail.com>

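Add unlikely() branch hints to the checks in do_enqueue_task() that
are rarely true during normal operation: offline CPU, bypass mode,
exiting task, and migration-disabled task. The PF_EXITING test already
carried a hint; extend it to cover the whole condition, including the
SCX_OPS_ENQ_EXITING flag test.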

Signed-off-by: David Carlier <devnexen@gmail.com>
---
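For readers less familiar with the hint, here is a minimal userspace
sketch of what unlikely() does. The macro bodies mirror the kernel's
definitions in include/linux/compiler.h; pick_path(), enum enq_path and
self_check() are hypothetical stand-ins for illustration only and are
not part of this patch or of ext.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the kernel macros: normalize to 0/1, then tell the
 * compiler which value is expected so the rare path can be moved
 * out of the straight-line code. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

/* Hypothetical labels for where an enqueue ends up. */
enum enq_path { ENQ_LOCAL, ENQ_BYPASS, ENQ_BPF };

/* Hypothetical reduction of the first two checks in do_enqueue_task():
 * the hints change code layout only, never the result. */
static enum enq_path pick_path(bool rq_online, bool bypassing)
{
	if (unlikely(!rq_online))
		return ENQ_LOCAL;
	if (unlikely(bypassing))
		return ENQ_BYPASS;
	return ENQ_BPF;
}

/* The hinted branches must still behave like plain conditions. */
static void self_check(void)
{
	assert(pick_path(true, false) == ENQ_BPF);
	assert(pick_path(false, false) == ENQ_LOCAL);
	assert(pick_path(true, true) == ENQ_BYPASS);
}
```

Calling self_check() from a main() built with GCC or Clang should pass
silently. Whether the hint yields a measurable gain depends on the
compiler and the workload, so treat this as an illustration of the
semantics rather than a benchmark.
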
 CLAUDE.md          | 158 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/ext.c |  12 ++--
 2 files changed, 164 insertions(+), 6 deletions(-)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000000..e892eeea804e
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,158 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Important Rules
+
+- **Do NOT modify code directly.** Only analyze, explain, and suggest changes. The user writes all code themselves.
+
+## Repository Overview
+
+This is the Linux kernel tree with the **sched_ext** subsystem — a BPF-based extensible scheduler class that allows scheduling policies to be implemented as BPF programs and loaded/unloaded at runtime. The kernel falls back to the default fair-class scheduler on any error or when the BPF scheduler exits.
+
+## Build Commands
+
+### Kernel (requires CONFIG_SCHED_CLASS_EXT=y)
+```bash
+# Required Kconfig options:
+# CONFIG_BPF=y CONFIG_SCHED_CLASS_EXT=y CONFIG_BPF_SYSCALL=y
+# CONFIG_BPF_JIT=y CONFIG_DEBUG_INFO_BTF=y
+make -j$(nproc)
+```
+
+### Example BPF schedulers (tools/sched_ext/)
+```bash
+make -j$(nproc) -C tools/sched_ext          # build all
+make -C tools/sched_ext scx_simple           # build one scheduler
+make -C tools/sched_ext clean
+```
+Output goes to `tools/sched_ext/build/bin/`. Requires clang >= 16, pahole >= 1.25. The build auto-generates `vmlinux.h` from the first available vmlinux (kernel tree root, `/sys/kernel/btf/vmlinux`, or `/boot/vmlinux-$(uname -r)`).
+
+### Selftests (tools/testing/selftests/sched_ext/)
+```bash
+make -j$(nproc) -C tools/testing/selftests/sched_ext
+# Run all tests:
+tools/testing/selftests/sched_ext/runner
+```
+
+## Architecture
+
+### Kernel-side (kernel/sched/)
+- **`ext.c`** — Core sched_ext implementation: BPF scheduler loading/unloading, dispatch queue (DSQ) management, all `scx_bpf_*` kfunc helpers callable from BPF
+- **`ext_idle.c`** — Built-in idle CPU tracking and selection (per-node/global idle cpumasks)
+- **`ext_internal.h`** — Internal data structures: `struct scx_dispatch_q`, task states, exit codes, config flags
+- **`ext.h`** — Kernel-internal header with scheduler hook declarations (`scx_tick`, `scx_enqueue`, etc.) and no-op stubs when `CONFIG_SCHED_CLASS_EXT` is disabled
+
+### Public header
+- **`include/linux/sched/ext.h`** — Defines `struct sched_ext_ops` (the BPF struct_ops table), `struct sched_ext_entity` (per-task state), and all constants/flags
+
+### BPF scheduler interface
+BPF schedulers implement callbacks in `struct sched_ext_ops` via `SEC(".struct_ops")`. Key callbacks: `select_cpu`, `enqueue`, `dequeue`, `dispatch`, `init`, `exit`. The kernel communicates with BPF through kfuncs prefixed `scx_bpf_*` (e.g., `scx_bpf_dsq_insert()`, `scx_bpf_select_cpu_dfl()`, `scx_bpf_pick_idle_cpu()`).
+
+### Dispatch Queues (DSQs)
+Central abstraction bridging the scheduler core and BPF:
+- `SCX_DSQ_GLOBAL` — Global FIFO queue
+- `SCX_DSQ_LOCAL` / `SCX_DSQ_LOCAL_ON | cpu` — Per-CPU local queues
+- Custom DSQs created with `scx_bpf_create_dsq()`
+
+A CPU runs tasks from its local DSQ; if empty, it pulls from the global DSQ, then calls `ops.dispatch()`.
+
+### Example schedulers (tools/sched_ext/)
+Each scheduler is a pair: `scx_foo.bpf.c` (BPF program) + `scx_foo.c` (userspace loader). Available schedulers: `scx_simple`, `scx_qmap`, `scx_central`, `scx_flatcg`, `scx_pair`, `scx_sdt`, `scx_cpu0`, `scx_userland`.
+
+Shared headers live in `tools/sched_ext/include/scx/`:
+- `common.bpf.h` — BPF kfunc declarations, helper macros
+- `common.h` — Userspace utilities (loading, stats printing)
+- `compat.bpf.h` / `compat.h` — Cross-kernel-version compatibility
+- `user_exit_info.h` / `user_exit_info.bpf.h` — Exit info shared between BPF and userspace
+
+### Selftest framework (tools/testing/selftests/sched_ext/)
+Tests follow a `*.bpf.c` + `*.c` pair pattern. Each test registers via `REGISTER_SCX_TEST()` (ELF constructor) and implements `setup`/`run`/`cleanup` returning `SCX_TEST_PASS`/`SCX_TEST_SKIP`/`SCX_TEST_FAIL`. The `runner` binary aggregates and executes all registered tests. Assertion macros: `SCX_FAIL_IF`, `SCX_EQ`, `SCX_GT`, `SCX_GE`, `SCX_LT`, `SCX_LE`, `SCX_ASSERT`.
+
+## Key Conventions
+
+- BPF struct_ops callbacks use `BPF_STRUCT_OPS()` / `BPF_STRUCT_OPS_SLEEPABLE()` macros
+- The sched_ext ABI between kernel and BPF schedulers has **no stability guarantees** across kernel versions
+- Schedulers must be compiled with `-target bpf` and linked through bpftool skeleton generation (`.bpf.c` → `.bpf.o` → `.bpf.skel.h`)
+- CFLAGS include `-Wall -Werror` for both tools and selftests
+- Production-ready schedulers live in the separate [sched-ext/scx](https://github.com/sched-ext/scx) repository; the in-tree ones are examples
+- Commit messages must include a `Signed-off-by:` line (use `git commit -s`)
+
+## Known Bugs in tools/sched_ext/
+
+### `common.h` `SCX_BUG` reads errno after fprintf
+The `SCX_BUG` macro calls `fprintf` before checking `errno`, but `fprintf` itself may clobber `errno`. The value should be saved before the first `fprintf` call.
+
+### `scx_simple` / `scx_cpu0` VLA in `read_stats`
+`__u64 cnts[2][nr_cpus]` on the stack; problematic at very high CPU counts (e.g. 4096+ CPUs = 64 KB stack).
+
+## Submitted Patches (pending upstream review)
+
+### `scx_idle_init_masks()` NUMA OOB fix
+`scx_idle_node_masks` was allocated with `num_possible_nodes()` (count) but indexed by node IDs via `for_each_node()`. On non-contiguous NUMA topologies, node IDs can exceed the array size. Fixed by allocating with `nr_node_ids`. Branch: `numa_id_alloc_fix`.
+
+### `sched_ext_entity` cache line layout optimization
+Reordered `ops_state`, `ddsp_dsq_id`, and `ddsp_enq_flags` to sit immediately after `dsq` in `struct sched_ext_entity` (`include/linux/sched/ext.h`). These fields are accessed together in the `do_enqueue_task()` and `finish_dispatch()` hot paths but were previously spread across three different cache lines. Branch: `sched_ext_entity_layout_upd`.
+
+### TOCTOU on `p->scx.dsq` in `scx_dump_task()` fix
+Used `READ_ONCE()` to capture `p->scx.dsq` into a local variable before dereferencing, preventing another CPU from NULLing the pointer between check and use. Branch: `scx_dump_concur_fix`.
+
+### `SCX_EFLAG_INITIALIZED` no-op flag fix
+`SCX_EFLAG_INITIALIZED` in `enum scx_exit_flags` defaulted to 0, making the `|=` in `scx_ops_init()` a no-op. BPF schedulers could not distinguish whether `ops.init()` completed. Assigned `1LLU << 0`. Branch: `SCX_EFLAG_INITIALIZED_value`.
+
+### Direct `scx_root` dereference without RCU in dump paths fix
+`scx_dump_task()` and `scx_dump_state()` now use `rcu_dereference()` to read `scx_root` under RCU protection, with an early return if NULL, preventing NULL-deref during concurrent scheduler teardown. Branch: `scx_dump_concur_fix`.
+
+## Analyzing Struct Cache Line Layouts with pahole
+
+To verify cache line placement of struct fields (e.g. when reviewing or proposing layout optimizations), use `pahole` on a compiled `.o` file from the kernel tree.
+
+### Setup
+```bash
+# Need: pahole (from dwarves package), libdw-dev, CONFIG_DEBUG_INFO_DWARF5=y
+make defconfig
+scripts/config --enable CONFIG_SCHED_CLASS_EXT --enable CONFIG_DEBUG_INFO \
+    --enable CONFIG_DEBUG_INFO_DWARF5 --enable CONFIG_SCHED_CORE \
+    --enable CONFIG_EXT_GROUP_SCHED
+make olddefconfig
+```
+
+### Build a single .o and inspect
+```bash
+make prepare -j$(nproc)
+# ext.c may fail with newer GCC; core.o also includes sched_ext_entity
+make kernel/sched/core.o -j$(nproc)
+pahole -C sched_ext_entity kernel/sched/core.o
+```
+
+### Before/after comparison workflow
+1. Build the `.o` on the current branch, save pahole output
+2. Checkout the patched header (`git checkout <branch> -- include/linux/sched/ext.h`)
+3. Rebuild the same `.o`, run pahole again
+4. Restore with `git checkout master -- include/linux/sched/ext.h`
+
+The output shows field offsets, sizes, and cacheline boundaries — look for hot-path fields that cross `/* --- cacheline N boundary --- */` markers.
+
+### Running kernel (alternative)
+```bash
+# If the struct exists in the running kernel's BTF:
+pahole -C sched_ext_entity /sys/kernel/btf/vmlinux
+```
+
+## Identified Optimization Opportunities
+
+### `struct scx_dispatch_q` false sharing (`include/linux/sched/ext.h`)
+`lock` (write-heavy) and `first_task` (read-mostly, lockless RCU peek) share the same cache line. Separating them with `____cacheline_aligned_in_smp` would eliminate false sharing on dispatch.
+
+### Repeated `idle_cpumask(node)` indirection (`kernel/sched/ext_idle.c`)
+Multiple calls within the same function re-evaluate the conditional pointer dereference; result should be cached in a local variable.
+
+### O(N^2) NUMA node traversal in `pick_idle_cpu_from_online_nodes()` (`kernel/sched/ext_idle.c`)
+Pre-computing per-CPU distance-ordered node arrays at init time would reduce this to O(N).
+
+### `flush_dispatch_buf` lock cycling (`kernel/sched/ext.c`)
+Each buffered task dispatched to a remote local DSQ causes a separate rq lock release/acquire cycle; batching by destination CPU would amortize lock overhead.
+
+### Missing `__always_inline` on hot helpers (`kernel/sched/ext_idle.c`)
+`idle_cpumask()`, `scx_cpu_node_if_enabled()`, `task_affinity_all()` are `static inline` but not `__always_inline`.
+
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c18e81e8ef51..1048bb9934c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1360,10 +1360,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 * is offline and are just running the hotplug path. Don't bother the
 	 * BPF scheduler.
 	 */
-	if (!scx_rq_online(rq))
+	if (unlikely(!scx_rq_online(rq)))
 		goto local;
 
-	if (scx_rq_bypassing(rq)) {
+	if (unlikely(scx_rq_bypassing(rq))) {
 		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
 		goto bypass;
 	}
@@ -1372,15 +1372,15 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 		goto direct;
 
 	/* see %SCX_OPS_ENQ_EXITING */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
-	    unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+	    (p->flags & PF_EXITING))) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
 		goto local;
 	}
 
 	/* see %SCX_OPS_ENQ_MIGRATION_DISABLED */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
-	    is_migration_disabled(p)) {
+	if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
+	    is_migration_disabled(p))) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED, 1);
 		goto local;
 	}
-- 
2.51.0

