From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Carlier
To: Tejun Heo, David Vernet
Cc: linux-kernel@vger.kernel.org, David Carlier
Subject: [PATCH] sched_ext: add unlikely() hints in do_enqueue_task() hot path
Date: Thu, 26 Feb 2026 17:50:15 +0000
Message-ID: <20260226175019.40449-1-devnexen@gmail.com>
X-Mailer: git-send-email 2.51.0
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add unlikely() branch hints to the error/bypass checks in
do_enqueue_task() that are rarely taken during normal operation:
offline CPU, bypass mode, exiting task, and migration-disabled task.

Signed-off-by: David Carlier
---
 CLAUDE.md          | 158 +++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/ext.c |  12 ++--
 2 files changed, 164 insertions(+), 6 deletions(-)
 create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000000..e892eeea804e
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,158 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Important Rules
+
+- **Do NOT modify code directly.** Only analyze, explain, and suggest changes. The user writes all code themselves.
+
+## Repository Overview
+
+This is the Linux kernel tree with the **sched_ext** subsystem — a BPF-based extensible scheduler class that allows scheduling policies to be implemented as BPF programs and loaded/unloaded at runtime. The kernel falls back to the default fair-class scheduler on any error or when the BPF scheduler exits.
+
+## Build Commands
+
+### Kernel (requires CONFIG_SCHED_CLASS_EXT=y)
+```bash
+# Required Kconfig options:
+# CONFIG_BPF=y CONFIG_SCHED_CLASS_EXT=y CONFIG_BPF_SYSCALL=y
+# CONFIG_BPF_JIT=y CONFIG_DEBUG_INFO_BTF=y
+make -j$(nproc)
+```
+
+### Example BPF schedulers (tools/sched_ext/)
+```bash
+make -j$(nproc) -C tools/sched_ext # build all
+make -C tools/sched_ext scx_simple # build one scheduler
+make -C tools/sched_ext clean
+```
+Output goes to `tools/sched_ext/build/bin/`. Requires clang >= 16, pahole >= 1.25. The build auto-generates `vmlinux.h` from the first available vmlinux (kernel tree root, `/sys/kernel/btf/vmlinux`, or `/boot/vmlinux-$(uname -r)`).
+
+### Selftests (tools/testing/selftests/sched_ext/)
+```bash
+make -j$(nproc) -C tools/testing/selftests/sched_ext
+# Run all tests:
+tools/testing/selftests/sched_ext/runner
+```
+
+## Architecture
+
+### Kernel-side (kernel/sched/)
+- **`ext.c`** — Core sched_ext implementation: BPF scheduler loading/unloading, dispatch queue (DSQ) management, all `scx_bpf_*` kfunc helpers callable from BPF
+- **`ext_idle.c`** — Built-in idle CPU tracking and selection (per-node/global idle cpumasks)
+- **`ext_internal.h`** — Internal data structures: `struct scx_dispatch_q`, task states, exit codes, config flags
+- **`ext.h`** — Kernel-internal header with scheduler hook declarations (`scx_tick`, `scx_enqueue`, etc.) and no-op stubs when `CONFIG_SCHED_CLASS_EXT` is disabled
+
+### Public header
+- **`include/linux/sched/ext.h`** — Defines `struct sched_ext_ops` (the BPF struct_ops table), `struct sched_ext_entity` (per-task state), and all constants/flags
+
+### BPF scheduler interface
+BPF schedulers implement callbacks in `struct sched_ext_ops` via `SEC(".struct_ops")`. Key callbacks: `select_cpu`, `enqueue`, `dequeue`, `dispatch`, `init`, `exit`. The kernel communicates with BPF through kfuncs prefixed `scx_bpf_*` (e.g., `scx_bpf_dsq_insert()`, `scx_bpf_select_cpu_dfl()`, `scx_bpf_pick_idle_cpu()`).
+
+### Dispatch Queues (DSQs)
+Central abstraction bridging the scheduler core and BPF:
+- `SCX_DSQ_GLOBAL` — Global FIFO queue
+- `SCX_DSQ_LOCAL` / `SCX_DSQ_LOCAL_ON | cpu` — Per-CPU local queues
+- Custom DSQs created with `scx_bpf_create_dsq()`
+
+A CPU runs tasks from its local DSQ; if empty, it pulls from the global DSQ, then calls `ops.dispatch()`.
+
+### Example schedulers (tools/sched_ext/)
+Each scheduler is a pair: `scx_foo.bpf.c` (BPF program) + `scx_foo.c` (userspace loader). Available schedulers: `scx_simple`, `scx_qmap`, `scx_central`, `scx_flatcg`, `scx_pair`, `scx_sdt`, `scx_cpu0`, `scx_userland`.
+
+Shared headers live in `tools/sched_ext/include/scx/`:
+- `common.bpf.h` — BPF kfunc declarations, helper macros
+- `common.h` — Userspace utilities (loading, stats printing)
+- `compat.bpf.h` / `compat.h` — Cross-kernel-version compatibility
+- `user_exit_info.h` / `user_exit_info.bpf.h` — Exit info shared between BPF and userspace
+
+### Selftest framework (tools/testing/selftests/sched_ext/)
+Tests follow a `*.bpf.c` + `*.c` pair pattern. Each test registers via `REGISTER_SCX_TEST()` (ELF constructor) and implements `setup`/`run`/`cleanup` returning `SCX_TEST_PASS`/`SCX_TEST_SKIP`/`SCX_TEST_FAIL`. The `runner` binary aggregates and executes all registered tests. Assertion macros: `SCX_FAIL_IF`, `SCX_EQ`, `SCX_GT`, `SCX_GE`, `SCX_LT`, `SCX_LE`, `SCX_ASSERT`.
+
+## Key Conventions
+
+- BPF struct_ops callbacks use `BPF_STRUCT_OPS()` / `BPF_STRUCT_OPS_SLEEPABLE()` macros
+- The sched_ext ABI between kernel and BPF schedulers has **no stability guarantees** across kernel versions
+- Schedulers must be compiled with `-target bpf` and linked through bpftool skeleton generation (`.bpf.c` → `.bpf.o` → `.bpf.skel.h`)
+- CFLAGS include `-Wall -Werror` for both tools and selftests
+- Production-ready schedulers live in the separate [sched-ext/scx](https://github.com/sched-ext/scx) repository; the in-tree ones are examples
+- Commit messages must include a `Signed-off-by:` line (use `git commit -s`)
+
+## Known Bugs in tools/sched_ext/
+
+### `common.h` `SCX_BUG` reads errno after fprintf
+The `SCX_BUG` macro calls `fprintf` before checking `errno`, but `fprintf` itself may clobber `errno`. The value should be saved before the first `fprintf` call.
+
+### `scx_simple` / `scx_cpu0` VLA in `read_stats`
+`__u64 cnts[2][nr_cpus]` on the stack; problematic at very high CPU counts (e.g. 4096+ CPUs = 64 KB stack).
+
+## Submitted Patches (pending upstream review)
+
+### `scx_idle_init_masks()` NUMA OOB fix
+`scx_idle_node_masks` was allocated with `num_possible_nodes()` (count) but indexed by node IDs via `for_each_node()`. On non-contiguous NUMA topologies, node IDs can exceed the array size. Fixed by allocating with `nr_node_ids`. Branch: `numa_id_alloc_fix`.
+
+### `sched_ext_entity` cache line layout optimization
+Reordered `ops_state`, `ddsp_dsq_id`, and `ddsp_enq_flags` to sit immediately after `dsq` in `struct sched_ext_entity` (`include/linux/sched/ext.h`). These fields are accessed together in the `do_enqueue_task()` and `finish_dispatch()` hot paths but were previously spread across three different cache lines. Branch: `sched_ext_entity_layout_upd`.
+
+### TOCTOU on `p->scx.dsq` in `scx_dump_task()` fix
+Used `READ_ONCE()` to capture `p->scx.dsq` into a local variable before dereferencing, preventing another CPU from NULLing the pointer between check and use. Branch: `scx_dump_concur_fix`.
+
+### `SCX_EFLAG_INITIALIZED` no-op flag fix
+`SCX_EFLAG_INITIALIZED` in `enum scx_exit_flags` defaulted to 0, making the `|=` in `scx_ops_init()` a no-op. BPF schedulers could not distinguish whether `ops.init()` completed. Assigned `1LLU << 0`. Branch: `SCX_EFLAG_INITIALIZED_value`.
+
+### Direct `scx_root` dereference without RCU in dump paths fix
+`scx_dump_task()` and `scx_dump_state()` now use `rcu_dereference()` to read `scx_root` under RCU protection, with an early return if NULL, preventing NULL-deref during concurrent scheduler teardown. Branch: `scx_dump_concur_fix`.
+
+## Analyzing Struct Cache Line Layouts with pahole
+
+To verify cache line placement of struct fields (e.g. when reviewing or proposing layout optimizations), use `pahole` on a compiled `.o` file from the kernel tree.
+
+### Setup
+```bash
+# Need: pahole (from dwarves package), libdw-dev, CONFIG_DEBUG_INFO_DWARF5=y
+make defconfig
+scripts/config --enable CONFIG_SCHED_CLASS_EXT --enable CONFIG_DEBUG_INFO \
+  --enable CONFIG_DEBUG_INFO_DWARF5 --enable CONFIG_SCHED_CORE \
+  --enable CONFIG_EXT_GROUP_SCHED
+make olddefconfig
+```
+
+### Build a single .o and inspect
+```bash
+make prepare -j$(nproc)
+# ext.c may fail with newer GCC; core.o also includes sched_ext_entity
+make kernel/sched/core.o -j$(nproc)
+pahole -C sched_ext_entity kernel/sched/core.o
+```
+
+### Before/after comparison workflow
+1. Build the `.o` on the current branch, save pahole output
+2. Checkout the patched header (`git checkout -- include/linux/sched/ext.h`)
+3. Rebuild the same `.o`, run pahole again
+4. Restore with `git checkout master -- include/linux/sched/ext.h`
+
+The output shows field offsets, sizes, and cacheline boundaries — look for hot-path fields that cross `/* --- cacheline N boundary --- */` markers.
+
+### Running kernel (alternative)
+```bash
+# If the struct exists in the running kernel's BTF:
+pahole -C sched_ext_entity /sys/kernel/btf/vmlinux
+```
+
+## Identified Optimization Opportunities
+
+### `struct scx_dispatch_q` false sharing (`include/linux/sched/ext.h`)
+`lock` (write-heavy) and `first_task` (read-mostly, lockless RCU peek) share the same cache line. Separating them with `____cacheline_aligned_in_smp` would eliminate false sharing on dispatch.
+
+### Repeated `idle_cpumask(node)` indirection (`kernel/sched/ext_idle.c`)
+Multiple calls within the same function re-evaluate the conditional pointer dereference; result should be cached in a local variable.
+
+### O(N^2) NUMA node traversal in `pick_idle_cpu_from_online_nodes()` (`kernel/sched/ext_idle.c`)
+Pre-computing per-CPU distance-ordered node arrays at init time would reduce this to O(N).
+
+### `flush_dispatch_buf` lock cycling (`kernel/sched/ext.c`)
+Each buffered task dispatched to a remote local DSQ causes a separate rq lock release/acquire cycle; batching by destination CPU would amortize lock overhead.
+
+### Missing `__always_inline` on hot helpers (`kernel/sched/ext_idle.c`)
+`idle_cpumask()`, `scx_cpu_node_if_enabled()`, `task_affinity_all()` are `static inline` but not `__always_inline`.
+
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c18e81e8ef51..1048bb9934c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1360,10 +1360,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 	 * is offline and are just running the hotplug path. Don't bother the
 	 * BPF scheduler.
 	 */
-	if (!scx_rq_online(rq))
+	if (unlikely(!scx_rq_online(rq)))
 		goto local;
 
-	if (scx_rq_bypassing(rq)) {
+	if (unlikely(scx_rq_bypassing(rq))) {
 		__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
 		goto bypass;
 	}
@@ -1372,15 +1372,15 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
 		goto direct;
 
 	/* see %SCX_OPS_ENQ_EXITING */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
-	    unlikely(p->flags & PF_EXITING)) {
+	if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+		     p->flags & PF_EXITING)) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
 		goto local;
 	}
 
 	/* see %SCX_OPS_ENQ_MIGRATION_DISABLED */
-	if (!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
-	    is_migration_disabled(p)) {
+	if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
+		     is_migration_disabled(p))) {
 		__scx_add_event(sch, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED, 1);
 		goto local;
 	}
-- 
2.51.0