* [PATCHSET RESEND sched_ext/for-7.2] sched_ext: cmask improvements
@ 2026-05-17 18:36 Tejun Heo
  2026-05-17 18:36 ` [PATCH 1/3] sched_ext: Rename scx_cmask.nr_bits to nr_cids Tejun Heo
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Tejun Heo @ 2026-05-17 18:36 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min
  Cc: sched-ext, Emil Tsalapatis, linux-kernel, Tejun Heo

Hello,

Resend with the correct cover letter. The earlier posting
(20260517181022.1184056-1-tj@kernel.org) went out under a stale cover
from an unrelated draft; the three patches themselves were the right
ones and are unchanged in this resend. Apologies for the noise.

Three patches for cmask: tidy active-range bookkeeping and add the
mask-on-mask op helpers the sub-sched series will use.

Not backward-compat with the current scx_cmask layout/API, but cmask
landed in for-7.2 and hasn't been released; scx_qmap is the only user.
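
For readers unfamiliar with the distinction the patch titles draw, here is a
minimal userspace sketch of a mask that tracks its active range (nr_cids)
separately from its bits[] storage size, plus one mask-on-mask op. Every name
here (model_cmask, nr_words, model_cmask_and, the fixed 4-word capacity) is
hypothetical and chosen for illustration; the real struct scx_cmask layout
and helper signatures are defined by the patches and may differ.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_WORDS 4	/* hypothetical fixed capacity for the sketch */

/* Hypothetical model, not the real struct scx_cmask. */
struct model_cmask {
	uint32_t nr_cids;	/* active range: bits [0, nr_cids) are meaningful */
	uint32_t nr_words;	/* storage size of bits[] in 64-bit words */
	uint64_t bits[MODEL_MAX_WORDS];
};

/* Reject an active range that would not fit in the declared storage. */
static bool model_cmask_init(struct model_cmask *m, uint32_t nr_cids,
			     uint32_t nr_words)
{
	if (nr_words > MODEL_MAX_WORDS || nr_cids > nr_words * 64)
		return false;
	m->nr_cids = nr_cids;
	m->nr_words = nr_words;
	for (size_t i = 0; i < MODEL_MAX_WORDS; i++)
		m->bits[i] = 0;
	return true;
}

/* Bits outside the active range always read as clear. */
static bool model_cmask_test(const struct model_cmask *m, uint32_t cid)
{
	return cid < m->nr_cids && ((m->bits[cid / 64] >> (cid % 64)) & 1);
}

static bool model_cmask_set(struct model_cmask *m, uint32_t cid)
{
	if (cid >= m->nr_cids)
		return false;
	m->bits[cid / 64] |= 1ULL << (cid % 64);
	return true;
}

/* dst &= src; cids beyond src's active range count as clear in src. */
static void model_cmask_and(struct model_cmask *dst,
			    const struct model_cmask *src)
{
	for (uint32_t cid = 0; cid < dst->nr_cids; cid++)
		if (!model_cmask_test(src, cid))
			dst->bits[cid / 64] &= ~(1ULL << (cid % 64));
}
```

The per-bit loop is deliberately simple; a real implementation would
presumably operate word-at-a-time.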

 0001 - sched_ext: Rename scx_cmask.nr_bits to nr_cids
 0002 - sched_ext: Track bits[] storage size in struct scx_cmask
 0003 - sched_ext: Add cmask mask ops

Based on sched_ext/for-7.2 (c9017d335aab).

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git cmask-prep

 kernel/sched/ext_cid.c                | 307 +++++++++++++++++++++++++++++++++-
 kernel/sched/ext_cid.h                |  71 +++++++-
 kernel/sched/ext_types.h              |  64 +++++--
 tools/sched_ext/include/scx/cid.bpf.h | 117 +++++++++----
 4 files changed, 506 insertions(+), 53 deletions(-)

Thanks.

--
tejun

* [PATCHSET v2 INTERNAL] bpf/arena: Direct kernel-side access
@ 2026-05-17 18:10 Tejun Heo
  2026-05-17 18:10 ` [PATCH 1/3] sched_ext: Rename scx_cmask.nr_bits to nr_cids Tejun Heo
  0 siblings, 1 reply; 14+ messages in thread
From: Tejun Heo @ 2026-05-17 18:10 UTC (permalink / raw)
  To: David Vernet, Andrea Righi, Changwoo Min, Kumar Kartikeya Dwivedi,
	Alexei Starovoitov, Emil Tsalapatis
  Cc: sched-ext, linux-kernel, Tejun Heo

Hello,

Internal preview of v2 before public posting. Recipients are only the
three of you for review.

Motivation
----------

sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct
scx_cmask *. The kernel translates a kernel cpumask to a cmask, but it
had no way to write into the arena, so the cmask lived in kernel memory
and was passed as a trusted pointer. BPF cmask helpers all operate on
arena cmasks though, so the BPF side had to word-by-word probe-read the
kernel cmask into an arena cmask via cmask_copy_from_kernel() before
any helper could touch it. It works, but is clumsy.

The shape isn't unique to set_cmask. Sub-scheduler support is on the
way and more sched_ext callbacks will want to pass structured data to
BPF. Anywhere a kfunc or struct_ops callback wants to hand a struct to
a BPF program, arena residence is the natural answer.

Approach
--------

Each arena gets a per-arena scratch page. Arenas stay sparsely mapped
as today - PTEs are populated only for allocated pages. A new arch
fault hook (bpf_arena_handle_page_fault) is wired into x86
page_fault_oops() and arm64 __do_kernel_fault(), after KFENCE. When a
kernel-side access faults inside an arena's kern_vm range, the helper
walks the stack to find the BPF program responsible, range-checks the
fault address against prog->aux->arena, and atomically installs the
scratch page into the empty PTE via a new ptep_try_install() wrapper.
The kernel instruction then retries and reads/writes the scratch page.
Real allocations naturally overwrite scratch PTEs; free paths and map
destruction treat scratch as non-owned.
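
The install-if-empty step can be modelled in userspace with C11 atomics. This
is only a sketch of the idea: model_pte_t, model_try_install() and
model_handle_fault() are hypothetical names, a 64-bit integer stands in for a
PTE, and the stack walk and violation report are omitted. It also assumes
that losing the install race still counts as a recovered fault, since the
slot is populated either way.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096ULL
#define MODEL_PTE_NONE	0ULL	/* an empty slot in this model */

/* Hypothetical stand-in for a page-table entry. */
typedef _Atomic uint64_t model_pte_t;

/*
 * Install new_val only if the slot is still empty. Returns true if this
 * caller installed it, false if the slot was already populated.
 */
static bool model_try_install(model_pte_t *slot, uint64_t new_val)
{
	uint64_t expected = MODEL_PTE_NONE;

	return atomic_compare_exchange_strong(slot, &expected, new_val);
}

/*
 * Model of the fault path: range-check the faulting address against the
 * arena, then try to install the scratch value into the matching slot.
 * Returns false for addresses outside the arena (not ours to recover).
 */
static bool model_handle_fault(model_pte_t *ptes, size_t nr_ptes,
			       uint64_t base, uint64_t addr, uint64_t scratch)
{
	if (addr < base || addr - base >= (uint64_t)nr_ptes * MODEL_PAGE_SIZE)
		return false;

	/* Losing the race is fine: some mapping is present on retry. */
	model_try_install(&ptes[(addr - base) / MODEL_PAGE_SIZE], scratch);
	return true;
}
```

Because the install is a single compare-and-exchange against the empty value,
it never clobbers a slot that a real allocation has already populated.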

The mechanism is default behavior - no UAPI flag.

What this preserves
-------------------

All the debugging properties of today's sparse-PTE design are
preserved:

* BPF programs still fault on unmapped arena accesses. The fault
  semantics (instruction retry with rdst = 0) and the violation
  report through bpf_streams are unchanged for prog-side accesses.
* The first kernel-side touch of an unmapped address is reported via
  bpf_streams the same way as a prog-side fault, with the stack walk
  attributing it to the originating prog.
* User-side fault semantics are unchanged. arena_vm_fault() treats a
  scratch PTE as absent and lazy-allocates a real page (or returns
  SIGSEGV under BPF_F_SEGV_ON_FAULT) the same as before.
* Repeat kernel-side faults on the same address after a free still
  re-install scratch, so the slot's "bad access" is reported on every
  fresh occurrence; only repeat faults without an intervening free
  are absorbed.

What changes for the kernel-side caller is just that an unmapped
deref no longer oopses - it retries through the scratch page and
emits a violation report, the same shape that today's BPF
instruction faults have.

Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------

  mm: Add ptep_try_install() for lockless empty-slot installs
  bpf: Recover arena kernel faults with scratch page

Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------

  bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
  bpf: Add bpf_struct_ops_for_each_prog()
  bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()

Patches 6-8 (sched_ext: arena auto-discovery, allocator, set_cmask)
-------------------------------------------------------------------

  sched_ext: Require an arena for cid-form schedulers
  sched_ext: Sub-allocator over kernel-claimed BPF arena pages
  sched_ext: Convert ops.set_cmask() to arena-resident cmask

Patch 6 reads each member prog's prog->aux->arena via bpf_prog_arena()
and requires the cid-form struct_ops to reference exactly one arena.
Patch 7 builds a gen_pool sub-allocator inside that arena. Patch 8
converts set_cmask() to write into arena memory; BPF dereferences via
__arena like any other arena struct, no probe-reads.
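
The "exactly one arena" rule from Patch 6 can be sketched as a scan over the
member progs. This is a userspace model only: model_prog, model_arena and
model_find_single_arena() are hypothetical, and the sketch assumes that progs
with no arena are skipped rather than rejected, which the actual patch may
handle differently.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the arena map and a member prog. */
struct model_arena {
	int id;
};

struct model_prog {
	const struct model_arena *arena;	/* NULL if the prog uses no arena */
};

/*
 * Return the one arena referenced by the progs, or NULL if no prog
 * references an arena or if two progs reference different arenas.
 */
static const struct model_arena *
model_find_single_arena(const struct model_prog *progs, size_t nr)
{
	const struct model_arena *found = NULL;

	for (size_t i = 0; i < nr; i++) {
		const struct model_arena *a = progs[i].arena;

		if (!a)
			continue;	/* assumption: arena-less progs are ignored */
		if (found && found != a)
			return NULL;	/* more than one distinct arena */
		found = a;
	}
	return found;
}
```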

v1 -> v2
--------

* Dropped the BPF_F_ARENA_MAP_ALWAYS uapi flag and pre-populated
  garbage page. Replaced by lazy scratch-page install on first
  kernel-side fault.
* Dropped bpf-arena-pte-cb-prep: callbacks read scratch_page from a
  field in their data struct, not via arena indirection.
* Dropped bpf-prog-for-each-used-map: verifier already enforces one
  arena per prog (prog->aux->arena), so iteration over used_maps is
  unnecessary. bpf_prog_arena() exposes the prog's arena directly.
* New ptep_try_install() in <linux/pgtable.h>: the atomic install
  primitive used by the arena fault helper. Generic stub returns
  false; x86 and arm64 override with try_cmpxchg.

Base
----

sched_ext/for-7.2 with for-7.1-fixes merged in (de79a6cb3c3e).

v1 RFC for reference:
https://lore.kernel.org/all/20260427105109.2554518-1-tj@kernel.org/

 arch/arm64/include/asm/pgtable.h      |   8 ++
 arch/arm64/mm/fault.c                 |   4 +
 arch/x86/include/asm/pgtable.h        |   8 ++
 arch/x86/mm/fault.c                   |   5 ++
 include/linux/bpf.h                   |  14 ++++
 include/linux/pgtable.h               |  16 ++++
 kernel/bpf/arena.c                    | 141 +++++++++++++++++++++++++++++++---
 kernel/bpf/bpf_struct_ops.c           |  36 +++++++++
 kernel/bpf/core.c                     |   5 ++
 kernel/sched/build_policy.c           |   4 +
 kernel/sched/ext.c                    | 132 +++++++++++++++++++++++++++++--
 kernel/sched/ext_arena.c              | 128 ++++++++++++++++++++++++++++++
 kernel/sched/ext_arena.h              |  18 +++++
 kernel/sched/ext_cid.c                |  16 +---
 kernel/sched/ext_internal.h           |  24 +++++-
 kernel/sched/ext_types.h              |  10 +++
 tools/sched_ext/include/scx/cid.bpf.h |  52 -------------
 tools/sched_ext/scx_qmap.bpf.c        |   6 +-
 18 files changed, 540 insertions(+), 87 deletions(-)

Thanks.
--
tejun

end of thread, other threads:[~2026-05-19  5:59 UTC | newest]

Thread overview: 14+ messages
2026-05-17 18:36 [PATCHSET RESEND sched_ext/for-7.2] sched_ext: cmask improvements Tejun Heo
2026-05-17 18:36 ` [PATCH 1/3] sched_ext: Rename scx_cmask.nr_bits to nr_cids Tejun Heo
2026-05-17 18:43   ` sashiko-bot
2026-05-17 19:02   ` [PATCH v2 " Tejun Heo
2026-05-17 18:36 ` [PATCH 2/3] sched_ext: Track bits[] storage size in struct scx_cmask Tejun Heo
2026-05-17 19:14   ` sashiko-bot
2026-05-17 19:29   ` [PATCH v2 " Tejun Heo
2026-05-18 22:11     ` Andrea Righi
2026-05-18 22:53       ` Tejun Heo
2026-05-19  5:59         ` Andrea Righi
2026-05-17 18:36 ` [PATCH 3/3] sched_ext: Add cmask mask ops Tejun Heo
2026-05-18 23:58   ` [PATCH v2 " Tejun Heo
  -- strict thread matches above, loose matches on Subject: below --
2026-05-17 18:10 [PATCHSET v2 INTERNAL] bpf/arena: Direct kernel-side access Tejun Heo
2026-05-17 18:10 ` [PATCH 1/3] sched_ext: Rename scx_cmask.nr_bits to nr_cids Tejun Heo
2026-05-17 18:20   ` sashiko-bot
