From: Tejun Heo
To: David Vernet, Andrea Righi, Changwoo Min, Alexei Starovoitov,
    Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
    Kumar Kartikeya Dwivedi
Cc: Peter Zijlstra, Catalin Marinas, Will Deacon, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, Andrew Morton,
    David Hildenbrand, Mike Rapoport, Emil Tsalapatis,
    sched-ext@lists.linux.dev, bpf@vger.kernel.org, x86@kernel.org,
    linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Tejun Heo
Subject: [PATCHSET v3 sched_ext/for-7.2] bpf/arena: Direct kernel-side access
Date: Wed, 20 May 2026 13:50:44 -1000
Message-ID: <20260520235052.4180316-1-tj@kernel.org>

Hello,

This makes BPF arena memory directly dereferenceable from kernel code
(struct_ops callbacks, kfuncs). Each arena gets a per-arena scratch page
that an arch fault hook installs into empty PTEs on kernel-side faults,
after KFENCE.
The faulting instruction retries and the violation is reported through
the program's BPF stream.

v3:
- Patch 1: rename ptep_try_install() to ptep_try_set(). Tighten kerneldoc
  for kernel-PTE use. (David Hildenbrand, Alexei)
- Patch 2: apply_range_clear_cb() uses ptep_get_and_clear() so the install
  and clear sides race through atomic accessors. (David)

v2: https://lore.kernel.org/r/20260517211232.1670594-1-tj@kernel.org
v1 (RFC): https://lore.kernel.org/r/20260427105109.2554518-1-tj@kernel.org

Motivation
----------

sched_ext's ops_cid.set_cmask() hands the BPF scheduler a struct
scx_cmask *. The kernel translates a kernel cpumask to a cmask, but it had
no way to write into the arena, so the cmask lived in kernel memory and
was passed as a trusted pointer. BPF cmask helpers all operate on arena
cmasks though, so the BPF side had to word-by-word probe-read the kernel
cmask into an arena cmask via cmask_copy_from_kernel() before any helper
could touch it. It works, but is clumsy.

The shape isn't unique to set_cmask. Sub-scheduler support is on the way
and more sched_ext callbacks will want to pass structured data to BPF.
Anywhere a kfunc or struct_ops callback wants to hand a struct to a BPF
program, arena residence is the natural answer.

Approach
--------

Each arena gets a per-arena scratch page. Arenas stay sparsely mapped as
today - PTEs are populated only for allocated pages. A new arch fault hook
(bpf_arena_handle_page_fault) is wired into x86 page_fault_oops() and
arm64 __do_kernel_fault(), after KFENCE.

When a kernel-side access faults inside an arena's kern_vm range, the
helper walks the stack to find the BPF program responsible, range-checks
the fault address against prog->aux->arena, and atomically installs the
scratch page into the empty PTE via the new ptep_try_set() wrapper. The
kernel instruction retries and reads/writes the scratch page.

Free paths and map destruction treat scratch as non-owned.
Real allocation refuses to overwrite scratch (apply_range_set_cb() returns
-EBUSY). A scratched address stays dead until map destroy, since its
presence means the BPF program has already malfunctioned.

The mechanism is default behavior - no UAPI flag.

What this preserves
-------------------

All the debugging properties of today's sparse-PTE design are preserved:

* BPF programs still fault on unmapped arena accesses. The fault
  semantics (instruction retry with rdst = 0) and the violation report
  through bpf_streams are unchanged for prog-side accesses.

* The first kernel-side touch of an unmapped address is reported via
  bpf_streams the same way as a prog-side fault, with the stack walk
  attributing it to the originating prog.

* User-side fault on a never-scratched address still lazy-allocates a
  real page (or delivers SIGSEGV under BPF_F_SEGV_ON_FAULT). User-side
  fault on a scratched address SIGSEGVs.

What changes for the kernel-side caller is just that an unmapped deref no
longer oopses - it retries through the scratch page and emits a violation
report. This is the same shape that today's BPF instruction faults have.
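
To illustrate the install-if-empty rule described above, here is a minimal
user-space sketch. It is a model only, not the kernel implementation: the
slot array stands in for the arena's PTEs, and every name in it
(model_arena, slot_try_set, model_kernel_fault, model_alloc) is
hypothetical. It assumes that real allocation fails with -EBUSY on any
non-empty slot, and that a kernel-side fault which finds the slot already
populated simply retries.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
#define SLOT_EMPTY ((uintptr_t)0)

struct model_arena {
	_Atomic uintptr_t slots[8];	/* stand-in for the arena's PTEs */
	uintptr_t scratch;		/* stand-in for the per-arena scratch page */
};

/* Models the idea behind ptep_try_set(): install @val only if still empty. */
static bool slot_try_set(_Atomic uintptr_t *slot, uintptr_t val)
{
	uintptr_t expected = SLOT_EMPTY;

	assert(val != SLOT_EMPTY);
	return atomic_compare_exchange_strong(slot, &expected, val);
}

/*
 * Models a kernel-side fault: back an empty slot with the scratch page.
 * Returns true if the slot is populated afterwards, so a retry can proceed.
 */
static bool model_kernel_fault(struct model_arena *a, unsigned int idx)
{
	if (slot_try_set(&a->slots[idx], a->scratch))
		return true;	/* scratch installed; a violation would be reported */

	/* Assumption: someone else populated the slot first, just retry. */
	return atomic_load(&a->slots[idx]) != SLOT_EMPTY;
}

/* Models real allocation: never overwrites a populated (or scratched) slot. */
static int model_alloc(struct model_arena *a, unsigned int idx, uintptr_t page)
{
	return slot_try_set(&a->slots[idx], page) ? 0 : -EBUSY;
}
```

Because both sides go through the same compare-and-exchange, whichever of
the fault path or the allocator gets there first wins, and the loser sees a
populated slot instead of silently overwriting it.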

Patches 1-2 (atomic PTE install + arena scratch-page recovery)
--------------------------------------------------------------

  mm: Add ptep_try_set() for lockless empty-slot installs
  bpf: Recover arena kernel faults with scratch page

Patches 3-5 (helpers used by struct_ops registration)
-----------------------------------------------------

  bpf: Add sleepable variant of bpf_arena_alloc_pages for kernel callers
  bpf: Add bpf_struct_ops_for_each_prog()
  bpf/arena: Add bpf_arena_map_kern_vm_start() and bpf_prog_arena()

Patches 6-8 (sched_ext: arena auto-discovery, allocator, set_cmask)
-------------------------------------------------------------------

  sched_ext: Require an arena for cid-form schedulers
  sched_ext: Sub-allocator over kernel-claimed BPF arena pages
  sched_ext: Convert ops.set_cmask() to arena-resident cmask

Patch 6 reads each member prog's prog->aux->arena via bpf_prog_arena() and
requires the cid-form struct_ops to reference exactly one arena. Patch 7
builds a gen_pool sub-allocator inside that arena. Patch 8 converts
set_cmask() to write into arena memory; BPF dereferences via __arena like
any other arena struct, no probe-reads.
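
The "exactly one arena" requirement from patch 6 can be sketched roughly as
follows. This is a user-space illustration under stated assumptions, not
the code in the series: demo_arena, demo_prog and find_single_arena() are
hypothetical stand-ins, progs without an arena are assumed to be skipped,
and the error codes are picked for the example only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; only the shape matters. */
struct demo_arena {
	int id;
};

struct demo_prog {
	struct demo_arena *arena;	/* models prog->aux->arena, may be NULL */
};

/*
 * Walk all member progs and require that they agree on a single arena.
 * Returns 0 and stores the arena in @out, -EINVAL if two progs reference
 * different arenas, or -ENOENT if no prog references one at all.
 */
static int find_single_arena(const struct demo_prog *progs, size_t nr,
			     const struct demo_arena **out)
{
	const struct demo_arena *found = NULL;

	assert(out != NULL);

	for (size_t i = 0; i < nr; i++) {
		const struct demo_arena *a = progs[i].arena;

		if (!a)
			continue;	/* assumption: arena-less progs are ignored */
		if (found && found != a)
			return -EINVAL;	/* more than one arena referenced */
		found = a;
	}

	if (!found)
		return -ENOENT;		/* no arena to place the cmask in */

	*out = found;
	return 0;
}
```

With a single arena known up front, the kernel side has one unambiguous
place to carve allocations from.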

Base
----

sched_ext/for-7.2 (1136fb1213d1) with cmask-prep-v2.3 applied:

  https://lore.kernel.org/r/20260519075838.2706712-1-tj@kernel.org

Git tree:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git arena-direct-v3

 Documentation/bpf/kfuncs.rst          |  14 +++
 arch/arm64/include/asm/pgtable.h      |   8 ++
 arch/arm64/mm/fault.c                 |  10 +-
 arch/x86/include/asm/pgtable.h        |   8 ++
 arch/x86/mm/fault.c                   |  12 +-
 include/linux/bpf.h                   |  14 +++
 include/linux/bpf_defs.h              |  11 ++
 include/linux/pgtable.h               |  26 ++++
 kernel/bpf/arena.c                    | 216 +++++++++++++++++++++++++++-------
 kernel/bpf/bpf_struct_ops.c           |  36 ++++++
 kernel/bpf/core.c                     |   5 +
 kernel/sched/build_policy.c           |   4 +
 kernel/sched/ext.c                    | 135 ++++++++++++++++++++-
 kernel/sched/ext_arena.c              | 127 ++++++++++++++++++++
 kernel/sched/ext_arena.h              |  18 +++
 kernel/sched/ext_cid.c                |  20 +---
 kernel/sched/ext_internal.h           |  23 +++-
 tools/sched_ext/include/scx/cid.bpf.h |  52 --------
 tools/sched_ext/scx_qmap.bpf.c        |   5 +-
 19 files changed, 616 insertions(+), 128 deletions(-)

Thanks.

--
tejun