From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Tejun Heo <tj@kernel.org>, David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@linux.dev>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>, Thomas Gleixner <tglx@kernel.org>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
Dave Hansen <dave.hansen@linux.intel.com>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
Emil Tsalapatis <emil@etsalapatis.com>,
sched-ext@lists.linux.dev, bpf@vger.kernel.org, x86@kernel.org,
linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/8] bpf: Recover arena kernel faults with scratch page
Date: Tue, 26 May 2026 14:45:25 +0200
Message-ID: <7fd673df-22f3-4d70-a779-ea0b878188b3@kernel.org>
In-Reply-To: <20260522172219.1423324-3-tj@kernel.org>
On 5/22/26 19:22, Tejun Heo wrote:
> From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>
> BPF arena usage is becoming more prevalent, but kernel <-> BPF communication
> over arena memory is awkward today. Data has to be staged through a trusted
> kernel pointer with extra code and copying on the BPF side. While reads
> through arena pointers can use a fault-safe helper, writes don't have a good
> solution. The in-line alternative would need instruction emulation or asm
> fixup labels.
>
> Enable direct kernel-side reads and writes within GUARD_SZ / 2 of any
> handed-in arena pointer, without bounds checking. A per-arena scratch page
> is installed by the arch fault path into empty arena kernel PTEs - x86 from
> page_fault_oops() for not-present faults, arm64 from __do_kernel_fault() for
> translation faults, both after the existing exception-table and KFENCE
> handling. The faulting instruction retries and the access is also reported
> through the program's BPF stream, preserving error reporting.
>
> bpf_prog_find_from_stack() resolves the current BPF program (and its arena)
> from the kernel stack - no new bpf_run_ctx state is added. Recovery covers
> the 4 GiB arena plus the upper half-guard (GUARD_SZ / 2). The lower
> half-guard is excluded because well-behaved kfuncs only access forward from
> arena pointers. The kfunc-author contract - access at most GUARD_SZ / 2 past
> a handed-in pointer - is documented in Documentation/bpf/kfuncs.rst.
>
> The install is lock-free via ptep_try_set(). On race-loss the winning
> installer's PTE is already valid, so the access retry succeeds. The arena
> clear path uses ptep_get_and_clear() so installer and clearer race through
> atomic accessors. No flush_tlb_kernel_range() afterwards. Stale "not mapped"
> entries just cause one extra re-fault, cheaper than a global IPI on every
> install.
>
> Scratch exists only to keep the kernel from oopsing on an in-line arena
> access. Its presence at a PTE means the BPF program has already
> malfunctioned, and the violation is reported through the program's BPF
> stream. The only requirement for behavior on a scratched PTE is that the
> kernel doesn't crash. In particular, any user-side access through such a PTE
> may segfault. The shared scratch page is freed once during map destruction.
>
> BPF instruction faults continue to use the existing JIT exception-table
> path. This patch changes only the kernel-text fault path. No UAPI flag is
> added. The new behavior is the default.
>
> v2: Use ptep_get_and_clear() in apply_range_clear_cb(). (David)
> v3: Stub bpf_arena_handle_page_fault() for !CONFIG_BPF_SYSCALL. (lkp)
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
> Cc: David Hildenbrand <david@kernel.org>
> ---
> Documentation/bpf/kfuncs.rst | 14 +++
> arch/arm64/mm/fault.c | 10 +-
> arch/x86/mm/fault.c | 12 ++-
> include/linux/bpf.h | 1 +
> include/linux/bpf_defs.h | 19 ++++
> kernel/bpf/arena.c | 177 +++++++++++++++++++++++++++--------
> kernel/bpf/core.c | 5 +
> 7 files changed, 191 insertions(+), 47 deletions(-)
> create mode 100644 include/linux/bpf_defs.h
>
> diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
> index 75e6c078e0e7..6d497e720998 100644
> --- a/Documentation/bpf/kfuncs.rst
> +++ b/Documentation/bpf/kfuncs.rst
> @@ -462,6 +462,20 @@ In order to accommodate such requirements, the verifier will enforce strict
> PTR_TO_BTF_ID type matching if two types have the exact same name, with one
> being suffixed with ``___init``.
>
> +2.8 Accessing arena memory through kfunc arguments
> +--------------------------------------------------
> +
> +A read or write at any address inside an arena does not oops the kernel.
> +Unallocated arena pages are lazily backed by a scratch page and the
> +access is reported through the program's BPF stream as an error. Only
> +the BPF program's correctness is affected; the kernel itself remains
> +intact.
> +
> +The arena is followed by a ``GUARD_SZ / 2`` (32 KiB) guard region that
> +is also covered by this recovery. A kfunc handed an arena pointer may
> +therefore access up to ``GUARD_SZ / 2`` past it without bounds-checking
> +against the arena. Larger accesses must verify the range explicitly.
> +
> .. _BPF_kfunc_lifecycle_expectations:
>
> 3. kfunc lifecycle expectations
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 920a8b244d59..0d58d667fcd8 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -9,6 +9,7 @@
>
> #include <linux/acpi.h>
> #include <linux/bitfield.h>
> +#include <linux/bpf_defs.h>
> #include <linux/extable.h>
> #include <linux/kfence.h>
> #include <linux/signal.h>
> @@ -416,9 +417,12 @@ static void __do_kernel_fault(unsigned long addr, unsigned long esr,
> } else if (addr < PAGE_SIZE) {
> msg = "NULL pointer dereference";
> } else {
> - if (esr_fsc_is_translation_fault(esr) &&
> - kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> - return;
> + if (esr_fsc_is_translation_fault(esr)) {
> + if (kfence_handle_page_fault(addr, esr & ESR_ELx_WNR, regs))
> + return;
> + if (bpf_arena_handle_page_fault(addr, esr & ESR_ELx_WNR, regs->pc))
> + return;
> + }
>
> msg = "paging request";
> }
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f0e77e084482..b0f103ddbd23 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -8,6 +8,7 @@
> #include <linux/sched/task_stack.h> /* task_stack_*(), ... */
> #include <linux/kdebug.h> /* oops_begin/end, ... */
> #include <linux/memblock.h> /* max_low_pfn */
> +#include <linux/bpf_defs.h> /* bpf_arena_handle_page_fault */
> #include <linux/kfence.h> /* kfence_handle_page_fault */
> #include <linux/kprobes.h> /* NOKPROBE_SYMBOL, ... */
> #include <linux/mmiotrace.h> /* kmmio_handler, ... */
> @@ -688,10 +689,13 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
> if (IS_ENABLED(CONFIG_EFI))
> efi_crash_gracefully_on_page_fault(address);
>
> - /* Only not-present faults should be handled by KFENCE. */
> - if (!(error_code & X86_PF_PROT) &&
> - kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs))
> - return;
> + /* Only not-present faults should be handled by KFENCE or BPF arena. */
> + if (!(error_code & X86_PF_PROT)) {
> + if (kfence_handle_page_fault(address, error_code & X86_PF_WRITE, regs))
> + return;
> + if (bpf_arena_handle_page_fault(address, error_code & X86_PF_WRITE, regs->ip))
> + return;
> + }
>
> oops:
> /*
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 0136a108d083..831996c411cf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -6,6 +6,7 @@
>
> #include <uapi/linux/bpf.h>
> #include <uapi/linux/filter.h>
> +#include <linux/bpf_defs.h>
>
> #include <crypto/sha2.h>
> #include <linux/workqueue.h>
> diff --git a/include/linux/bpf_defs.h b/include/linux/bpf_defs.h
> new file mode 100644
> index 000000000000..2185cd3966d4
> --- /dev/null
> +++ b/include/linux/bpf_defs.h
> @@ -0,0 +1,19 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Subset of bpf.h declarations, split out so files that need only these
> + * declarations can avoid bpf.h's full include cost.
> + */
> +#ifndef _LINUX_BPF_DEFS_H
> +#define _LINUX_BPF_DEFS_H
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +bool bpf_arena_handle_page_fault(unsigned long addr, bool is_write, unsigned long fault_ip);
> +#else
> +static inline bool bpf_arena_handle_page_fault(unsigned long addr, bool is_write,
> + unsigned long fault_ip)
> +{
> + return false;
> +}
> +#endif
> +
> +#endif /* _LINUX_BPF_DEFS_H */
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 08d008cc471e..1c0b87ecc817 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -53,6 +53,7 @@ struct bpf_arena {
> u64 user_vm_start;
> u64 user_vm_end;
> struct vm_struct *kern_vm;
> + struct page *scratch_page;
> struct range_tree rt;
> /* protects rt */
> rqspinlock_t spinlock;
> @@ -118,6 +119,11 @@ struct apply_range_data {
> int i;
> };
>
> +struct clear_range_data {
> + struct llist_head *free_pages;
> + struct page *scratch_page;
> +};
> +
> static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
> {
> struct apply_range_data *d = data;
> @@ -144,33 +150,59 @@ static void flush_vmap_cache(unsigned long start, unsigned long size)
> flush_cache_vmap(start, start + size);
> }
There is still a chance that apply_range_set_cb() races with scratch page
insertion, right? Shouldn't we be using ptep_try_set() there as well?

The nasty part is handling configurations where ptep_try_set() is not
available.

Something like the following on top, maybe?

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 49a8f7b1beef5..086bea3f3698e 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -122,19 +122,27 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
 {
 	struct apply_range_data *d = data;
 	struct page *page;
+	pte_t pteval;
 	if (!data)
 		return 0;
-	/* sanity check */
-	if (unlikely(!pte_none(ptep_get(pte))))
-		return -EBUSY;
 	page = d->pages[d->i];
 	/* paranoia, similar to vmap_pages_pte_range() */
 	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
 		return -EINVAL;
-	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	pteval = mk_pte(page, PAGE_KERNEL);
+#ifdef ptep_try_set
+	if (unlikely(!ptep_try_set(pte, pteval)))
+		return -EBUSY;
+#else
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	set_pte_at(&init_mm, addr, pte, pteval);
+#endif
 	d->i++;
 	return 0;
 }
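
To illustrate why the separate pte_none() check followed by set_pte_at() is
racy while a try-set is not, here is a minimal userspace model. It is only a
sketch under stated assumptions: model_pte_t, model_ptep_try_set(),
model_map_page() and model_demo() are hypothetical names, C11 atomics stand
in for the page table accessors, and it assumes ptep_try_set() behaves like a
compare-and-exchange against an empty entry. It is not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; NOT the kernel's pte_t or ptep_try_set(). */
typedef _Atomic uint64_t model_pte_t;
#define MODEL_PTE_NONE 0ULL

/*
 * Install val only if the slot is still empty. The check and the store
 * happen as one atomic step, so exactly one concurrent installer wins.
 */
static inline bool model_ptep_try_set(model_pte_t *ptep, uint64_t val)
{
	uint64_t expected = MODEL_PTE_NONE;

	return atomic_compare_exchange_strong(ptep, &expected, val);
}

/*
 * Model of the allocation-side callback: report -EBUSY when something
 * else (for example the scratch installer) already populated the slot.
 */
static inline int model_map_page(model_pte_t *ptep, uint64_t val)
{
	if (!model_ptep_try_set(ptep, val))
		return -EBUSY;
	return 0;
}

/* Single-threaded walk-through of the "scratch installer won" ordering. */
void model_demo(void)
{
	model_pte_t pte = MODEL_PTE_NONE;
	const uint64_t scratch = 0x1000, real_page = 0x2000;

	/* The fault path installs the scratch page first ... */
	assert(model_ptep_try_set(&pte, scratch));
	/* ... so the allocation path must fail instead of overwriting it. */
	assert(model_map_page(&pte, real_page) == -EBUSY);
	assert(atomic_load(&pte) == scratch);
}
```

With a plain load followed by a store, a second installer could populate the
slot between the two steps and then be silently overwritten. The
compare-and-exchange closes that window, which is the property the
#ifdef ptep_try_set branch above relies on.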
--
Cheers,
David