* Re: [RFC PATCH 5/9] bpf: Add bpf_prog_for_each_used_map()
[not found] ` <20260427105109.2554518-6-tj@kernel.org>
@ 2026-05-11 21:44 ` Kumar Kartikeya Dwivedi
0 siblings, 0 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-11 21:44 UTC (permalink / raw)
To: Tejun Heo
Cc: Alexei Starovoitov, Emil Tsalapatis, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, linux-kernel
On Mon, 27 Apr 2026 at 12:51, Tejun Heo <tj@kernel.org> wrote:
>
> Wrap the prog->aux->used_maps[] walk and its used_maps_mutex behind a
> helper. Existing in-tree callers open-code the same lock + iterate pattern
> (e.g. bpf_check_tail_call in core.c, the verifier and syscall paths); a
> sched_ext follow-up needs the same loop and would otherwise reach into
> bpf_prog_aux directly.
>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> include/linux/bpf.h | 3 +++
> kernel/bpf/core.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 32 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f4e4360b81f6..587e5ff387bf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2338,6 +2338,9 @@ static inline bool map_type_contains_progs(struct bpf_map *map)
>
> bool bpf_prog_map_compatible(struct bpf_map *map, const struct bpf_prog *fp);
> int bpf_prog_calc_tag(struct bpf_prog *fp);
> +int bpf_prog_for_each_used_map(struct bpf_prog *prog,
> + int (*cb)(struct bpf_map *map, void *data),
> + void *data);
>
> const struct bpf_func_proto *bpf_get_trace_printk_proto(void);
> const struct bpf_func_proto *bpf_get_trace_vprintk_proto(void);
> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
> index 066b86e7233c..aa590a817176 100644
> --- a/kernel/bpf/core.c
> +++ b/kernel/bpf/core.c
> @@ -2510,6 +2510,35 @@ static int bpf_check_tail_call(const struct bpf_prog *fp)
> return ret;
> }
>
> +/**
> + * bpf_prog_for_each_used_map - Invoke @cb for each map @prog references
> + * @prog: BPF program whose used_maps to walk
> + * @cb: callback invoked once per map; non-zero return stops iteration
> + * @data: opaque argument passed to @cb
> + *
> + * Holds prog->aux->used_maps_mutex across the walk.
> + *
> + * Return 0 if iteration completed, otherwise the first non-zero @cb return.
> + */
> +int bpf_prog_for_each_used_map(struct bpf_prog *prog,
> + int (*cb)(struct bpf_map *map, void *data),
> + void *data)
> +{
> + struct bpf_prog_aux *aux = prog->aux;
> + int ret = 0;
> + u32 i;
> +
> + mutex_lock(&aux->used_maps_mutex);
> + for (i = 0; i < aux->used_map_cnt; i++) {
> + ret = cb(aux->used_maps[i], data);
> + if (ret)
> + break;
> + }
> + mutex_unlock(&aux->used_maps_mutex);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(bpf_prog_for_each_used_map);
> +
Since each program only has one arena, and you use this to determine
whether the program's arena has a flag, why not just add a
bpf_prog_arena() accessor and check the result's flag directly? You
can do prog->aux->arena inside it to return the bpf_map pointer.
> static bool bpf_prog_select_interpreter(struct bpf_prog *fp)
> {
> bool select_interpreter = false;
> --
> 2.53.0
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
[not found] ` <20260427105109.2554518-3-tj@kernel.org>
@ 2026-05-12 0:31 ` Kumar Kartikeya Dwivedi
2026-05-12 2:05 ` Emil Tsalapatis
0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 0:31 UTC (permalink / raw)
To: Tejun Heo
Cc: Alexei Starovoitov, Emil Tsalapatis, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, linux-kernel
On Mon, 27 Apr 2026 at 12:51, Tejun Heo <tj@kernel.org> wrote:
>
> bpf_arena's kern_vm range is selectively populated: only allocated pages
> have PTEs. This catches a narrow class of buggy BPF programs that
> dereference unmapped arena addresses, but the protection is shallow - within
> the allocated set there are countless ways for a buggy program to corrupt
> arena memory.
>
> It does, however, impose cost on the kernel side accesses. A kfunc or
> struct_ops callback that wants to consume an arena pointer cannot simply
> load through it; the page may have been freed underneath, so the access has
> to go through copy_from_kernel_nofault(). Out-parameter writes currently
> have no equivalent.
>
> Arena is becoming the primary memory model for BPF programs, and more kfunc
> / struct_ops surfaces will want to read and write arena memory directly. The
> actual answer for catching arena memory bugs is arena ASAN, which addresses
> all memory access bugs meaningfully. Given that, it's worth offering an
> opt-in mode that drops the partial fault protection in exchange for cheap
> direct kernel-side access.
>
> Add BPF_F_ARENA_MAP_ALWAYS. Arenas created with this flag allocate a
> per-arena "garbage" page and pre-populate every PTE in the kern_vm range to
> point at it. arena_alloc_pages() replaces the garbage PTE with a real page;
> arena_free_pages() restores the garbage PTE instead of clearing.
> arena_vm_fault() ignores the garbage page so user-side fault semantics are
> unchanged.
>
> Stores into garbage-backed addresses are silently absorbed; loads return
> indeterminate bytes. Userspace mappings are unaffected. The flag is opt-in -
> arenas without it behave exactly as before.
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
If we go down this route, we should probably make this flag the
default behavior. Otherwise, we cannot universally enable passing
arena memory into kfuncs. Every subsystem will have to check the flag,
we'll have to gate being able to pass memory based on the flag's
presence, etc., which just adds complexity everywhere. It will
eliminate a few patches in this set too. From the programmer's
perspective, program behavior isn't changing much, so we can use a
zeroed page (to guarantee that faulting loads return 0) instead of
setting the PTE to NULL. While at it, we should drop
bpf_prog_report_arena_violation and its various users.
Summarizing past discussions on all this, with more details on various
pros/cons:
Currently, the semantics for a fault dictate that the program simply
continues, and the destination register becomes 0. One could argue the
ideal form should have been to abort the program on fault, but that
wasn't possible at the time of implementation. We added fault
reporting to the program's streams to improve debuggability. Now since
we have an ASAN implementation, you can likely run that to catch
memory safety problems. An argument against this is that it doesn't
help surface a class of issues for production programs. We don't have
data on whether stray faults or memory corruption within present pages
is the more common occurrence of bugs in the small set of programs
using arenas, so it is hard to pass any clear judgement. One thing we do
lose is faults on NULL-derefs, which are likely common, but Emil had
some ideas on that.
Another thing we lose is the ability to build something like GWP-Asan
[0] that we can run in production programs without paying much of the
performance cost by sampling allocations we want to detect bugs for.
But between ASAN and Rust-BPF plans, I am not sure how compelling it
will be going forward. So while it's sort of sad to lose the ability
to get fault feedback, it is also non-trivial to enable direct access to
arena memory for the kernel while preserving faults (I won't go into
the details here) without using fault-safe memcpy to move data from/to
arena on the kernel side.
[0]: https://llvm.org/docs/GwpAsan.html
> [...]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 0:31 ` [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access Kumar Kartikeya Dwivedi
@ 2026-05-12 2:05 ` Emil Tsalapatis
2026-05-12 2:43 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 11+ messages in thread
From: Emil Tsalapatis @ 2026-05-12 2:05 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Tejun Heo
Cc: Alexei Starovoitov, Emil Tsalapatis, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, linux-kernel
On Mon May 11, 2026 at 8:31 PM EDT, Kumar Kartikeya Dwivedi wrote:
> On Mon, 27 Apr 2026 at 12:51, Tejun Heo <tj@kernel.org> wrote:
>>
>> bpf_arena's kern_vm range is selectively populated: only allocated pages
>> have PTEs. This catches a narrow class of buggy BPF programs that
>> dereference unmapped arena addresses, but the protection is shallow - within
>> the allocated set there are countless ways for a buggy program to corrupt
>> arena memory.
>>
>> It does, however, impose cost on the kernel side accesses. A kfunc or
>> struct_ops callback that wants to consume an arena pointer cannot simply
>> load through it; the page may have been freed underneath, so the access has
>> to go through copy_from_kernel_nofault(). Out-parameter writes currently
>> have no equivalent.
>>
>> Arena is becoming the primary memory model for BPF programs, and more kfunc
>> / struct_ops surfaces will want to read and write arena memory directly. The
>> actual answer for catching arena memory bugs is arena ASAN, which addresses
>> all memory access bugs meaningfully. Given that, it's worth offering an
>> opt-in mode that drops the partial fault protection in exchange for cheap
>> direct kernel-side access.
>>
>> Add BPF_F_ARENA_MAP_ALWAYS. Arenas created with this flag allocate a
>> per-arena "garbage" page and pre-populate every PTE in the kern_vm range to
>> point at it. arena_alloc_pages() replaces the garbage PTE with a real page;
>> arena_free_pages() restores the garbage PTE instead of clearing.
>> arena_vm_fault() ignores the garbage page so user-side fault semantics are
>> unchanged.
>>
>> Stores into garbage-backed addresses are silently absorbed; loads return
>> indeterminate bytes. Userspace mappings are unaffected. The flag is opt-in -
>> arenas without it behave exactly as before.
>>
>> Suggested-by: Alexei Starovoitov <ast@kernel.org>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> ---
>
> If we go down this route, we should probably make this flag the
> default behavior. Otherwise, we cannot universally enable passing
> arena memory into kfuncs. Every subsystem will have to check the flag,
> we'll have to gate being able to pass memory based on the flag's
> presence, etc., which just adds complexity everywhere. It will
> eliminate a few patches in this set too. From the programmer's
> perspective, program behavior isn't changing much, so we can use a
> zeroed page (to guarantee that faulting loads return 0) instead of
> setting the PTE to NULL. While at it, we should drop
> bpf_prog_report_arena_violation and its various users.
>
> Summarizing past discussions on all this, with more details on various
> pros/cons:
>
> Currently, the semantics for a fault dictate that the program simply
> continues, and the destination register becomes 0. One could argue the
> ideal form should have been to abort the program on fault, but that
> wasn't possible at the time of implementation. We added fault
> reporting to the program's streams to improve debuggability. Now since
> we have an ASAN implementation, you can likely run that to catch
> memory safety problems. An argument against this is that it doesn't
> help surface a class of issues for production programs. We don't have
> data on whether stray faults or memory corruption within present pages
> is the more common occurrence of bugs in the small set of programs
> using arenas, so it is hard to pass any clear judgement. One thing we do
> lose is faults on NULL-derefs, which are likely common, but Emil had
> some ideas on that.
>
> Another thing we lose is the ability to build something like GWP-Asan
> [0] that we can run in production programs without paying much of the
> performance cost by sampling allocations we want to detect bugs for.
> But between ASAN and Rust-BPF plans, I am not sure how compelling it
> will be going forward. So while it's sort of sad to lose the ability
> to get fault feedback, it is also non-trivial to enable direct access to
> arena memory for the kernel while preserving faults (I won't go into
> the details here) without using fault-safe memcpy to move data from/to
> arena on the kernel side.
I completely agree with the discussion points, though imo we do not
need to make this flag the default if we support it. The complexity is
mostly checking whether a kfunc that takes arena arguments accepts the
burden of validating them, or if it depends on the new flag to prevent
faults. Any new kfuncs should have clear semantics on that, and we can
validate proper behavior with selftests.
Whatever we choose, I am strongly in favor of keeping some kind of error
reporting when touching the first page in the arena. This has been by
far the biggest indicator of bugs, and if we only keep ASAN then we lose
our strongest signal for most use cases. This is made even worse by the
fact that the new flag is incompatible with GWP-Asan, making it too costly to
run sanitization at scale.
For the flag, the solution would be to move reserving the low addresses
of arenas from libarena to the arena itself. The arena would have a low
watermark below which it would retain the existing faulting behavior.
The kfunc would bounds-check the arguments to ensure they're not
below the low watermark, and fail if they are.
It's not ideal - it adds the burden of bounds checking into the
kfunc - but it's reasonable that arena-related kfuncs should take into
account the arena's semantics.
>
> [0]: https://llvm.org/docs/GwpAsan.html
>
>> [...]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 2:05 ` Emil Tsalapatis
@ 2026-05-12 2:43 ` Kumar Kartikeya Dwivedi
2026-05-12 3:25 ` Alexei Starovoitov
2026-05-12 3:42 ` Emil Tsalapatis
0 siblings, 2 replies; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 2:43 UTC (permalink / raw)
To: Emil Tsalapatis
Cc: Tejun Heo, Alexei Starovoitov, Eduard Zingerman, Andrii Nakryiko,
David Vernet, Andrea Righi, Changwoo Min, bpf, sched-ext,
linux-kernel
On Tue, 12 May 2026 at 04:05, Emil Tsalapatis <emil@etsalapatis.com> wrote:
>
> On Mon May 11, 2026 at 8:31 PM EDT, Kumar Kartikeya Dwivedi wrote:
> > On Mon, 27 Apr 2026 at 12:51, Tejun Heo <tj@kernel.org> wrote:
> >>
> >> bpf_arena's kern_vm range is selectively populated: only allocated pages
> >> have PTEs. This catches a narrow class of buggy BPF programs that
> >> dereference unmapped arena addresses, but the protection is shallow - within
> >> the allocated set there are countless ways for a buggy program to corrupt
> >> arena memory.
> >>
> >> It does, however, impose cost on the kernel side accesses. A kfunc or
> >> struct_ops callback that wants to consume an arena pointer cannot simply
> >> load through it; the page may have been freed underneath, so the access has
> >> to go through copy_from_kernel_nofault(). Out-parameter writes currently
> >> have no equivalent.
> >>
> >> Arena is becoming the primary memory model for BPF programs, and more kfunc
> >> / struct_ops surfaces will want to read and write arena memory directly. The
> >> actual answer for catching arena memory bugs is arena ASAN, which addresses
> >> all memory access bugs meaningfully. Given that, it's worth offering an
> >> opt-in mode that drops the partial fault protection in exchange for cheap
> >> direct kernel-side access.
> >>
> >> Add BPF_F_ARENA_MAP_ALWAYS. Arenas created with this flag allocate a
> >> per-arena "garbage" page and pre-populate every PTE in the kern_vm range to
> >> point at it. arena_alloc_pages() replaces the garbage PTE with a real page;
> >> arena_free_pages() restores the garbage PTE instead of clearing.
> >> arena_vm_fault() ignores the garbage page so user-side fault semantics are
> >> unchanged.
> >>
> >> Stores into garbage-backed addresses are silently absorbed; loads return
> >> indeterminate bytes. Userspace mappings are unaffected. The flag is opt-in -
> >> arenas without it behave exactly as before.
> >>
> >> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> >> Signed-off-by: Tejun Heo <tj@kernel.org>
> >> ---
> >
> > If we go down this route, we should probably make this flag the
> > default behavior. Otherwise, we cannot universally enable passing
> > arena memory into kfuncs. Every subsystem will have to check the flag,
> > we'll have to gate being able to pass memory based on the flag's
> > presence, etc., which just adds complexity everywhere. It will
> > eliminate a few patches in this set too. From the programmer's
> > perspective, program behavior isn't changing much, so we can use a
> > zeroed page (to guarantee that faulting loads return 0) instead of
> > setting the PTE to NULL. While at it, we should drop
> > bpf_prog_report_arena_violation and its various users.
> >
> > Summarizing past discussions on all this, with more details on various
> > pros/cons:
> >
> > Currently, the semantics for a fault dictate that the program simply
> > continues, and the destination register becomes 0. One could argue the
> > ideal form should have been to abort the program on fault, but that
> > wasn't possible at the time of implementation. We added fault
> > reporting to the program's streams to improve debuggability. Now since
> > we have an ASAN implementation, you can likely run that to catch
> > memory safety problems. An argument against this is that it doesn't
> > help surface a class of issues for production programs. We don't have
> > data on whether stray faults or memory corruption within present pages
> > is the more common occurrence of bugs in the small set of programs
> > using arenas, so it is hard to pass any clear judgement. One thing we do
> > lose is faults on NULL-derefs, which are likely common, but Emil had
> > some ideas on that.
> >
> > Another thing we lose is the ability to build something like GWP-Asan
> > [0] that we can run in production programs without paying much of the
> > performance cost by sampling allocations we want to detect bugs for.
> > But between ASAN and Rust-BPF plans, I am not sure how compelling it
> > will be going forward. So while it's sort of sad to lose the ability
> > to get fault feedback, it is also non-trivial to enable direct access to
> > arena memory for the kernel while preserving faults (I won't go into
> > the details here) without using fault-safe memcpy to move data from/to
> > arena on the kernel side.
>
> I completely agree with the discussion points, though imo we do not
> need to make this flag the default if we support it. The complexity is
> mostly checking whether a kfunc that takes arena arguments accepts the
> burden of validating them, or if it depends on the new flag to prevent
> faults. Any new kfuncs should have clear semantics on that, and we can
> validate proper behavior with selftests.
The main problem is accessing the arena or arena flags etc. to decide
whether we can read / write the address. It needs to be passed around
or retrieved at runtime from within the kfunc. It also makes it
conditional on the flag, so depending on whether the flag is set my
program will load or not load, since the verifier prevents me from
passing arena memory as an argument to a kfunc. In practice, once
sched-ext requires it for its programs it will de facto be the default,
since that's where arenas are used (for now at least). At that point,
why bother with the flag?
>
> Whatever we choose, I am strongly in favor of keeping some kind of error
> reporting when touching the first page in the arena. This has been by
> far the biggest indicator of bugs, and if we only keep ASAN then we lose
> our strongest signal for most use cases. This is made even worse by the
> fact the new flag is incompatible with GWP-Asan, making it too costly to
> run sanitization at scale.
>
> For the flag, the solution would be to move reserving the low addresses
> of arenas from libarena to the arena itself. The arena would have a low
> watermark below which it would retain the existing faulting behavior.
> The kfunc would bounds-check the arguments to ensure they're not
> below the low watermark, and fail if they are.
>
> It's not ideal - it adds the burden of bounds checking into the
> kfunc - but it's reasonable that arena-related kfuncs should take into
> account the arena's semantics.
Another data point to consider is that if we omit this initial
faultable region for catching NULL-derefs, we gain the ability to
allow passing arena memory into any kfunc where memory arguments are
taken, which might be pretty useful. We won't be able to do it if we
have the initial region as faultable since we can't rely on kernel
writes hitting a page without any checks on the memory region. You
must treat arena arguments specially and cannot mix them with other
memory arguments.
The other way to keep the faultable region in the beginning would be
to emit some assertion/runtime check in the verifier and abort the
program unless the arena memory being passed into the helper is
accessible for the size parameter used in the helper call, or fix the
pointer up to some page that is likely to be present.
In practice, if most users set the flag then I think you likely lose
the benefit of the default behavior, or cannot rely on it anyway. When
it becomes a dependency for passing arena memory in the kernel to
helpers, most users will blindly set it.
So in the end, it boils down to whether we think retaining faults
(e.g., conditionally for the NULL case) is critical, and whether we
have some convincing evidence for it.
If not, the best course to me seems to be to make the flag behavior
default, and just rely on ASan (and Rust in the future) to prevent any
memory safety issues, and drop the stream based feedback on fault,
etc.
>
> >
> > [0]: https://llvm.org/docs/GwpAsan.html
> >
> >> [...]
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 2:43 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 3:25 ` Alexei Starovoitov
2026-05-12 3:48 ` Kumar Kartikeya Dwivedi
2026-05-12 3:42 ` Emil Tsalapatis
1 sibling, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 3:25 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Emil Tsalapatis
Cc: Tejun Heo, Alexei Starovoitov, Eduard Zingerman, Andrii Nakryiko,
David Vernet, Andrea Righi, Changwoo Min, bpf, sched-ext,
linux-kernel
On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
>
> If not, the best course to me seems to be to make the flag behavior
> default, and just rely on ASan (and Rust in the future) to prevent any
> memory safety issues, and drop the stream based feedback on fault,
> etc.
Agree that this needs to be the new default without new uapi flags.
How about we tweak the idea further.
Let all arena pages be unmapped initially. bpf progs will fault
on them and the faults will be reported via bpf_streams.
But we also prepare one "scratch page". Let's use this name,
since "garbage page" reads too dirty.
When the kernel faults, we populate the pte with that scratch page
and let the kernel code retry.
To implement it, page_fault_oops() can have a callback
into a bpf/arena helper, similar to kfence_handle_page_fault().
If the fault address is in an arena, do a kfence_unprotect()-like fixup.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 2:43 ` Kumar Kartikeya Dwivedi
2026-05-12 3:25 ` Alexei Starovoitov
@ 2026-05-12 3:42 ` Emil Tsalapatis
1 sibling, 0 replies; 11+ messages in thread
From: Emil Tsalapatis @ 2026-05-12 3:42 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Emil Tsalapatis
Cc: Tejun Heo, Alexei Starovoitov, Eduard Zingerman, Andrii Nakryiko,
David Vernet, Andrea Righi, Changwoo Min, bpf, sched-ext,
linux-kernel
On Mon May 11, 2026 at 10:43 PM EDT, Kumar Kartikeya Dwivedi wrote:
> On Tue, 12 May 2026 at 04:05, Emil Tsalapatis <emil@etsalapatis.com> wrote:
>>
>> On Mon May 11, 2026 at 8:31 PM EDT, Kumar Kartikeya Dwivedi wrote:
>> > On Mon, 27 Apr 2026 at 12:51, Tejun Heo <tj@kernel.org> wrote:
>> >>
>> >> bpf_arena's kern_vm range is selectively populated: only allocated pages
>> >> have PTEs. This catches a narrow class of buggy BPF programs that
>> >> dereference unmapped arena addresses, but the protection is shallow - within
>> >> the allocated set there are countless ways for a buggy program to corrupt
>> >> arena memory.
>> >>
>> >> It does, however, impose cost on the kernel side accesses. A kfunc or
>> >> struct_ops callback that wants to consume an arena pointer cannot simply
>> >> load through it; the page may have been freed underneath, so the access has
>> >> to go through copy_from_kernel_nofault(). Out-parameter writes currently
>> >> have no equivalent.
>> >>
>> >> Arena is becoming the primary memory model for BPF programs, and more kfunc
>> >> / struct_ops surfaces will want to read and write arena memory directly. The
>> >> actual answer for catching arena memory bugs is arena ASAN, which addresses
>> >> all memory access bugs meaningfully. Given that, it's worth offering an
>> >> opt-in mode that drops the partial fault protection in exchange for cheap
>> >> direct kernel-side access.
>> >>
>> >> Add BPF_F_ARENA_MAP_ALWAYS. Arenas created with this flag allocate a
>> >> per-arena "garbage" page and pre-populate every PTE in the kern_vm range to
>> >> point at it. arena_alloc_pages() replaces the garbage PTE with a real page;
>> >> arena_free_pages() restores the garbage PTE instead of clearing.
>> >> arena_vm_fault() ignores the garbage page so user-side fault semantics are
>> >> unchanged.
>> >>
>> >> Stores into garbage-backed addresses are silently absorbed; loads return
>> >> indeterminate bytes. Userspace mappings are unaffected. The flag is opt-in -
>> >> arenas without it behave exactly as before.
>> >>
>> >> Suggested-by: Alexei Starovoitov <ast@kernel.org>
>> >> Signed-off-by: Tejun Heo <tj@kernel.org>
>> >> ---
>> >
>> > If we go down this route, we should probably make this flag the
>> > default behavior. Otherwise, we cannot universally enable passing
>> > arena memory into kfuncs. Every subsystem will have to check the flag,
>> > we'll have to gate being able to pass memory based on the flag's
>> > presence, etc., which just adds complexity everywhere. It will
>> > eliminate a few patches in this set too. From the programmer's
> >> > perspective, program behavior isn't changing much, so we can use a
> >> > zeroed page (to guarantee that faulting loads return 0) instead of
> >> > setting the PTE to NULL. While at it, we should drop
> >> > bpf_prog_report_arena_violation and its various users.
>> >
>> > Summarizing past discussions on all this, with more details on various
>> > pros/cons:
>> >
>> > Currently, the semantics for a fault dictate that the program simply
>> > continues, and the destination register becomes 0. One could argue the
>> > ideal form should have been to abort the program on fault, but that
>> > wasn't possible at the time of implementation. We added fault
>> > reporting to the program's streams to improve debuggability. Now since
>> > we have an ASAN implementation, you can likely run that to catch
>> > memory safety problems. An argument against this is that it doesn't
>> > help surface a class of issues for production programs. We don't have
>> > data on whether stray faults or memory corruption within present pages
>> > is the more common occurrence of bugs in the small set of programs
> >> > using arenas, so it is hard to pass any clear judgement. One thing we do
>> > lose is faults on NULL-derefs, which are likely common, but Emil had
>> > some ideas on that.
>> >
>> > Another thing we lose is the ability to build something like GWP-Asan
>> > [0] that we can run in production programs without paying much of the
>> > performance cost by sampling allocations we want to detect bugs for.
>> > But between ASAN and Rust-BPF plans, I am not sure how compelling it
>> > will be going forward. So while it's sort of sad to lose the ability
> >> > to get fault feedback, it is also non-trivial to enable direct access to
>> > arena memory for the kernel while preserving faults (I won't go into
>> > the details here) without using fault-safe memcpy to move data from/to
>> > arena on the kernel side.
>>
>> I completely agree with the discussion points, though imo we do not
>> need to make this flag the default if we support it. The complexity is
>> mostly checking whether a kfunc that takes arena arguments accepts the
>> burden of validating them, or if it depends on the new flag to prevent
>> faults. Any new kfuncs should have clear semantics on that, and we can
>> validate proper behavior with selftests.
>
> The main problem is accessing the arena or arena flags etc. to decide
> whether we can read / write the address. It needs to be passed around
> or retrieved at runtime from within the kfunc. It also makes it
> conditional on the flag, so depending on whether the flag is set my
> program will load or not load, since the verifier prevents me from
> passing arena memory as an argument to a kfunc. In practice, once
> sched-ext requires it for its programs it will de facto be the default,
> since that's where arenas are used (for now at least). At that point,
> why bother with the flag?
>
While sched_ext is currently the main user of arenas, there are other
potential users - the *_ext's being developed in MM, for example If
we change the default behavior, we risk making arenas less useful for
them until Rust-BPF or an equivalent solution prevents memory access
errors further up the BPF software stack. I think ASAN can only partly
help in terms of reporting since it won't be on by default.
As an aside, whether Rust-BPF would solve the problem depends on whether
we allow/require unsafe Rust to be compilable down to BPF, and how much
users end up writing and deploying unsafe Rust. Anecdotally, I've seen
arena-based data structure implementations in Rust that are full of unsafe
blocks.
>>
>> Whatever we choose, I am strongly in favor of keeping some kind of error
>> reporting when touching the first page in the arena. This has been by
>> far the biggest indicator of bugs, and if we only keep ASAN then we lose
>> our strongest signal for most use cases. This is made even worse by the
>> fact that the new flag is incompatible with GWP-Asan, making it too costly to
>> run sanitization at scale.
>>
>> For the flag, the solution would be to move reserving the low addresses
>> of arenas from libarena to the arena itself. The arena would have a low
>> watermark below which it would retain the existing faulting behavior.
>> The kfunc would bounds-check the arguments to ensure they're not
>> below the low watermark, and fail if they are.
>>
>> It's not ideal - it adds the burden of bounds checking into the
>> kfunc - but it's reasonable that arena-related kfuncs should take into
>> account the arena's semantics.
>
> Another data point to consider is that if we omit this initial
> faultable region for catching NULL-derefs, we gain the ability to
> allow passing arena memory into any kfunc where memory arguments are
> taken, which might be pretty useful. We won't be able to do it if we
> have the initial region as faultable since we can't rely on kernel
> writes hitting a page without any checks on the memory region. You
> must treat arena arguments specially and cannot mix them with other
> memory arguments.
>
> The other way to keep the faultable region in the beginning would be
> to emit some assertion/runtime check in the verifier and abort the
> program unless the arena memory being passed into the helper is
> accessible for the size parameter used in the helper call, or fix the
> pointer up to some page that is likely to be present.
>
> In practice, if most users set the flag then I think you likely lose
> the benefit of the default behavior, or cannot rely on it anyway. When
> it becomes a dependency for passing arena memory in the kernel to
> helpers, most users will blindly set it.
>
> So in the end, it boils down to whether we think retaining faults
> (e.g., conditionally for the NULL case) is critical, and whether we
> have some convincing evidence for it.
Fair enough. While anecdotal, IME it makes a big difference to be able
to track NULL dereferences. Explicit checks within the program do help,
but at that point we are depending on the program implementing perfect
error handling.
>
> If not, the best course to me seems to be to make the flag behavior
> default, and just rely on ASan (and Rust in the future) to prevent any
> memory safety issues, and drop the stream based feedback on fault,
> etc.
>
>>
>> >
>> > [0]: https://llvm.org/docs/GwpAsan.html
>> >
>> >> [...]
>>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 3:25 ` Alexei Starovoitov
@ 2026-05-12 3:48 ` Kumar Kartikeya Dwivedi
2026-05-12 4:24 ` Alexei Starovoitov
0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 3:48 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Emil Tsalapatis, Tejun Heo, Alexei Starovoitov, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, linux-kernel
On Tue, 12 May 2026 at 05:25, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
> >
> > If not, the best course to me seems to be to make the flag behavior
> > default, and just rely on ASan (and Rust in the future) to prevent any
> > memory safety issues, and drop the stream based feedback on fault,
> > etc.
>
> Agree that this needs to be new default without new uapi flags.
> How about we tweak the idea further.
> Let all arena pages be unmapped initially. bpf progs will fault
> on them and will be reported via bpf_streams.
> But we also prepare one "scratch page". Let's use this name,
> since "garbage page" reads too dirty.
> When kernel faults we populate pte with that scratch page
> and let the kernel code retry.
> To implement it the page_fault_oops() can have a callback
> into bpf/arena helper similar to kfence_handle_page_fault.
> If fault address is in arena, do kfence_unprotect()-like.
Interesting idea. So I guess this page remains mapped once the kernel
faults on it. We can still reset the PTE to NULL if we alloc and then
free a page at the same address, so it's just a drop-in to prevent
further faults inside the kernel, since emulating instructions is ugly
and we're not using asm wrappers with fixup labels. If we end up
allocating and freeing something at the same address, it will likely
get reset to NULL (which would be ideal). Even if this happens in
parallel, we may fault again and will just fix up the NULL PTE with
the scratch page again. We can likely also preserve fault reporting
into streams when such scratch pages are brought in.
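To make the lifecycle concrete, here is a minimal userspace C model of
the single-PTE state machine sketched above. The enum values and
function names are made up for illustration; the real code would
operate on pte_t with try_cmpxchg() and live in the arena/fault paths:

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in PTE values; the real code would operate on pte_t. */
enum { PTE_NONE = 0, PTE_SCRATCH = 1, PTE_VALID = 2 };

static _Atomic int pte = PTE_NONE;

/* Fault-handler fixup: install the scratch page only over an empty
 * entry; never clobber a valid page that raced in before us. */
static void fixup_fault(void)
{
	int expected = PTE_NONE;

	atomic_compare_exchange_strong(&pte, &expected, PTE_SCRATCH);
}

/* Arena allocator: installing a real page overwrites scratch or none. */
static void arena_alloc(void)
{
	atomic_store(&pte, PTE_VALID);
}

/* Freeing resets the entry to none, re-arming fault reporting. */
static void arena_free(void)
{
	atomic_store(&pte, PTE_NONE);
}
```

The point of the model is that an alloc/free cycle naturally drops the
scratch mapping, and the next kernel-side fault simply reinstalls it.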
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 3:48 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 4:24 ` Alexei Starovoitov
2026-05-12 12:29 ` Emil Tsalapatis
0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2026-05-12 4:24 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi
Cc: Emil Tsalapatis, Tejun Heo, Alexei Starovoitov, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, LKML
On Mon, May 11, 2026 at 8:49 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Tue, 12 May 2026 at 05:25, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
> > >
> > > If not, the best course to me seems to be to make the flag behavior
> > > default, and just rely on ASan (and Rust in the future) to prevent any
> > > memory safety issues, and drop the stream based feedback on fault,
> > > etc.
> >
> > Agree that this needs to be new default without new uapi flags.
> > How about we tweak the idea further.
> > Let all arena pages be unmapped initially. bpf progs will fault
> > on them and will be reported via bpf_streams.
> > But we also prepare one "scratch page". Let's use this name,
> > since "garbage page" reads too dirty.
> > When kernel faults we populate pte with that scratch page
> > and let the kernel code retry.
> > To implement it the page_fault_oops() can have a callback
> > into bpf/arena helper similar to kfence_handle_page_fault.
> > If fault address is in arena, do kfence_unprotect()-like.
>
> Interesting idea. So I guess this page remains mapped once kernel
> faults on it. I guess we can still reset it to NULL if we alloc and
> free a page at the same address, so it's just a drop-in to prevent
> further faults inside the kernel, since emulating instructions is ugly
> and we're not using asm wrappers that have fixup labels etc. If we end
> up allocating and freeing something at the same address it will likely
> get reset to NULL (that would be ideal). But even if this happens in
> parallel we may fault again and then will just fix up the NULL pte
> with scratch page again. We can likely also preserve fault reporting
> into streams when such scratch pages are brought in.
Yep. All makes sense.
The hope is that faults from kfuncs should be rare
compared to faults from regular arena bugs.
So the stuck scratch page shouldn't happen often and
faults on unmapped will still be seen most of the time.
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 4:24 ` Alexei Starovoitov
@ 2026-05-12 12:29 ` Emil Tsalapatis
2026-05-12 14:07 ` Kumar Kartikeya Dwivedi
0 siblings, 1 reply; 11+ messages in thread
From: Emil Tsalapatis @ 2026-05-12 12:29 UTC (permalink / raw)
To: Alexei Starovoitov, Kumar Kartikeya Dwivedi
Cc: Emil Tsalapatis, Tejun Heo, Alexei Starovoitov, Eduard Zingerman,
Andrii Nakryiko, David Vernet, Andrea Righi, Changwoo Min, bpf,
sched-ext, LKML
On Tue May 12, 2026 at 12:24 AM EDT, Alexei Starovoitov wrote:
> On Mon, May 11, 2026 at 8:49 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
>>
>> On Tue, 12 May 2026 at 05:25, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> >
>> > On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
>> > >
>> > > If not, the best course to me seems to be to make the flag behavior
>> > > default, and just rely on ASan (and Rust in the future) to prevent any
>> > > memory safety issues, and drop the stream based feedback on fault,
>> > > etc.
>> >
>> > Agree that this needs to be new default without new uapi flags.
>> > How about we tweak the idea further.
>> > Let all arena pages be unmapped initially. bpf progs will fault
>> > on them and will be reported via bpf_streams.
>> > But we also prepare one "scratch page". Let's use this name,
>> > since "garbage page" reads too dirty.
>> > When kernel faults we populate pte with that scratch page
>> > and let the kernel code retry.
>> > To implement it the page_fault_oops() can have a callback
>> > into bpf/arena helper similar to kfence_handle_page_fault.
>> > If fault address is in arena, do kfence_unprotect()-like.
>>
>> Interesting idea. So I guess this page remains mapped once kernel
>> faults on it. I guess we can still reset it to NULL if we alloc and
>> free a page at the same address, so it's just a drop-in to prevent
>> further faults inside the kernel, since emulating instructions is ugly
>> and we're not using asm wrappers that have fixup labels etc. If we end
>> up allocating and freeing something at the same address it will likely
>> get reset to NULL (that would be ideal). But even if this happens in
>> parallel we may fault again and then will just fix up the NULL pte
>> with scratch page again. We can likely also preserve fault reporting
>> into streams when such scratch pages are brought in.
>
> Yep. All makes sense.
> The hope is that faults from kfuncs should be rare
> compared to faults from regular arena bugs.
> So the stuck scratch page shouldn't happen often and
> faults on unmapped will still be seen most of the time.
This sounds great; it retains pretty much all the arena behavior we
care about. The most important part is that it reliably reports the
first memory access error, which even now is the only one that is
meaningful. The delta from current behavior is that subsequent accesses
are not caught, but we don't care about those because they are very
likely caused by reading zeros during the initial buggy access.
Would the scratch page actually be mapped into the arena radix tree, or
just the PTE? Because if it isn't, then I think we don't even need to
worry about resetting it from the arena side. Just allocating at that
address at a later time will overwrite the scratch page PTE with a new
valid page. Until then, accesses to the address hit the scratch page,
but again we only care about the first buggy access.
Small nit: Maybe default page instead of scratch page? Scratch page
sounds a bit like scratch space but we don't actually use the page to
store any data.
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 12:29 ` Emil Tsalapatis
@ 2026-05-12 14:07 ` Kumar Kartikeya Dwivedi
2026-05-12 15:59 ` Emil Tsalapatis
0 siblings, 1 reply; 11+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2026-05-12 14:07 UTC (permalink / raw)
To: Emil Tsalapatis, Alexei Starovoitov, Kumar Kartikeya Dwivedi
Cc: Tejun Heo, Alexei Starovoitov, Eduard Zingerman, Andrii Nakryiko,
David Vernet, Andrea Righi, Changwoo Min, bpf, sched-ext, LKML
On Tue May 12, 2026 at 2:29 PM CEST, Emil Tsalapatis wrote:
> On Tue May 12, 2026 at 12:24 AM EDT, Alexei Starovoitov wrote:
>> On Mon, May 11, 2026 at 8:49 PM Kumar Kartikeya Dwivedi
>> <memxor@gmail.com> wrote:
>>>
>>> On Tue, 12 May 2026 at 05:25, Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>> >
>>> > On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
>>> > >
>>> > > If not, the best course to me seems to be to make the flag behavior
>>> > > default, and just rely on ASan (and Rust in the future) to prevent any
>>> > > memory safety issues, and drop the stream based feedback on fault,
>>> > > etc.
>>> >
>>> > Agree that this needs to be new default without new uapi flags.
>>> > How about we tweak the idea further.
>>> > Let all arena pages be unmapped initially. bpf progs will fault
>>> > on them and will be reported via bpf_streams.
>>> > But we also prepare one "scratch page". Let's use this name,
>>> > since "garbage page" reads too dirty.
>>> > When kernel faults we populate pte with that scratch page
>>> > and let the kernel code retry.
>>> > To implement it the page_fault_oops() can have a callback
>>> > into bpf/arena helper similar to kfence_handle_page_fault.
>>> > If fault address is in arena, do kfence_unprotect()-like.
>>>
>>> Interesting idea. So I guess this page remains mapped once kernel
>>> faults on it. I guess we can still reset it to NULL if we alloc and
>>> free a page at the same address, so it's just a drop-in to prevent
>>> further faults inside the kernel, since emulating instructions is ugly
>>> and we're not using asm wrappers that have fixup labels etc. If we end
>>> up allocating and freeing something at the same address it will likely
>>> get reset to NULL (that would be ideal). But even if this happens in
>>> parallel we may fault again and then will just fix up the NULL pte
>>> with scratch page again. We can likely also preserve fault reporting
>>> into streams when such scratch pages are brought in.
>>
>> Yep. All makes sense.
>> The hope is that faults from kfuncs should be rare
>> compared to faults from regular arena bugs.
>> So the stuck scratch page shouldn't happen often and
>> faults on unmapped will still be seen most of the time.
>
> This sounds great, it pretty much retains all arena behavior that we
> care about. The most important part is that it reliably reports the
> first memory access error, which even now is the only one that is
> meaningful. The delta with current behavior is that subsequent accesses
> are not caught, but we don't care about those because they are very
> likely caused by reading zeros during the initial buggy access.
>
> Would the scratch page be actually mapped into the arena radix tree, or
> just the pte? Because if it doesn't then I think we don't even need to
Just the PTE.
> worry about resetting it from the arena side. Just allocating it at
> a later time will overwrite the scratch page PTE with new valid page,
Which is fine IMO, and how it should be. An alloc-and-free cycle sets it to
NULL, so be it. Users can also do it in parallel; that case will just cause a
fault in the kernel again and we'll reset the PTE to the scratch page again.
> Until then the page is accessing the scratch page, but again we only
> care about the first buggy access.
Right.
>
> Small nit: Maybe default page instead of scratch page? Scratch page
> sounds a bit like scratch space but we don't actually use the page to
> store any data.
It likely should also be zeroed out, to preserve the idea that reading
'faulting' regions returns zeroes. Let's just go with the scratch page term.
I think the main idea is that we install a page fault handler after the KCSAN
one. From the fault handler, we use bpf_prog_find_from_stack() to obtain the
first program in the stack trace, which will be the one originating the fault
inside the kernel. Then we make sure the faulting address lies in
prog->aux->arena (likely including guard pages in its range), install the PTE
for the zeroed-out scratch page at that point, and continue.
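As a sketch of just the address check (the struct and field names below
are hypothetical stand-ins, not the real bpf_arena layout):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical arena range descriptor; field names are made up and do
 * not match the real struct bpf_arena. */
struct arena_range {
	uint64_t start;	/* first byte of the arena mapping */
	uint64_t end;	/* one past the last byte, guard pages included */
};

/* Only fix up faults whose address falls inside the faulting prog's
 * arena range; anything else is a genuine kernel bug and must oops. */
static bool fault_addr_in_arena(const struct arena_range *a, uint64_t addr)
{
	return addr >= a->start && addr < a->end;
}
```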
I thought about various races; to me it seems it should be OK. If a parallel
installation wins over us, it either installed a valid page replacing the
scratch PTE, at which point we just let the kernel retry, or installed a
scratch page. If it races and replaces an existing scratch or valid page with
NULL after we checked, we fault again and retry. In either case, the kernel
continues or ends up faulting again, at which point we can handle the fault
once more and attempt to fix it up.
We likely need to make sure the existing entry is pte_none() and only install
if it is; otherwise leave things as is. If a racing attempt unmaps and sets a
scratch or valid page to none, we will fault again and reinstall. If a racing
attempt installs a scratch page or valid page, we let it be. More importantly,
we shouldn't install the scratch page over a valid page, I think.
Our PTE installation likely takes the form try_cmpxchg(pte, NULL, scratch_page).
One corner case is that we may have cached scratch page TLB translations for a
range we are trying to alloc pages over. Typically, the way to eliminate stale
TLB entries would be to just do flush_tlb_kernel_range(). In this case I wonder
whether we can just skip it to avoid the cost and let the stale TLB entry stay,
since it likely came from the program passing faultable memory into the kernel.
That said, a cheaper fix would be to install PTEs under the lock not with
WRITE_ONCE() but with xchg(), so that we can inspect whether we overwrote an
entry that had the scratch page and only do the extra TLB flush in that case.
I would be fine with either option (leaving it as is, or the above), as long
as we document it somewhere (either in the commit log or a comment in the
code), just so we don't forget.
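A userspace model of that xchg()-based variant (names are invented for
the sketch; atomic_exchange stands in for the kernel's xchg(), and the
returned flag marks where a flush_tlb_kernel_range() call would go):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in PTE values, same convention as the earlier sketch. */
enum { PTE_NONE = 0, PTE_SCRATCH = 1, PTE_VALID = 2 };

static _Atomic int pte = PTE_NONE;

/* Allocator-side install: swap in the new page with an exchange and
 * report whether the old entry was the scratch page; only in that case
 * may stale TLB translations exist, making a range flush necessary. */
static bool install_page_needs_flush(int new_page)
{
	int old = atomic_exchange(&pte, new_page);

	return old == PTE_SCRATCH;
}
```

Replacing a none entry swaps in the page with no flush; replacing a
scratch entry additionally signals that the range flush is required.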
The main question is, what are the next steps? Do you want to take a stab at
implementing this?
* Re: [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access
2026-05-12 14:07 ` Kumar Kartikeya Dwivedi
@ 2026-05-12 15:59 ` Emil Tsalapatis
0 siblings, 0 replies; 11+ messages in thread
From: Emil Tsalapatis @ 2026-05-12 15:59 UTC (permalink / raw)
To: Kumar Kartikeya Dwivedi, Emil Tsalapatis, Alexei Starovoitov
Cc: Tejun Heo, Alexei Starovoitov, Eduard Zingerman, Andrii Nakryiko,
David Vernet, Andrea Righi, Changwoo Min, bpf, sched-ext, LKML
On Tue May 12, 2026 at 10:07 AM EDT, Kumar Kartikeya Dwivedi wrote:
> On Tue May 12, 2026 at 2:29 PM CEST, Emil Tsalapatis wrote:
>> On Tue May 12, 2026 at 12:24 AM EDT, Alexei Starovoitov wrote:
>>> On Mon, May 11, 2026 at 8:49 PM Kumar Kartikeya Dwivedi
>>> <memxor@gmail.com> wrote:
>>>>
>>>> On Tue, 12 May 2026 at 05:25, Alexei Starovoitov
>>>> <alexei.starovoitov@gmail.com> wrote:
>>>> >
>>>> > On Mon May 11, 2026 at 7:43 PM PDT, Kumar Kartikeya Dwivedi wrote:
>>>> > >
>>>> > > If not, the best course to me seems to be to make the flag behavior
>>>> > > default, and just rely on ASan (and Rust in the future) to prevent any
>>>> > > memory safety issues, and drop the stream based feedback on fault,
>>>> > > etc.
>>>> >
>>>> > Agree that this needs to be new default without new uapi flags.
>>>> > How about we tweak the idea further.
>>>> > Let all arena pages be unmapped initially. bpf progs will fault
>>>> > on them and will be reported via bpf_streams.
>>>> > But we also prepare one "scratch page". Let's use this name,
>>>> > since "garbage page" reads too dirty.
>>>> > When kernel faults we populate pte with that scratch page
>>>> > and let the kernel code retry.
>>>> > To implement it the page_fault_oops() can have a callback
>>>> > into bpf/arena helper similar to kfence_handle_page_fault.
>>>> > If fault address is in arena, do kfence_unprotect()-like.
>>>>
>>>> Interesting idea. So I guess this page remains mapped once kernel
>>>> faults on it. I guess we can still reset it to NULL if we alloc and
>>>> free a page at the same address, so it's just a drop-in to prevent
>>>> further faults inside the kernel, since emulating instructions is ugly
>>>> and we're not using asm wrappers that have fixup labels etc. If we end
>>>> up allocating and freeing something at the same address it will likely
>>>> get reset to NULL (that would be ideal). But even if this happens in
>>>> parallel we may fault again and then will just fix up the NULL pte
>>>> with scratch page again. We can likely also preserve fault reporting
>>>> into streams when such scratch pages are brought in.
>>>
>>> Yep. All makes sense.
>>> The hope is that faults from kfuncs should be rare
>>> compared to faults from regular arena bugs.
>>> So the stuck scratch page shouldn't happen often and
>>> faults on unmapped will still be seen most of the time.
>>
>> This sounds great, it pretty much retains all arena behavior that we
>> care about. The most important part is that it reliably reports the
>> first memory access error, which even now is the only one that is
>> meaningful. The delta with current behavior is that subsequent accesses
>> are not caught, but we don't care about those because they are very
>> likely caused by reading zeros during the initial buggy access.
>>
>> Would the scratch page be actually mapped into the arena radix tree, or
>> just the pte? Because if it doesn't then I think we don't even need to
>
> Just the PTE.
>
>> worry about resetting it from the arena side. Just allocating it at
>> a later time will overwrite the scratch page PTE with new valid page,
>
> Which is fine IMO, and how it should be. Alloc and free cycle sets it to NULL,
> so be it. Users can also do it in parallel, that case will just cause a fault in
> the kernel again and we'll reset the PTE to the scratch page again.
Yeah, this is why this solution does not interfere with any BPF arena code.
The allocator does not need to know about the scratch PTEs at all; it can
just allocate over them, which automatically turns the address valid.
>
>> Until then the page is accessing the scratch page, but again we only
>> care about the first buggy access.
>
> Right.
>
>>
>> Small nit: Maybe default page instead of scratch page? Scratch page
>> sounds a bit like scratch space but we don't actually use the page to
>> store any data.
>
> It likely should also be zeroed out, to preserve the idea that reading
> 'faulting' regions returns zeroes. Let's just go with scratch page term.
>
> I think the main idea is we install a page fault handler after the KCSAN one,
> from the fault handler, use bpf_prog_find_from_stack() to obtain the first
> program in the stack trace, which will be the one originating the fault inside
> the kernel. Then make sure the faulting address lies in the prog->aux->arena,
> (likely including guard pages in its range), and just install the PTE for the
> zeroed out scratch page at that point and continue.
>
> I thought about various races, to me it seems it should be ok. If parallel
> installation wins over us, it either installed a valid page replacing scratch
> PTE, at which point we just let the kernel retry, or installed a scratch page.
> If it races and replaces existing scratch or valid page with NULL after we
> checked, we fault again and retry. In any case, either the kernel continues or
> it ends up faulting again, at which point we can handle the fault again and
> attempt to fix it up.
>
> We likely need to make sure the existing thing is pte_none() only install if
> pte_none(), otherwise leave things as is. If racy attempts unmap and set scratch
> or valid page to none, we will fault again and reinstall. If racy attempts
> install scratch page or valid page, we let it be as is. More importantly we
> shouldn't install scratch page over a valid page, I think.
>
> Our PTE installation likely takes the form try_cmpxchg(pte, NULL, scratch_page).
>
> One corner case is that we may have cached scratch page TLB translations for a
> range we are trying to alloc pages over. Typically the way to eliminate stale
> TLBs would be to just do flush_tlb_kernel_range(). In this case I wonder whether
> we just skip it to avoid the cost and let the stale TLB stay, since it likely
> came due to program passing faultable memory into kernel.
>
> That said, a cheaper fix would be to install PTEs under the lock not with
> WRITE_ONCE() but xchg() so that we can inspect if we overwrote an entry that
> had scratch page and only do the extra TLB flush in that case. I would be fine
> with either option (leaving it as is, or the above), as long as we document it
> somewhere (either in the commit log or a comment in the code), just so we don't
> forget.
>
Let's skip the flush. When we hit races like that during the kfunc, we should
care more about completing the call than about the result, since the program
is already buggy.
> The main question is, what are the next steps? Do you want to take a stab at
> implementing this?
Can do, I will send a patch.
end of thread, other threads:[~2026-05-12 15:59 UTC | newest]
Thread overview: 11+ messages
[not found] <20260427105109.2554518-1-tj@kernel.org>
[not found] ` <20260427105109.2554518-6-tj@kernel.org>
2026-05-11 21:44 ` [RFC PATCH 5/9] bpf: Add bpf_prog_for_each_used_map() Kumar Kartikeya Dwivedi
[not found] ` <20260427105109.2554518-3-tj@kernel.org>
2026-05-12 0:31 ` [RFC PATCH 2/9] bpf/arena: Add BPF_F_ARENA_MAP_ALWAYS for direct kernel access Kumar Kartikeya Dwivedi
2026-05-12 2:05 ` Emil Tsalapatis
2026-05-12 2:43 ` Kumar Kartikeya Dwivedi
2026-05-12 3:25 ` Alexei Starovoitov
2026-05-12 3:48 ` Kumar Kartikeya Dwivedi
2026-05-12 4:24 ` Alexei Starovoitov
2026-05-12 12:29 ` Emil Tsalapatis
2026-05-12 14:07 ` Kumar Kartikeya Dwivedi
2026-05-12 15:59 ` Emil Tsalapatis
2026-05-12 3:42 ` Emil Tsalapatis