* [PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages
@ 2026-05-31 16:58 Tejun Heo
2026-05-31 17:08 ` sashiko-bot
2026-05-31 17:35 ` bot+bpf-ci
0 siblings, 2 replies; 3+ messages in thread
From: Tejun Heo @ 2026-05-31 16:58 UTC
To: ast, andrii, daniel, martin.lau
Cc: memxor, david, emil, bpf, sched-ext, linux-kernel
apply_range_set_cb() maps the pages for a new arena allocation. It returned
-EBUSY when the target PTE was already populated.

Kernel-fault recovery can leave the per-arena scratch page in an otherwise
unallocated arena PTE: a BPF program that touched an unallocated address
gets the scratch page installed there. A later bpf_arena_alloc_pages()
covering that page then finds the PTE populated, returns -EBUSY, and leaves
the scratch page in place. Every subsequent allocation of that page fails
the same way.

Drop the must-be-empty check so set_pte_at() installs the real page
unconditionally, overwriting a scratch (or empty) PTE.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
---
 kernel/bpf/arena.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -131,15 +131,13 @@ static int apply_range_set_cb(pte_t *pte
 	if (!data)
 		return 0;
-	/* sanity check */
-	if (unlikely(!pte_none(ptep_get(pte))))
-		return -EBUSY;
 	page = d->pages[d->i];
 	/* paranoia, similar to vmap_pages_pte_range() */
 	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
 		return -EINVAL;
+	/* May be none or the scratch page, overwrite either way */
 	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
 	d->i++;
 	return 0;
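
The stuck state described in the commit message can be illustrated with a
small user-space model. This is only a sketch, not kernel code: toy_pte_t,
the toy_* helpers, and the PFN values are hypothetical stand-ins, with a PTE
reduced to a bare PFN and 0 meaning "none". It contrasts the pre-patch
must-be-empty check with the unconditional install.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Toy model: a PTE is just a PFN, 0 means "none". Not kernel code. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_NONE    ((toy_pte_t)0)
#define TOY_SCRATCH_PFN ((toy_pte_t)0x1000) /* hypothetical scratch page */

/* Fault recovery: install the scratch page only if the slot is empty. */
static void toy_fault_install_scratch(toy_pte_t *pte)
{
	if (*pte == TOY_PTE_NONE)
		*pte = TOY_SCRATCH_PFN;
}

/* Pre-patch behavior: refuse any populated PTE, leaving it untouched. */
static int toy_set_strict(toy_pte_t *pte, toy_pte_t new_pfn)
{
	if (*pte != TOY_PTE_NONE)
		return -EBUSY;
	*pte = new_pfn;
	return 0;
}

/* Post-patch behavior: overwrite a none or scratch PTE alike. */
static int toy_set_overwrite(toy_pte_t *pte, toy_pte_t new_pfn)
{
	assert(new_pfn != TOY_PTE_NONE); /* callers always map a real page */
	*pte = new_pfn;
	return 0;
}
```

In this model, once toy_fault_install_scratch() has populated a slot, every
call to toy_set_strict() on it returns -EBUSY and the scratch PFN stays in
place, while toy_set_overwrite() replaces it. The model deliberately ignores
TLB state.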
* Re: [PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages
From: sashiko-bot @ 2026-05-31 17:08 UTC
To: Tejun Heo; +Cc: bpf, sched-ext
Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [Critical] Overwriting a present PTE (the scratch page) with a new page via set_pte_at() without first clearing it and flushing the TLB violates Break-Before-Make (BBM) architecture rules and causes silent data corruption via stale TLB translations.
--
commit f66bc91b18a1403f8777f0486173da0584f46ca3
Author: Tejun Heo <tj@kernel.org>
bpf: Overwrite scratch PTE when allocating arena pages
apply_range_set_cb() previously returned -EBUSY when encountering a
populated PTE. This patch removes that check to allow unconditionally
overwriting the scratch page left behind by kernel-fault recovery during
new arena allocations.
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index a641b19255aad..9e0b9f2170204 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -157,15 +157,13 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
>
>  	if (!data)
>  		return 0;
> -	/* sanity check */
> -	if (unlikely(!pte_none(ptep_get(pte))))
> -		return -EBUSY;
>
>  	page = d->pages[d->i];
>  	/* paranoia, similar to vmap_pages_pte_range() */
>  	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
>  		return -EINVAL;
>
> +	/* May be none or the scratch page, overwrite either way */
>  	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
[Severity: Critical]
Does unconditionally overwriting a present PTE with set_pte_at() without
first clearing it violate Break-Before-Make architecture rules?
On architectures like arm64, overwriting a valid, present PTE without a
sequence like ptep_get_and_clear() can trigger hardware TLB conflict aborts
or panics.
Additionally, since this changes the PFN of a present PTE (the scratch page),
doesn't this require a TLB flush?
Looking at the caller arena_alloc_pages(), it appears to only invoke
flush_vmap_cache() after the apply_to_page_range() loop:
	ret = apply_to_page_range(&init_mm,
			kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
			this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
	...
	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
Since flush_vmap_cache() flushes data caches and not the TLB, could CPUs
retain stale TLB mappings to the scratch page? If so, subsequent accesses
might hit the shared scratch page instead of the newly allocated page,
bypassing memory isolation.
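
The break-before-make ordering raised above can be sketched with a toy
single-entry model. Everything here is an assumption for illustration:
struct toy_mmu, the toy_* helpers, and the bbm_violation flag are
hypothetical and do not correspond to kernel or architecture APIs. The model
flags a direct valid-to-different-valid store and shows the
invalidate, flush, install sequence that avoids it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pte_t; /* 0 means "not present" */

struct toy_mmu {
	toy_pte_t pte;      /* the page-table entry in memory */
	toy_pte_t tlb;      /* one cached translation, 0 = nothing cached */
	bool bbm_violation; /* set when valid -> different valid is stored */
};

/* A CPU access walks the table only on a TLB miss. */
static toy_pte_t toy_access(struct toy_mmu *m)
{
	if (m->tlb == 0)
		m->tlb = m->pte;
	return m->tlb;
}

/* Drop the cached translation. */
static void toy_tlb_flush(struct toy_mmu *m)
{
	m->tlb = 0;
}

/* Raw store; records a direct change between two different valid entries. */
static void toy_set_pte(struct toy_mmu *m, toy_pte_t new_pte)
{
	if (m->pte != 0 && new_pte != 0 && m->pte != new_pte)
		m->bbm_violation = true;
	m->pte = new_pte;
}

/* Break-before-make: invalidate, flush, then install the new entry. */
static void toy_replace_bbm(struct toy_mmu *m, toy_pte_t new_pte)
{
	assert(new_pte != 0);
	toy_set_pte(m, 0);       /* break */
	toy_tlb_flush(m);        /* drop cached copies of the old entry */
	toy_set_pte(m, new_pte); /* make */
}
```

In this model a plain toy_set_pte() over a cached entry both sets
bbm_violation and leaves toy_access() returning the old value, whereas
toy_replace_bbm() does neither. Real hardware behavior on a violation is
architecture-specific and is not modeled here.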
>  	d->i++;
>  	return 0;
>  }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260531165852.555930-1-tj@kernel.org?part=1
* Re: [PATCH bpf-next] bpf: Overwrite scratch PTE when allocating arena pages
From: bot+bpf-ci @ 2026-05-31 17:35 UTC
To: tj, ast, andrii, daniel, martin.lau
Cc: memxor, david, emil, bpf, sched-ext, linux-kernel, ast, andrii,
daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1727503b25d8..520224dc45c9 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -157,15 +157,13 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
>
>  	if (!data)
>  		return 0;
> -	/* sanity check */
> -	if (unlikely(!pte_none(ptep_get(pte))))
> -		return -EBUSY;
>
>  	page = d->pages[d->i];
>  	/* paranoia, similar to vmap_pages_pte_range() */
>  	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
>  		return -EINVAL;
>
> +	/* May be none or the scratch page, overwrite either way */
>  	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
^^^^^^^^^^
When set_pte_at() overwrites a present PTE (the scratch page) with a
different PFN (the new real page), does the arena allocation path flush
the TLB?
The caller arena_alloc_pages() only calls flush_vmap_cache() after
apply_to_page_range(), which expands to flush_cache_vmap() (a no-op on
x86). This flushes data caches but not TLB entries.
Before this patch, apply_range_set_cb() only transitioned not-present to
present, which requires no TLB flush. But now it can transition
present(scratch) to present(real), changing the underlying PFN.
When a BPF program faults on an unallocated arena address, the
kernel-fault recovery path installs the scratch page into that PTE:
  bpf_arena_handle_page_fault()
    -> apply_range_set_scratch_cb()
      -> ptep_try_set()
The faulting CPU then retries the access, loading a TLB entry that maps
the arena kern_vm address to the scratch_page PFN. This entry becomes
stale when a later arena_alloc_pages() overwrites the PTE to point to a
real page.
The arena kern_vm region is mapped in init_mm and shared across all CPUs.
Without a TLB flush, can a CPU holding a stale entry continue
reading/writing the shared scratch page instead of the new real page?
The free path in arena_free_pages() and arena_free_worker() already calls
flush_tlb_kernel_range() after clearing arena PTEs:
  arena_free_pages():
      ptep_get_and_clear(...)
      flush_tlb_kernel_range(start, end);

  arena_free_worker():
      apply_to_page_range(...)
      flush_tlb_kernel_range(...)
Does the alloc path need a similar flush when overwriting a present PTE?
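
The cross-CPU staleness in question can be sketched with a toy model of a
shared page table and per-CPU translation caches. This is an illustration
under stated assumptions, not a proposed fix: struct toy_arena, the toy_*
helpers, and the flush_if_overwrote switch are hypothetical, and
toy_flush_range() only stands in for a kernel-range TLB flush.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_CPUS  2
#define TOY_NR_PAGES 4

typedef uint64_t toy_pte_t; /* 0 means "not present" */

struct toy_arena {
	toy_pte_t pte[TOY_NR_PAGES];              /* shared page table */
	toy_pte_t tlb[TOY_NR_CPUS][TOY_NR_PAGES]; /* per-CPU cached entries */
};

/* A CPU walks the shared table only when its own cache misses. */
static toy_pte_t toy_cpu_access(struct toy_arena *a, int cpu, size_t idx)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS && idx < TOY_NR_PAGES);
	if (a->tlb[cpu][idx] == 0)
		a->tlb[cpu][idx] = a->pte[idx];
	return a->tlb[cpu][idx];
}

/* Stand-in for a kernel-range TLB flush: drop the range on every CPU. */
static void toy_flush_range(struct toy_arena *a, size_t start, size_t n)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		for (size_t i = start; i < start + n; i++)
			a->tlb[cpu][i] = 0;
}

/*
 * Map n pages at start, overwriting whatever is there. When
 * flush_if_overwrote is set, flush only if a present entry was replaced.
 * Returns true if a flush was issued.
 */
static bool toy_alloc_pages(struct toy_arena *a, size_t start, size_t n,
			    const toy_pte_t *pfns, bool flush_if_overwrote)
{
	bool overwrote = false;

	assert(start + n <= TOY_NR_PAGES);
	for (size_t i = 0; i < n; i++) {
		if (a->pte[start + i] != 0)
			overwrote = true;
		a->pte[start + i] = pfns[i];
	}
	if (overwrote && flush_if_overwrote) {
		toy_flush_range(a, start, n);
		return true;
	}
	return false;
}
```

In this model, a CPU that cached the scratch entry before toy_alloc_pages()
keeps reading it afterwards unless the flush is issued, while a CPU that
never touched the address sees the new page immediately. Not-present to
present transitions never trigger the flush here, matching the observation
that only the present-to-present case changes a cached PFN.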
>  	d->i++;
>  	return 0;
---
> commit 066af9e73183b42ad5ee27805163ce25053ed9bf
> Author: Tejun Heo <tj@kernel.org>
>
> bpf: Overwrite scratch PTE when allocating arena pages
>
> apply_range_set_cb() maps the pages for a new arena allocation. It returned
> -EBUSY when the target PTE was already populated.
>
> Kernel-fault recovery can leave the per-arena scratch page in an otherwise
> unallocated arena PTE: a BPF program that touched an unallocated address
> gets the scratch page installed there. A later bpf_arena_alloc_pages()
> covering that page then finds the PTE populated, returns -EBUSY, and leaves
> the scratch page in place. Every subsequent allocation of that page fails
> the same way.
>
> Drop the must-be-empty check so set_pte_at() installs the real page
> unconditionally, overwriting a scratch (or empty) PTE.
>
> Suggested-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: David Hildenbrand <david@kernel.org>
This looks like a bug fix for the scratch page allocation failure.
Should this include:
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/26718992256