From: sashiko-bot@kernel.org
To: "Tejun Heo" <tj@kernel.org>
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages
Date: Mon, 01 Jun 2026 18:58:02 +0000 [thread overview]
Message-ID: <20260601185802.B22171F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260601183728.1800490-1-tj@kernel.org>
Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
New issues:
- [Critical] Replacing the populated scratch page PTE with a newly allocated page lacks a TLB flush on architectures where `flush_tlb_before_set()` is a no-op, causing silent data corruption.
- [High] Calling `pte_page(old)` without checking `pte_present(old)` can cause an invalid pointer dereference or kernel panic.
Pre-existing issues:
- [High] Calling memory allocation and page table walking functions with interrupts disabled causes lock nesting violations and scheduling while atomic bugs.
--
commit b6f9fe34365955853d165d4a39abcf4bf12f7b70
Author: Tejun Heo <tj@kernel.org>
bpf: Replace scratch PTE atomically when allocating arena pages
This commit modifies the BPF arena allocation to replace the shared scratch
page with a newly allocated page. To support architectures like arm64, it
uses ptep_try_set() to do a valid->invalid->valid transition, providing a
new flush_tlb_before_set() hook for the break-before-make sequence. For
architectures without ptep_try_set(), it falls back to checking pte_none()
and setting the PTE directly.
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5739bb99fc15..4c6c4081ef715 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
[ ... ]
> @@ -1061,6 +1061,24 @@ static inline bool ptep_try_set(pte_t *ptep, pte_t new_pte)
> }
> #endif
>
> +#ifndef flush_tlb_before_set
> +/**
> + * flush_tlb_before_set - invalidate a kernel PTE's TLB before re-setting it
> + * @addr: kernel virtual address whose PTE was just cleared
> + *
> + * Some architectures (e.g. arm64) do not allow a live page-table entry to be
> + * repointed at a different page in one step. The old entry must first be made
> + * invalid and its translation flushed from every TLB, and only then may the new
> + * entry be written.
> + *
> + * This is only for the lockless atomic kernel-PTE installers (ptep_try_set()).
> + * It must be callable with interrupts disabled.
> + */
> +static inline void flush_tlb_before_set(unsigned long addr)
> +{
> +}
[Severity: Critical]
Since the generic implementation of flush_tlb_before_set() is an empty
function, does this leave a stale TLB entry on architectures like x86?
When the scratch page is replaced, ptep_get_and_clear() clears the PTE, and
then ptep_try_set() installs the newly allocated page. If the architecture
uses the generic fallback for flush_tlb_before_set(), there is no TLB flush
before the new entry is written, and the allocation loop doesn't seem to
perform a deferred flush either.
Could this result in BPF programs continuing to read and write the shared
scratch page instead of the new memory?
> +#endif
> +
> #ifndef wrprotect_ptes
> /**
> * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index a641b19255aad..799fb49768e69 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
[ ... ]
> @@ -154,19 +155,44 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
> {
> struct apply_range_data *d = data;
> struct page *page;
> + pte_t pteval;
>
> if (!data)
> return 0;
>
> page = d->pages[d->i];
> /* paranoia, similar to vmap_pages_pte_range() */
> if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
> return -EINVAL;
>
> - set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
> + pteval = mk_pte(page, PAGE_KERNEL);
> +#ifdef ptep_try_set
> + /*
> + * Kernel-fault recovery may have installed the scratch page here, and
> + * some architectures (arm64) prohibit valid->valid PTE transitions.
> + * Install atomically into a none slot. If scratch is present, clear it
> + * and flush_tlb_before_set() (break-before-make) before retrying.
> + */
> + while (!ptep_try_set(pte, pteval)) {
> + pte_t old = ptep_get(pte);
> +
> + if (pte_none(old))
> + continue;
> + if (WARN_ON_ONCE(pte_page(old) != d->scratch_page))
[Severity: High]
Can calling pte_page(old) here without first checking pte_present(old)
cause an invalid pointer dereference?
The function pte_none() only guarantees the PTE is not completely empty, but
it doesn't ensure it is present. If the PTE contains a non-present entry
like a swap entry or a poison marker, pte_page() will interpret the swap
metadata as a physical page frame number.
Additionally, on architectures like x86 with CONFIG_DEBUG_VM enabled,
pte_pfn() asserts pte_present(), which would trigger a kernel panic. Should
this check pte_present(old) before extracting the page?
> + return -EBUSY;
> + ptep_get_and_clear(&init_mm, addr, pte);
> + flush_tlb_before_set(addr);
> + }
[ ... ]
> @@ -475,7 +501,8 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> if (ret)
> goto out_sigsegv_memcg;
>
> - struct apply_range_data data = { .pages = &page, .i = 0 };
> + struct apply_range_data data = { .pages = &page, .i = 0,
> + .scratch_page = arena->scratch_page };
> /* Account into memcg of the process that created bpf_arena */
> ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
[Severity: High]
This is a pre-existing issue, but does calling bpf_map_alloc_pages() here
with interrupts disabled cause lock nesting violations or scheduling while
atomic bugs?
Looking at arena_vm_fault(), raw_res_spin_lock_irqsave() is called earlier,
disabling interrupts. Then bpf_map_alloc_pages() is called, followed by
apply_to_page_range(). The apply_to_page_range() function can dynamically
allocate page tables using GFP_KERNEL (which can sleep) and acquires the
page table lock, which is a sleepable rt_mutex on PREEMPT_RT kernels.
A similar pattern exists in arena_alloc_pages(), where
raw_res_spin_lock_irqsave() is taken before the while loop that calls
apply_to_page_range(). Is it safe to call these sleepable functions inside
an IRQ-disabled critical section?
> if (ret) {
> range_tree_set(&arena->rt, vmf->pgoff, 1);
> goto out_sigsegv_memcg;
> }
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260601183728.1800490-1-tj@kernel.org?part=1
next prev parent reply other threads:[~2026-06-01 18:58 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-01 18:37 [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages Tejun Heo
2026-06-01 18:58 ` sashiko-bot [this message]
2026-06-01 19:19 ` Tejun Heo
[not found] ` <f45f02410d23c99d14f3577da9b27a327816b5a0904bdeefc8e229eac760d1cf@mail.kernel.org>
2026-06-02 22:09 ` Tejun Heo
2026-06-06 16:06 ` Catalin Marinas
2026-06-07 7:59 ` [PATCH bpf-next] arm64: mm: Complete the PTE store in ptep_try_set() Tejun Heo
2026-06-07 8:12 ` sashiko-bot
2026-06-07 8:38 ` bot+bpf-ci
2026-06-07 20:04 ` Tejun Heo
2026-06-07 20:31 ` Catalin Marinas
2026-06-08 7:25 ` [PATCH v2 " Tejun Heo
2026-06-08 8:43 ` Catalin Marinas
2026-06-08 12:00 ` patchwork-bot+netdevbpf
2026-06-02 22:29 ` [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages Kumar Kartikeya Dwivedi
2026-06-05 15:30 ` patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260601185802.B22171F00893@smtp.kernel.org \
--to=sashiko-bot@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=sashiko-reviews@lists.linux.dev \
--cc=sched-ext@lists.linux.dev \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox