Re: [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: "Tejun Heo" <tj@kernel.org>
Cc: sched-ext@lists.linux.dev, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages
Date: Mon, 01 Jun 2026 18:58:02 +0000	[thread overview]
Message-ID: <20260601185802.B22171F00893@smtp.kernel.org> (raw)
In-Reply-To: <20260601183728.1800490-1-tj@kernel.org>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:

New issues:
- [Critical] Replacing the populated scratch page PTE with a newly allocated page lacks a TLB flush on architectures where `flush_tlb_before_set()` is a no-op, causing silent data corruption.
- [High] Calling `pte_page(old)` without checking `pte_present(old)` can cause an invalid pointer dereference or kernel panic.

Pre-existing issues:
- [High] Calling memory allocation and page table walking functions with interrupts disabled causes lock nesting violations and scheduling while atomic bugs.
--

commit b6f9fe34365955853d165d4a39abcf4bf12f7b70
Author: Tejun Heo <tj@kernel.org>

bpf: Replace scratch PTE atomically when allocating arena pages

This commit modifies the BPF arena allocation to replace the shared scratch
page with a newly allocated page. To support architectures like arm64, it
uses ptep_try_set() to do a valid->invalid->valid transition, providing a
new flush_tlb_before_set() hook for the break-before-make sequence. For
architectures without ptep_try_set(), it falls back to checking pte_none()
and setting the PTE directly.

> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b5739bb99fc15..4c6c4081ef715 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h

[ ... ]

> @@ -1061,6 +1061,24 @@ static inline bool ptep_try_set(pte_t *ptep, pte_t new_pte)
>  }
>  #endif
>  
> +#ifndef flush_tlb_before_set
> +/**
> + * flush_tlb_before_set - invalidate a kernel PTE's TLB before re-setting it
> + * @addr: kernel virtual address whose PTE was just cleared
> + *
> + * Some architectures (e.g. arm64) do not allow a live page-table entry to be
> + * repointed at a different page in one step. The old entry must first be made
> + * invalid and its translation flushed from every TLB, and only then may the new
> + * entry be written.
> + *
> + * This is only for the lockless atomic kernel-PTE installers (ptep_try_set()).
> + * It must be callable with interrupts disabled.
> + */
> +static inline void flush_tlb_before_set(unsigned long addr)
> +{
> +}

[Severity: Critical]
Since the generic implementation of flush_tlb_before_set() is an empty
function, does this leave a stale TLB entry on architectures like x86?

When the scratch page is replaced, ptep_get_and_clear() clears the PTE, and
then ptep_try_set() installs the newly allocated page. If the architecture
uses the generic fallback for flush_tlb_before_set(), there is no TLB flush
before the new entry is written, and the allocation loop doesn't seem to
perform a deferred flush either.

Could this result in BPF programs continuing to read and write the shared
scratch page instead of the new memory?

> +#endif
> +
>  #ifndef wrprotect_ptes
>  /**
>   * wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index a641b19255aad..799fb49768e69 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -154,19 +155,44 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
>  {
>  	struct apply_range_data *d = data;
>  	struct page *page;
> +	pte_t pteval;
>  
>  	if (!data)
>  		return 0;
>  
>  	page = d->pages[d->i];
>  	/* paranoia, similar to vmap_pages_pte_range() */
>  	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
>  		return -EINVAL;
>  
> -	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
> +	pteval = mk_pte(page, PAGE_KERNEL);
> +#ifdef ptep_try_set
> +	/*
> +	 * Kernel-fault recovery may have installed the scratch page here, and
> +	 * some architectures (arm64) prohibit valid->valid PTE transitions.
> +	 * Install atomically into a none slot. If scratch is present, clear it
> +	 * and flush_tlb_before_set() (break-before-make) before retrying.
> +	 */
> +	while (!ptep_try_set(pte, pteval)) {
> +		pte_t old = ptep_get(pte);
> +
> +		if (pte_none(old))
> +			continue;
> +		if (WARN_ON_ONCE(pte_page(old) != d->scratch_page))

[Severity: High]
Can calling pte_page(old) here without first checking pte_present(old)
cause an invalid pointer dereference?

The function pte_none() only guarantees the PTE is not completely empty, but
it doesn't ensure it is present. If the PTE contains a non-present entry
like a swap entry or a poison marker, pte_page() will interpret the swap
metadata as a physical page frame number.

Additionally, on architectures like x86 with CONFIG_DEBUG_VM enabled,
pte_pfn() asserts pte_present(), which would trigger a kernel panic. Should
this check pte_present(old) before extracting the page?

> +			return -EBUSY;
> +		ptep_get_and_clear(&init_mm, addr, pte);
> +		flush_tlb_before_set(addr);
> +	}

[ ... ]

> @@ -475,7 +501,8 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
>  	if (ret)
>  		goto out_sigsegv_memcg;
>  
> -	struct apply_range_data data = { .pages = &page, .i = 0 };
> +	struct apply_range_data data = { .pages = &page, .i = 0,
> +					 .scratch_page = arena->scratch_page };
>  	/* Account into memcg of the process that created bpf_arena */
>  	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);

[Severity: High]
This is a pre-existing issue, but does calling bpf_map_alloc_pages() here
with interrupts disabled cause lock nesting violations or scheduling while
atomic bugs?

Looking at arena_vm_fault(), raw_res_spin_lock_irqsave() is called earlier,
disabling interrupts. Then bpf_map_alloc_pages() is called, followed by
apply_to_page_range(). The apply_to_page_range() function can dynamically
allocate page tables using GFP_KERNEL (which can sleep) and acquires the
page table lock, which is a sleepable rt_mutex on PREEMPT_RT kernels.

A similar pattern exists in arena_alloc_pages(), where
raw_res_spin_lock_irqsave() is taken before the while loop that calls
apply_to_page_range(). Is it safe to call these sleepable functions inside
an IRQ-disabled critical section?

>  	if (ret) {
>  		range_tree_set(&arena->rt, vmf->pgoff, 1);
>  		goto out_sigsegv_memcg;
>  	}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260601183728.1800490-1-tj@kernel.org?part=1

next prev parent reply	other threads:[~2026-06-01 18:58 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-01 18:37 [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages Tejun Heo
2026-06-01 18:58 ` sashiko-bot [this message]
2026-06-01 19:19   ` Tejun Heo
2026-06-01 20:15 ` bot+bpf-ci
2026-06-02 22:09   ` Tejun Heo
2026-06-06 16:06     ` Catalin Marinas
2026-06-02 22:29 ` Kumar Kartikeya Dwivedi
2026-06-05 15:30 ` patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260601185802.B22171F00893@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    --cc=sched-ext@lists.linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.