From: Tejun Heo <tj@kernel.org>
To: void@manifault.com, arighi@nvidia.com, changwoo@igalia.com,
ast@kernel.org, andrii@kernel.org, daniel@iogearbox.net,
martin.lau@linux.dev, memxor@gmail.com
Cc: peterz@infradead.org, catalin.marinas@arm.com, will@kernel.org,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
dave.hansen@linux.intel.com, akpm@linux-foundation.org,
david@kernel.org, rppt@kernel.org, emil@etsalapatis.com,
sched-ext@lists.linux.dev, bpf@vger.kernel.org, x86@kernel.org,
linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Tejun Heo <tj@kernel.org>
Subject: [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages
Date: Mon, 1 Jun 2026 08:37:28 -1000 [thread overview]
Message-ID: <20260601183728.1800490-1-tj@kernel.org> (raw)
apply_range_set_cb() maps the pages for a new arena allocation and returned
-EBUSY when the target PTE was already populated. Kernel-fault recovery
leaves the per-arena scratch page in unallocated arena PTEs, so a later
bpf_arena_alloc_pages() over such a page hits that -EBUSY, and every
subsequent allocation of it fails the same way. Allocation must install the
real page over scratch instead.
Overwriting the scratch PTE in place is a valid->valid change, which arm64
forbids without break-before-make. Route through an invalid entry instead:
ptep_try_set() fills only a none slot, so the PTE goes scratch->none->page.
On finding scratch, clear it and flush_tlb_before_set() before retrying. The
new flush_tlb_before_set() is a no-op except on arches like arm64 that need
the break-before-make TLB invalidate. The loop also copes with a concurrent
fault re-scratching the slot.
Arches without ptep_try_set() never install the scratch page, so keep the
must-be-empty check and set_pte_at() for them.
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
---
arch/arm64/include/asm/pgtable.h | 11 +++++++++++
include/linux/pgtable.h | 18 ++++++++++++++++++
kernel/bpf/arena.c | 38 +++++++++++++++++++++++++++++++++-----
3 files changed, 62 insertions(+), 5 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 984f050..3ce0f2a 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1842,6 +1842,17 @@ static inline bool ptep_try_set(pte_t *ptep, pte_t new_pte)
}
#define ptep_try_set ptep_try_set
+/*
+ * arm64 mandates break-before-make: a cleared kernel PTE must have its TLB
+ * invalidated before a different page is installed in its place. The broadcast
+ * TLBI is an instruction, not an IPI, so this is safe with interrupts disabled.
+ */
+static inline void flush_tlb_before_set(unsigned long addr)
+{
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+}
+#define flush_tlb_before_set flush_tlb_before_set
+
#define test_and_clear_young_ptes test_and_clear_young_ptes
static inline bool test_and_clear_young_ptes(struct vm_area_struct *vma,
unsigned long addr, pte_t *ptep, unsigned int nr)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b5739bb..4c6c408 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1061,6 +1061,24 @@ static inline bool ptep_try_set(pte_t *ptep, pte_t new_pte)
}
#endif
+#ifndef flush_tlb_before_set
+/**
+ * flush_tlb_before_set - invalidate a kernel PTE's TLB before re-setting it
+ * @addr: kernel virtual address whose PTE was just cleared
+ *
+ * Some architectures (e.g. arm64) do not allow a live page-table entry to be
+ * repointed at a different page in one step. The old entry must first be made
+ * invalid and its translation flushed from every TLB, and only then may the new
+ * entry be written.
+ *
+ * This is only for the lockless atomic kernel-PTE installers (ptep_try_set()).
+ * It must be callable with interrupts disabled.
+ */
+static inline void flush_tlb_before_set(unsigned long addr)
+{
+}
+#endif
+
#ifndef wrprotect_ptes
/**
* wrprotect_ptes - Write-protect PTEs that map consecutive pages of the same
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 1727503..b6ac5a9 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -142,6 +142,7 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
struct apply_range_data {
struct page **pages;
+ struct page *scratch_page;
int i;
};
@@ -154,19 +155,44 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
{
struct apply_range_data *d = data;
struct page *page;
+ pte_t pteval;
if (!data)
return 0;
- /* sanity check */
- if (unlikely(!pte_none(ptep_get(pte))))
- return -EBUSY;
page = d->pages[d->i];
/* paranoia, similar to vmap_pages_pte_range() */
if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
return -EINVAL;
- set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+ pteval = mk_pte(page, PAGE_KERNEL);
+#ifdef ptep_try_set
+ /*
+ * Kernel-fault recovery may have installed the scratch page here, and
+ * some architectures (arm64) prohibit valid->valid PTE transitions.
+ * Install atomically into a none slot. If scratch is present, clear it
+ * and flush_tlb_before_set() (break-before-make) before retrying.
+ */
+ while (!ptep_try_set(pte, pteval)) {
+ pte_t old = ptep_get(pte);
+
+ if (pte_none(old))
+ continue;
+ if (WARN_ON_ONCE(pte_page(old) != d->scratch_page))
+ return -EBUSY;
+ ptep_get_and_clear(&init_mm, addr, pte);
+ flush_tlb_before_set(addr);
+ }
+#else
+ /*
+ * Without ptep_try_set() there is no atomic installer, but such arches
+ * also do not wire up bpf_arena_handle_page_fault(), so no scratch page
+ * is ever installed and the slot is always none here.
+ */
+ if (unlikely(!pte_none(ptep_get(pte))))
+ return -EBUSY;
+ set_pte_at(&init_mm, addr, pte, pteval);
+#endif
d->i++;
return 0;
}
@@ -475,7 +501,8 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
if (ret)
goto out_sigsegv_memcg;
- struct apply_range_data data = { .pages = &page, .i = 0 };
+ struct apply_range_data data = { .pages = &page, .i = 0,
+ .scratch_page = arena->scratch_page };
/* Account into memcg of the process that created bpf_arena */
ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
if (ret) {
@@ -665,6 +692,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
return 0;
}
data.pages = pages;
+ data.scratch_page = arena->scratch_page;
if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
goto out_free_pages;
next reply other threads:[~2026-06-01 18:37 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-01 18:37 Tejun Heo [this message]
2026-06-01 18:58 ` [PATCH bpf-next] bpf: Replace scratch PTE atomically when allocating arena pages sashiko-bot
2026-06-01 19:19 ` Tejun Heo
2026-06-01 20:15 ` bot+bpf-ci
2026-06-02 22:09 ` Tejun Heo
2026-06-06 16:06 ` Catalin Marinas
2026-06-02 22:29 ` Kumar Kartikeya Dwivedi
2026-06-05 15:30 ` patchwork-bot+netdevbpf
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260601183728.1800490-1-tj@kernel.org \
--to=tj@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=andrii@kernel.org \
--cc=arighi@nvidia.com \
--cc=ast@kernel.org \
--cc=bp@alien8.de \
--cc=bpf@vger.kernel.org \
--cc=catalin.marinas@arm.com \
--cc=changwoo@igalia.com \
--cc=daniel@iogearbox.net \
--cc=dave.hansen@linux.intel.com \
--cc=david@kernel.org \
--cc=emil@etsalapatis.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=martin.lau@linux.dev \
--cc=memxor@gmail.com \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rppt@kernel.org \
--cc=sched-ext@lists.linux.dev \
--cc=tglx@kernel.org \
--cc=void@manifault.com \
--cc=will@kernel.org \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.