* [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
  - Import tlbflush.h to fix a build issue on loongarch. (kernel
    test robot)
  - Fix an unused variable error in apply_range_clear_cb(). (kernel
    test robot)
  - Call bpf_map_area_free() on the error path of
    populate_pgtable_except_pte(). (AI)
  - Use PAGE_SIZE in apply_to_existing_page_range(). (AI)
Patch 2:
  - Cap the allocation made by kmalloc_nolock() for the pages array at
    KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop to
    overcome this limit. (AI)
Patch 3:
  - Do page_ref_add(page, 1) under the spinlock to mitigate a
    race. (AI)
Patch 4:
  - Add a new test case, big_alloc3() in verifier_arena_large.c, that
    tries to allocate a large number of pages at once; this triggers
    the kmalloc_nolock() limit from Patch 2 and checks that the loop
    logic works correctly.
This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:

The range_tree is now protected by a rqspinlock instead of a mutex;
this change alone is enough to make bpf_arena_reserve_pages() safe to
call from any context.
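
To illustrate, the reserve path ends up with roughly the following
shape (condensed from patch 3 below; uaddr validation elided). Note
that, unlike a mutex, raw_res_spin_lock_irqsave() can fail instead of
blocking, so callers must handle that:

	static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
	{
		long pgoff = compute_pgoff(arena, uaddr);
		unsigned long flags;
		int ret;

		/* rqspinlock can fail (deadlock/timeout detection) instead of sleeping */
		if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
			return -EBUSY;

		/* Cannot guard already allocated pages. */
		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
		if (ret) {
			ret = -EBUSY;
			goto out;
		}

		/* "Allocate" the region to prevent it from being allocated. */
		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
	out:
		raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
		return ret;
	}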
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into the
bpf arena's vm_area. Now, at arena creation time, all page table levels
except the last one are populated; when new pages need to be inserted,
apply_to_page_range() is called again, which only does set_pte_at() for
those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any context safe.
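
In code, the only difference between the two phases is the data pointer
handed to apply_to_page_range() (condensed from patch 1 below;
apply_range_set_cb() treats a NULL data pointer as "build page tables
only, write no PTEs"):

	/* at arena creation: allocate all pgd..pmd levels, write no PTEs */
	apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
			    KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);

	/* at alloc time: page tables already exist, only set_pte_at() runs */
	struct apply_range_data data = { .pages = pages, .i = 0 };

	apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
			    page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);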
bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. Flush the TLB: already done as part of step 2.
4. zap_pages(): to unmap pages from user page tables
5. free pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged, except for the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
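
The resulting callback is deliberately minimal (condensed from patch 3
below): it only clears the PTE and stashes the page on a lock-less
list; the TLB flush and the actual __free_page() happen later, outside
the lock:

	static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
	{
		pte_t old_pte = ptep_get(pte);
		struct page *page;

		if (pte_none(old_pte) || !pte_present(old_pte))
			return 0; /* nothing to do */

		page = pte_page(old_pte);
		pte_clear(&init_mm, addr, pte);

		/* collect the page; it is freed after the deferred TLB flush */
		if (free_pages)
			__llist_add(&page->pcp_llist, free_pages);
		return 0;
	}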
When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, a lock-less list of
struct arena_free_span is used to queue the uaddr and page count.
kmalloc_nolock() is used to allocate the arena_free_span; this allocation
can fail, but that is the trade-off we make for frees done from
non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddrs and page counts.
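
The deferred path itself boils down to a short queue-and-kick sequence
(condensed from patch 3 below; the irq_work handler only calls
schedule_work()):

	struct arena_free_span {
		struct llist_node node;
		unsigned long uaddr;
		u32 page_cnt;
	};

	/* non-sleepable free: record the span and let the worker do the rest */
	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
	if (!s)
		return; /* the trade-off mentioned above: this free is dropped */

	s->page_cnt = page_cnt;
	s->uaddr = uaddr;
	llist_add(&s->node, &arena->free_spans);
	irq_work_queue(&arena->free_irq);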
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used to link them.
NOTE: The arena list selftest fails to load on s390x. This is due to an
unrelated verifier bug that is exposed by the selftest added in this
set. I have already sent a patch[1] to fix it.
[1] https://lore.kernel.org/all/20251111160949.45623-1-puranjay@kernel.org/
Puranjay Mohan (4):
bpf: arena: populate vm_area without allocating memory
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
bpf: arena: make arena kfuncs any context safe
selftests: bpf: test non-sleepable arena allocations
include/linux/bpf.h | 2 +
kernel/bpf/arena.c | 350 +++++++++++++++---
kernel/bpf/verifier.c | 5 +
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 +
.../selftests/bpf/progs/verifier_arena.c | 185 +++++++++
.../bpf/progs/verifier_arena_large.c | 24 ++
7 files changed, 541 insertions(+), 56 deletions(-)
--
2.47.1

* [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc
non-sleepable, change bpf arena to populate pages without allocating
memory:

- at arena creation time, populate all page table levels except the
  last level
- when new pages need to be inserted, call apply_to_page_range() again
  with apply_range_set_cb(), which will only set_pte_at() those pages
  and will not allocate memory
- when freeing pages, call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the pte for the page to be removed.
  This doesn't free intermediate page table levels.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 76 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 1074ac4459f2..48b8ffba3c88 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -7,6 +7,7 @@
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <asm/tlbflush.h>
 #include "range_tree.h"

 /*
@@ -92,6 +93,62 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
 }

+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i++];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	return 0;
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	/* get page and free it */
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(&init_mm, addr, pte);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	__free_page(page);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 {
 	struct vm_struct *kern_vm;
@@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	err = populate_pgtable_except_pte(arena);
+	if (err) {
+		bpf_map_area_free(arena);
+		goto err;
+	}

 	return &arena->map;
 err:
@@ -286,6 +348,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return VM_FAULT_SIGSEGV;

+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
@@ -293,7 +356,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 	}

-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		__free_page(page);
@@ -428,7 +491,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct page **pages = NULL;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -465,6 +528,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;

+	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
@@ -477,8 +541,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
 	 * lower 32-bit and it's ok.
 	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
+	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
 	if (ret) {
 		for (i = 0; i < page_cnt; i++)
 			__free_page(pages[i]);
@@ -545,8 +609,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 		 * page_cnt is big it's faster to do the batched zap.
 		 */
 		zap_pages(arena, full_uaddr, 1);
-	vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
-	__free_page(page);
+	apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
+				     NULL);
 	}
 }
--
2.47.1

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: bot+bpf-ci @ 2025-11-14 11:47 UTC
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1074ac445..48b8ffba3 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>  		goto err;
>  	}
>  	mutex_init(&arena->lock);
> +	err = populate_pgtable_except_pte(arena);
> +	if (err) {
> +		bpf_map_area_free(arena);
> +		goto err;
> +	}
        ^^^^

Can this leak the range tree? Looking at the code, range_tree_set() was
called earlier in this function and can allocate range_node structures
via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
fails here, the error path calls bpf_map_area_free(arena) but doesn't call
range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
always calls range_tree_destroy() before freeing the arena.

>
>  	return &arena->map;
>  err:

[ ... ]

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-14 14:57 UTC
To: bot+bpf-ci, bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team,
	yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

>> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
>> index 1074ac445..48b8ffba3 100644
>> --- a/kernel/bpf/arena.c
>> +++ b/kernel/bpf/arena.c
>
> [ ... ]
>
>> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>>  		goto err;
>>  	}
>>  	mutex_init(&arena->lock);
>> +	err = populate_pgtable_except_pte(arena);
>> +	if (err) {
>> +		bpf_map_area_free(arena);
>> +		goto err;
>> +	}
> ^^^^
>
> Can this leak the range tree? Looking at the code, range_tree_set() was
> called earlier in this function and can allocate range_node structures
> via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> always calls range_tree_destroy() before freeing the arena.

As the range tree is empty at this point, we can be sure that
range_tree_clear() in range_tree_set() will not allocate anything.

> [ ... ]

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Alexei Starovoitov @ 2025-11-14 21:21 UTC
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> bot+bpf-ci@kernel.org writes:
>
> [ ... ]
>
> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> >>  		goto err;
> >>  	}
> >>  	mutex_init(&arena->lock);
> >> +	err = populate_pgtable_except_pte(arena);
> >> +	if (err) {
> >> +		bpf_map_area_free(arena);
> >> +		goto err;
> >> +	}
> > ^^^^
> >
> > Can this leak the range tree? Looking at the code, range_tree_set() was
> > called earlier in this function and can allocate range_node structures
> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> > always calls range_tree_destroy() before freeing the arena.
>
> As the range tree is empty at this point, we can be sure that
> range_tree_clear() in range_tree_set() will not allocate anything.

range_tree_clear() won't clear anything, but AI pointed in
the right direction.
Look at what range_tree_set() does. It will allocate for sure.

pw-bot: cr

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-15  0:52 UTC
To: Alexei Starovoitov
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> bot+bpf-ci@kernel.org writes:
>>
>> [ ... ]
>>
>> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>> >>  		goto err;
>> >>  	}
>> >>  	mutex_init(&arena->lock);
>> >> +	err = populate_pgtable_except_pte(arena);
>> >> +	if (err) {
>> >> +		bpf_map_area_free(arena);
>> >> +		goto err;
>> >> +	}
>> > ^^^^
>> >
>> > Can this leak the range tree? Looking at the code, range_tree_set() was
>> > called earlier in this function and can allocate range_node structures
>> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
>> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
>> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
>> > always calls range_tree_destroy() before freeing the arena.
>>
>> As the range tree is empty at this point, we can be sure that
>> range_tree_clear() in range_tree_set() will not allocate anything.
>
> range_tree_clear() won't clear anything, but AI pointed in
> the right direction.
> Look at what range_tree_set() does. It will allocate for sure.

If I am understanding it correctly, range_tree_set() allocates memory
using kmalloc_nolock() and it fails when this allocation fails, so in
the error path we don't need to do anything as no allocation was
successful.

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Alexei Starovoitov @ 2025-11-15  1:26 UTC
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

On Fri, Nov 14, 2025 at 4:52 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> bot+bpf-ci@kernel.org writes:
> >>
> >> [ ... ]
> >>
> >> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> >> >>  		goto err;
> >> >>  	}
> >> >>  	mutex_init(&arena->lock);
> >> >> +	err = populate_pgtable_except_pte(arena);
> >> >> +	if (err) {
> >> >> +		bpf_map_area_free(arena);
> >> >> +		goto err;
> >> >> +	}
> >> > ^^^^
> >> >
> >> > Can this leak the range tree? Looking at the code, range_tree_set() was
> >> > called earlier in this function and can allocate range_node structures
> >> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> >> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> >> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> >> > always calls range_tree_destroy() before freeing the arena.
> >>
> >> As the range tree is empty at this point, we can be sure that
> >> range_tree_clear() in range_tree_set() will not allocate anything.
> >
> > range_tree_clear() won't clear anything, but AI pointed in
> > the right direction.
> > Look at what range_tree_set() does. It will allocate for sure.
>
> If I am understanding it correctly, range_tree_set() allocates memory
> using kmalloc_nolock() and it fails when this allocation fails, so in
> the error path we don't need to do anything as no allocation was successful.

Not following. Why would kmalloc_nolock() inside range tree fail?

range_tree_set() will allocate memory and above hunk after
failed populate_pgtable_except_pte() will leak it.
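
(A minimal sketch of the error path this exchange converges on, assuming
range_tree_destroy() is the right teardown helper as suggested above;
this is not a posted fix:

	err = populate_pgtable_except_pte(arena);
	if (err) {
		/* free the range_node that range_tree_set() allocated above */
		range_tree_destroy(&arena->rt);
		bpf_map_area_free(arena);
		goto err;
	}
)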

* [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks.

kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, cap the allocation done by kmalloc_nolock() at
KMALLOC_MAX_CACHE_SIZE (1024 pointers of 8 bytes each) and reuse the
array in a loop.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 76 +++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 48b8ffba3c88..7fa6e40ab3fc 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -43,6 +43,8 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)

+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+
 struct bpf_arena {
 	struct bpf_map map;
 	u64 user_vm_start;
@@ -491,7 +493,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	struct apply_range_data data;
 	struct page **pages = NULL;
+	long remaining, mapped = 0;
+	long alloc_pages;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -508,12 +513,16 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	}

-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	/*
+	 * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed.
+	 */
+	alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+	pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
 	if (!pages)
 		return 0;
+	data.pages = pages;

-	guard(mutex)(&arena->lock);
+	mutex_lock(&arena->lock);

 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -528,32 +537,51 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;

-	struct apply_range_data data = { .pages = pages, .i = 0 };
-	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
-	if (ret)
-		goto out;
-
+	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
-	/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
-	 * will not overflow 32-bit. Lower 32-bit need to represent
-	 * contiguous user address range.
-	 * Map these pages at kern_vm_start base.
-	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
-	 * lower 32-bit and it's ok.
-	 */
-	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
-				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
-	if (ret) {
-		for (i = 0; i < page_cnt; i++)
-			__free_page(pages[i]);
-		goto out;
+
+	while(remaining) {
+		long this_batch = min(remaining, alloc_pages);
+		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+		memset(pages, 0, this_batch * sizeof(struct page *));
+		data.i = 0;
+
+		ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+		if (ret)
+			goto out;
+
+		/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+		 * will not overflow 32-bit. Lower 32-bit need to represent
+		 * contiguous user address range.
+		 * Map these pages at kern_vm_start base.
+		 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+		 * lower 32-bit and it's ok.
+		 */
+		ret = apply_to_page_range(&init_mm,
+					  kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+					  this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+		if (ret) {
+			/* data.i pages were mapped, account them and free the remaining */
+			mapped += data.i;
+			for (i = data.i; i < this_batch; i++)
+				__free_page(pages[i]);
+			goto out;
+		}
+
+		mapped += this_batch;
+		remaining -= this_batch;
 	}
-	kvfree(pages);
+	mutex_unlock(&arena->lock);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
-	range_tree_set(&arena->rt, pgoff, page_cnt);
+	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
+	mutex_unlock(&arena->lock);
+	if (mapped)
+		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
 out_free_pages:
-	kvfree(pages);
+	mutex_unlock(&arena->lock);
+	kfree_nolock(pages);
 	return 0;
 }
--
2.47.1

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: bot+bpf-ci @ 2025-11-14 11:39 UTC
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 48b8ffba3..7fa6e40ab 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -528,32 +537,51 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt

[ ... ]

> -	kvfree(pages);
> +	mutex_unlock(&arena->lock);
> +	kfree_nolock(pages);
>  	return clear_lo32(arena->user_vm_start) + uaddr32;
>  out:
> -	range_tree_set(&arena->rt, pgoff, page_cnt);
> +	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
> +	mutex_unlock(&arena->lock);
> +	if (mapped)
> +		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
        ^^^^

Can arena_free_pages() be called here while holding arena->lock?
arena_free_pages() at line 618 acquires arena->lock using guard(mutex),
which will deadlock since the lock was already acquired at line 525 in
arena_alloc_pages().

> +out_free_pages:
> +	mutex_unlock(&arena->lock);
        ^^^^

Does this code unlock arena->lock twice in the error path? When the
"goto out" executes from the while loop, the mutex is unlocked at the
"out:" label above, then falls through to "out_free_pages:" and unlocks
again here.

> +	kfree_nolock(pages);
>  	return 0;
>  }

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Puranjay Mohan @ 2025-11-14 15:13 UTC
To: bot+bpf-ci, bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team,
	yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> [ ... ]
>
>>  out:
>> -	range_tree_set(&arena->rt, pgoff, page_cnt);
>> +	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
>> +	mutex_unlock(&arena->lock);
>> +	if (mapped)
>> +		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
> ^^^^
>
> Can arena_free_pages() be called here while holding arena->lock?
> arena_free_pages() at line 618 acquires arena->lock using guard(mutex),
> which will deadlock since the lock was already acquired at line 525 in
> arena_alloc_pages().

arena_free_pages() will not be called with arena->lock taken; there is
a mutex_unlock() before the call to arena_free_pages().

>> +out_free_pages:
>> +	mutex_unlock(&arena->lock);
> ^^^^
>
> Does this code unlock arena->lock twice in the error path? When the
> "goto out" executes from the while loop, the mutex is unlocked at the
> "out:" label above, then falls through to "out_free_pages:" and unlocks
> again here.

This is fixed by the next patch by adding another label, but I missed it
here. Will fix it in the next version.

> [ ... ]

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Alexei Starovoitov @ 2025-11-14 21:25 UTC
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Kernel Team

On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> +
> +	while(remaining) {
> +		long this_batch = min(remaining, alloc_pages);
> +		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> +		memset(pages, 0, this_batch * sizeof(struct page *));

run checkpatch pls.
Above needs extra space after while and empty line after 'long this_batch'.

* [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():

Replace the usage of the mutex with a rqspinlock for the range tree and
use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free
pages from any context. apply_range_set/clear_cb() with
apply_to_page_range() has already made populating the vm_area in
bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages():

Defer the main logic to a workqueue if it is called from a
non-sleepable context. specialize_kfunc() is used to replace the
sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable()
when the verifier detects the call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and
the page count to be freed to a lock-less list of struct
arena_free_span and raises an irq_work. The irq_work handler calls
schedule_work() as it is safe to be called from irq context.
arena_free_worker() (the workqueue handler) iterates these spans and
clears ptes, flushes the tlb, zaps pages, and calls __free_page().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h   |   2 +
 kernel/bpf/arena.c    | 236 +++++++++++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c |   5 +
 3 files changed, 203 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 09d5dc541d1c..5279212694b4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,8 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
 int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
 				   struct bpf_dynptr *ptr__uninit);

+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+
 extern const struct bpf_map_ops bpf_map_offload_ops;

 /* bpf_type_flag contains a set of flags that are applicable to the values of
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 7fa6e40ab3fc..ca443c113a1b 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -3,7 +3,9 @@
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/err.h>
+#include <linux/irq_work.h>
 #include "linux/filter.h"
+#include <linux/llist.h>
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
@@ -43,7 +45,7 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)

-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable);

 struct bpf_arena {
 	struct bpf_map map;
@@ -51,8 +53,23 @@ struct bpf_arena {
 	u64 user_vm_end;
 	struct vm_struct *kern_vm;
 	struct range_tree rt;
+	/* protects rt */
+	rqspinlock_t spinlock;
 	struct list_head vma_list;
+	/* protects vma_list */
 	struct mutex lock;
+	struct irq_work free_irq;
+	struct work_struct free_work;
+	struct llist_head free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+	struct llist_node node;
+	unsigned long uaddr;
+	u32 page_cnt;
 };

 u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
@@ -120,7 +137,7 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
 	return 0;
 }

-static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
 {
 	pte_t old_pte;
 	struct page *page;
@@ -130,17 +147,16 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
 	if (pte_none(old_pte) || !pte_present(old_pte))
 		return 0; /* nothing to do */

-	/* get page and free it */
+	/* get page and clear pte */
 	page = pte_page(old_pte);
 	if (WARN_ON_ONCE(!page))
 		return -EINVAL;

 	pte_clear(&init_mm, addr, pte);

-	/* ensure no stale TLB entries */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
-	__free_page(page);
+	/* Add page to the list so it is freed later */
+	if (free_pages)
+		__llist_add(&page->pcp_llist, free_pages);

 	return 0;
 }
@@ -195,6 +211,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 	arena->user_vm_end = arena->user_vm_start + vm_range;

 	INIT_LIST_HEAD(&arena->vma_list);
+	init_llist_head(&arena->free_spans);
+	init_irq_work(&arena->free_irq, arena_free_irq);
+	INIT_WORK(&arena->free_work, arena_free_worker);
 	bpf_map_init_from_attr(&arena->map, attr);
 	range_tree_init(&arena->rt);
 	err = range_tree_set(&arena->rt, 0, attr->max_entries);
@@ -203,6 +222,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	raw_res_spin_lock_init(&arena->spinlock);
 	err = populate_pgtable_except_pte(arena);
 	if (err) {
 		bpf_map_area_free(arena);
@@ -248,6 +268,10 @@ static void arena_map_free(struct bpf_map *map)
 	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
 		return;

+	/* Ensure no pending deferred frees */
+	irq_work_sync(&arena->free_irq);
+	flush_work(&arena->free_work);
+
 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
 	 * It unmaps everything from vmalloc area and clears pgtables.
@@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
 	struct page *page;
 	long kbase, kaddr;
+	unsigned long flags;
 	int ret;

 	kbase = bpf_arena_get_kern_vm_start(arena);
 	kaddr = kbase + (u32)(vmf->address);

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		/*
+		 * This is an impossible case and would only trigger if res_spin_lock is buggy or
+		 * due to another kernel bug.
+		 */
+		return VM_FAULT_RETRY;
+
 	page = vmalloc_to_page((void *)kaddr);
 	if (page)
 		/* already have a page vmap-ed */
@@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)

 	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
 	if (ret)
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;

 	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 	}

 	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		__free_page(page);
-		return VM_FAULT_SIGSEGV;
+		free_pages_nolock(page, 0);
+		goto out_unlock_sigsegv;
 	}
 out:
 	page_ref_add(page, 1);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	vmf->page = page;
 	return 0;
+out_unlock_sigsegv:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return VM_FAULT_SIGSEGV;
 }

 static const struct vm_operations_struct arena_vm_ops = {
@@ -497,6 +532,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	struct page **pages = NULL;
 	long remaining, mapped = 0;
 	long alloc_pages;
+	unsigned long flags;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -522,12 +558,13 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	data.pages = pages;

-	mutex_lock(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		goto out_free_pages;

 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -535,7 +572,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
+		goto out_unlock_free_pages;

 	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
@@ -564,23 +601,25 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			/* data.i pages were mapped, account them and free the remaining */
 			mapped += data.i;
 			for (i = data.i; i < this_batch; i++)
-				__free_page(pages[i]);
+				free_pages_nolock(pages[i], 0);
 			goto out;
 		}

 		mapped += this_batch;
 		remaining -= this_batch;
 	}
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	if (mapped)
-		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
+		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped, false);
+	goto out_free_pages;
+out_unlock_free_pages:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 out_free_pages:
-	mutex_unlock(&arena->lock);
 	kfree_nolock(pages);
 	return 0;
 }
@@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 {
 	struct vma_list *vml;

+	guard(mutex)(&arena->lock);
+	/* iterate link list under lock */
 	list_for_each_entry(vml, &arena->vma_list, head)
 		zap_page_range_single(vml->vma, uaddr,
 				      PAGE_SIZE * page_cnt, NULL);
 }

-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
 {
 	u64 full_uaddr, uaddr_end;
-	long kaddr, pgoff, i;
+	long kaddr, pgoff;
 	struct page *page;
+	struct llist_head free_pages;
+	struct llist_node *pos, *t;
+	struct arena_free_span *s;
+	unsigned long flags;
+	int ret = 0;

 	/* only aligned lower 32-bit are relevant */
 	uaddr = (u32)uaddr;
 	uaddr &= PAGE_MASK;
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
 	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
 	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
 	if (full_uaddr >= uaddr_end)
 		return;

 	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+	pgoff = compute_pgoff(arena, uaddr);

-	guard(mutex)(&arena->lock);
+	if (!sleepable)
+		goto defer;
+
+	ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+	/*
+	 * Can't proceed without holding the spinlock so defer the free
+	 */
+	if (ret)
+		goto defer;

-	pgoff = compute_pgoff(arena, uaddr);
-	/* clear range */
 	range_tree_set(&arena->rt, pgoff, page_cnt);

+	init_llist_head(&free_pages);
+	/* clear ptes and collect struct pages */
+	apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+				     apply_range_clear_cb, &free_pages);
+
+	/* drop the lock to do the tlb flush and zap pages */
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
 	if (page_cnt > 1)
 		/* bulk zap if multiple pages being freed */
 		zap_pages(arena, full_uaddr, page_cnt);

-	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
-	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
-		page = vmalloc_to_page((void *)kaddr);
-		if (!page)
-			continue;
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
 		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
 			/* Optimization for the common case of page_cnt==1:
 			 * If page wasn't mapped into some user vma there
 			 * is no need to call zap_pages which is slow. When
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
-					     NULL);
+		__free_page(page);
 	}
+
+	return;
+
+defer:
+	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+	if (!s)
+		return;
+
+	s->page_cnt = page_cnt;
+	s->uaddr = uaddr;
+	llist_add(&s->node, &arena->free_spans);
+	irq_work_queue(&arena->free_irq);
 }

 /*
@@ -649,6 +722,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
 {
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	unsigned long flags;
 	long pgoff;
 	int ret;
@@ -659,15 +733,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt
 	if (pgoff + page_cnt > page_cnt_max)
 		return -EINVAL;

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		return -EBUSY;

 	/* Cannot guard already allocated pages. */
 	ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
-	if (ret)
-		return -EBUSY;
+	if (ret) {
+		ret = -EBUSY;
+		goto out;
+	}

 	/* "Allocate" the region to prevent it from being allocated. */
-	return range_tree_clear(&arena->rt, pgoff, page_cnt);
+	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+	struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+	struct llist_node *list, *pos, *t;
+	struct arena_free_span *s;
+	u64 arena_vm_start, user_vm_start;
+	struct llist_head free_pages;
+	struct page *page;
+	unsigned long full_uaddr;
+	long kaddr, page_cnt, pgoff;
+	unsigned long flags;
+
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+		schedule_work(work);
+		return;
+	}
+
+	init_llist_head(&free_pages);
+	arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+	user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+	list = llist_del_all(&arena->free_spans);
+	llist_for_each(pos, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		kaddr = arena_vm_start + s->uaddr;
+		pgoff = compute_pgoff(arena, s->uaddr);
+
+		/* clear ptes and collect pages in free_pages llist */
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, &free_pages);
+
+		range_tree_set(&arena->rt, pgoff, page_cnt);
+	}
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+	llist_for_each_safe(pos, t, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		full_uaddr = user_vm_start + s->uaddr;
+		kaddr = arena_vm_start + s->uaddr;
+
+		/* ensure no stale TLB entries */
+		flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+		/* remove pages from user vmas */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+		kfree_nolock(s);
+	}
+
+	/* free all pages collected by apply_to_existing_page_range() in the first loop */
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
+		__free_page(page);
+	}
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+	struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+	schedule_work(&arena->free_work);
 }

 __bpf_kfunc_start_defs();
@@ -691,7 +837,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt
 	if (map->map_type !=
BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) return; - arena_free_pages(arena, (long)ptr__ign, page_cnt); + arena_free_pages(arena, (long)ptr__ign, page_cnt, true); +} + +void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt) +{ + struct bpf_map *map = p__map; + struct bpf_arena *arena = container_of(map, struct bpf_arena, map); + + if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) + return; + arena_free_pages(arena, (long)ptr__ign, page_cnt, false); } __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt) @@ -710,9 +866,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c __bpf_kfunc_end_defs(); BTF_KFUNCS_START(arena_kfuncs) -BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) BTF_KFUNCS_END(arena_kfuncs) static const struct btf_kfunc_id_set common_kfunc_set = { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1268fa075d4c..407f75daa1cb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -12319,6 +12319,7 @@ enum special_kfunc_type { KF___bpf_trap, KF_bpf_task_work_schedule_signal, KF_bpf_task_work_schedule_resume, + KF_bpf_arena_free_pages, }; BTF_ID_LIST(special_kfunc_list) @@ -12393,6 +12394,7 @@ BTF_ID(func, bpf_dynptr_file_discard) BTF_ID(func, __bpf_trap) BTF_ID(func, bpf_task_work_schedule_signal) BTF_ID(func, bpf_task_work_schedule_resume) +BTF_ID(func, bpf_arena_free_pages) static bool is_task_work_add_kfunc(u32 func_id) { @@ -22350,6 +22352,9 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc } else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) { if (!env->insn_aux_data[insn_idx].non_sleepable) addr = (unsigned long)bpf_dynptr_from_file_sleepable; + } else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) { + if (env->insn_aux_data[insn_idx].non_sleepable) + addr = (unsigned long)bpf_arena_free_pages_non_sleepable; } set_imm: -- 2.47.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
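The deferred-free machinery in this patch hangs off a small span
descriptor that is allocated with kmalloc_nolock() and pushed onto
arena->free_spans; its definition is not visible in the hunks above.
Going only by how arena_free_pages() and arena_free_worker() touch it,
the descriptor is presumably shaped like the sketch below (field types
inferred from usage, not taken from the posted patch):

        /* Sketch only: shape inferred from s->uaddr, s->page_cnt and
         * llist_add(&s->node, &arena->free_spans) in this patch.
         */
        struct arena_free_span {
                struct llist_node node; /* entry in arena->free_spans */
                long uaddr;             /* lower 32 bits of the user address */
                long page_cnt;          /* pages covered by this span */
        };

llist_add() is a lock-less cmpxchg-based push, which is what lets the
defer path run from any context; the irq_work then bounces the actual
teardown (TLB flush, zap_pages(), __free_page()) to a workqueue, where
sleeping is allowed.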
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan @ 2025-11-14 11:47 ` bot+bpf-ci 2025-11-14 15:28 ` Puranjay Mohan 2025-11-14 21:27 ` Alexei Starovoitov ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: bot+bpf-ci @ 2025-11-14 11:47 UTC (permalink / raw) To: puranjay, bpf Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 5133 bytes --] > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index 7fa6e40ab..ca443c113 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c [ ... ] > @@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > struct bpf_arena *arena = container_of(map, struct bpf_arena, map); > struct page *page; > long kbase, kaddr; > + unsigned long flags; > int ret; > > kbase = bpf_arena_get_kern_vm_start(arena); > kaddr = kbase + (u32)(vmf->address); > > - guard(mutex)(&arena->lock); > + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) > + /* > + * This is an impossible case and would only trigger if res_spin_lock is buggy or > + * due to another kernel bug. > + */ > + return VM_FAULT_RETRY; > + > page = vmalloc_to_page((void *)kaddr); > if (page) > /* already have a page vmap-ed */ > @@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > > ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); > if (ret) > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > > struct apply_range_data data = { .pages = &page, .i = 0 }; > /* Account into memcg of the process that created bpf_arena */ > ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > } > > ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - __free_page(page); > - return VM_FAULT_SIGSEGV; > + free_pages_nolock(page, 0); > + goto out_unlock_sigsegv; > } > out: > page_ref_add(page, 1); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > vmf->page = page; > return 0; In arena_vm_fault(), there's a path between acquiring the spinlock and the new goto labels that doesn't release the lock: > if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) > return VM_FAULT_SIGSEGV; This return statement is reached after raw_res_spin_lock_irqsave() succeeds but before any of the unlock paths. If BPF_F_SEGV_ON_FAULT is set, subsequent page faults on this arena will deadlock. All other return paths in the function properly jump to out_unlock_sigsegv to release the lock. > +out_unlock_sigsegv: > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > + return VM_FAULT_SIGSEGV; > } [ ... 
] > @@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > { > struct vma_list *vml; > > + guard(mutex)(&arena->lock); > + /* iterate link list under lock */ > list_for_each_entry(vml, &arena->vma_list, head) > zap_page_range_single(vml->vma, uaddr, > PAGE_SIZE * page_cnt, NULL); > } > > -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable) > { > u64 full_uaddr, uaddr_end; > - long kaddr, pgoff, i; > + long kaddr, pgoff; > struct page *page; > + struct llist_head free_pages; > + struct llist_node *pos, *t; > + struct arena_free_span *s; > + unsigned long flags; > + int ret = 0; > > /* only aligned lower 32-bit are relevant */ > uaddr = (u32)uaddr; > uaddr &= PAGE_MASK; > + kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; > full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; > uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT)); > if (full_uaddr >= uaddr_end) > return; > > page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT; > + pgoff = compute_pgoff(arena, uaddr); > > - guard(mutex)(&arena->lock); > + if (!sleepable) > + goto defer; > + > + ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags); > + /* > + * Can't proceed without holding the spinlock so defer the free > + */ > + if (ret) > + goto defer; [ ... ] > + return; > + > +defer: > + s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1); > + if (!s) > + return; In arena_free_pages(), when the deferred free path is taken but kmalloc_nolock() fails, the function returns silently without freeing the pages. These pages remain mapped in the kernel page tables and marked as allocated in the range tree, but become unusable. This can happen when: - The function is called from a non-sleepable context (sleepable=false) - Lock acquisition fails and falls through to defer - kmalloc_nolock() returns NULL (PREEMPT_RT+IRQ, lock contention, or unsupported arch) The deferred free path is called from arena_alloc_pages() error handling and bpf_arena_free_pages_non_sleepable(). In both cases, the pages that should be freed will leak. Should there be a fallback mechanism or at least a warning when deferral fails? > + > + s->page_cnt = page_cnt; > + s->uaddr = uaddr; > + llist_add(&s->node, &arena->free_spans); > + irq_work_queue(&arena->free_irq); > } --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:47 ` bot+bpf-ci @ 2025-11-14 15:28 ` Puranjay Mohan 0 siblings, 0 replies; 22+ messages in thread From: Puranjay Mohan @ 2025-11-14 15:28 UTC (permalink / raw) To: bot+bpf-ci, bpf Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai bot+bpf-ci@kernel.org writes: >> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c >> index 7fa6e40ab..ca443c113 100644 >> --- a/kernel/bpf/arena.c >> +++ b/kernel/bpf/arena.c > > [ ... ] > >> @@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) >> struct bpf_arena *arena = container_of(map, struct bpf_arena, map); >> struct page *page; >> long kbase, kaddr; >> + unsigned long flags; >> int ret; >> >> kbase = bpf_arena_get_kern_vm_start(arena); >> kaddr = kbase + (u32)(vmf->address); >> >> - guard(mutex)(&arena->lock); >> + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) >> + /* >> + * This is an impossible case and would only trigger if res_spin_lock is buggy or >> + * due to another kernel bug. >> + */ >> + return VM_FAULT_RETRY; >> + >> page = vmalloc_to_page((void *)kaddr); >> if (page) >> /* already have a page vmap-ed */ >> @@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) >> >> ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); >> if (ret) >> - return VM_FAULT_SIGSEGV; >> + goto out_unlock_sigsegv; >> >> struct apply_range_data data = { .pages = &page, .i = 0 }; >> /* Account into memcg of the process that created bpf_arena */ >> ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); >> if (ret) { >> range_tree_set(&arena->rt, vmf->pgoff, 1); >> - return VM_FAULT_SIGSEGV; >> + goto out_unlock_sigsegv; >> } >> >> ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); >> if (ret) { >> range_tree_set(&arena->rt, vmf->pgoff, 1); >> - __free_page(page); >> - return VM_FAULT_SIGSEGV; >> + free_pages_nolock(page, 0); >> + goto out_unlock_sigsegv; >> } >> out: >> page_ref_add(page, 1); >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> vmf->page = page; >> return 0; > > In arena_vm_fault(), there's a path between acquiring the spinlock and > the new goto labels that doesn't release the lock: > >> if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) >> return VM_FAULT_SIGSEGV; > > This return statement is reached after raw_res_spin_lock_irqsave() > succeeds but before any of the unlock paths. If BPF_F_SEGV_ON_FAULT is > set, subsequent page faults on this arena will deadlock. All other > return paths in the function properly jump to out_unlock_sigsegv to > release the lock. Good catch, this needs to be goto out_unlock_sigsegv; >> +out_unlock_sigsegv: >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> + return VM_FAULT_SIGSEGV; >> } > > [ ... 
]
>
>> @@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
>> {
>>         struct vma_list *vml;
>>
>> +       guard(mutex)(&arena->lock);
>> +       /* iterate link list under lock */
>>         list_for_each_entry(vml, &arena->vma_list, head)
>>                 zap_page_range_single(vml->vma, uaddr,
>>                                       PAGE_SIZE * page_cnt, NULL);
>> }
>>
>> -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
>> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
>> {
>>         u64 full_uaddr, uaddr_end;
>> -       long kaddr, pgoff, i;
>> +       long kaddr, pgoff;
>>         struct page *page;
>> +       struct llist_head free_pages;
>> +       struct llist_node *pos, *t;
>> +       struct arena_free_span *s;
>> +       unsigned long flags;
>> +       int ret = 0;
>>
>>         /* only aligned lower 32-bit are relevant */
>>         uaddr = (u32)uaddr;
>>         uaddr &= PAGE_MASK;
>> +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
>>         full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
>>         uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
>>         if (full_uaddr >= uaddr_end)
>>                 return;
>>
>>         page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
>> +       pgoff = compute_pgoff(arena, uaddr);
>>
>> -       guard(mutex)(&arena->lock);
>> +       if (!sleepable)
>> +               goto defer;
>> +
>> +       ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
>> +       /*
>> +        * Can't proceed without holding the spinlock so defer the free
>> +        */
>> +       if (ret)
>> +               goto defer;
>
> [ ... ]
>
>> +       return;
>> +
>> +defer:
>> +       s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
>> +       if (!s)
>> +               return;
>
> In arena_free_pages(), when the deferred free path is taken but
> kmalloc_nolock() fails, the function returns silently without freeing
> the pages. These pages remain mapped in the kernel page tables and
> marked as allocated in the range tree, but become unusable.
>
> This can happen when:
> - The function is called from a non-sleepable context (sleepable=false)
> - Lock acquisition fails and falls through to defer
> - kmalloc_nolock() returns NULL (PREEMPT_RT+IRQ, lock contention, or
>   unsupported arch)
>
> The deferred free path is called from arena_alloc_pages() error
> handling and bpf_arena_free_pages_non_sleepable(). In both cases, the
> pages that should be freed will leak.
>
> Should there be a fallback mechanism or at least a warning when
> deferral fails?

Yes, this is expected: if kmalloc_nolock() fails in a non-sleepable
context, we don't have another way to get memory for the
arena_free_span, so we accept this trade-off. Failing a 24-byte
allocation here is highly unlikely, but we can add a WARN_ONCE here.

We had an offline discussion about adding debug counters to the arena;
maybe we can add a counter here as well.

And for the failure path of arena_alloc_pages(), we could give
arena_alloc_pages() a sleepable parameter and pass it on to
arena_free_pages(), so we don't defer unnecessarily when
arena_alloc_pages() is called in a sleepable context.

>> +
>> +       s->page_cnt = page_cnt;
>> +       s->uaddr = uaddr;
>> +       llist_add(&s->node, &arena->free_spans);
>> +       irq_work_queue(&arena->free_irq);
>> }
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

^ permalink raw reply	[flat|nested] 22+ messages in thread
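Putting the two outcomes of this review together, the agreed follow-ups
could look roughly like the sketch below; this is an illustration of the
direction discussed in the subthread, not the posted code, and the
surrounding context is from arena_vm_fault() and arena_free_pages() in
this patch:

        /* 1) arena_vm_fault(): don't return with the rqspinlock held */
        if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
                goto out_unlock_sigsegv;   /* was: return VM_FAULT_SIGSEGV; */

        /* 2) arena_free_pages(): make a failed deferral visible */
        defer:
                s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
                if (!s) {
                        /* ~24-byte allocation; failure leaks the span's pages */
                        WARN_ONCE(1, "bpf_arena: dropping deferred free of %ld pages\n",
                                  page_cnt);
                        return;
                }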
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan 2025-11-14 11:47 ` bot+bpf-ci @ 2025-11-14 21:27 ` Alexei Starovoitov 2025-11-15 0:56 ` Puranjay Mohan 2025-11-15 8:18 ` kernel test robot 2025-11-16 1:15 ` kernel test robot 3 siblings, 1 reply; 22+ messages in thread From: Alexei Starovoitov @ 2025-11-14 21:27 UTC (permalink / raw) To: Puranjay Mohan Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote: > > > + init_llist_head(&free_pages); > + /* clear ptes and collect struct pages */ > + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, > + apply_range_clear_cb, &free_pages); > + > + /* drop the lock to do the tlb flush and zap pages */ > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > + > + /* ensure no stale TLB entries */ > + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); > + > if (page_cnt > 1) > /* bulk zap if multiple pages being freed */ > zap_pages(arena, full_uaddr, page_cnt); > > - kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; > - for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) { > - page = vmalloc_to_page((void *)kaddr); > - if (!page) > - continue; > + llist_for_each_safe(pos, t, llist_del_all(&free_pages)) { llist_del_all() ?! Why? it's a variable on stack. There is no race. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 21:27 ` Alexei Starovoitov @ 2025-11-15 0:56 ` Puranjay Mohan 2025-11-15 1:28 ` Alexei Starovoitov 0 siblings, 1 reply; 22+ messages in thread From: Puranjay Mohan @ 2025-11-15 0:56 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote: >> >> >> + init_llist_head(&free_pages); >> + /* clear ptes and collect struct pages */ >> + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, >> + apply_range_clear_cb, &free_pages); >> + >> + /* drop the lock to do the tlb flush and zap pages */ >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> + >> + /* ensure no stale TLB entries */ >> + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); >> + >> if (page_cnt > 1) >> /* bulk zap if multiple pages being freed */ >> zap_pages(arena, full_uaddr, page_cnt); >> >> - kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; >> - for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) { >> - page = vmalloc_to_page((void *)kaddr); >> - if (!page) >> - continue; >> + llist_for_each_safe(pos, t, llist_del_all(&free_pages)) { > > llist_del_all() ?! Why? it's a variable on stack. There is no race. Yeah, I should have used __llist_del_all() which doesn't do an xchg() or in this case I can just use free_pages.first ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-15  0:56           ` Puranjay Mohan
@ 2025-11-15  1:28             ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2025-11-15  1:28 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Fri, Nov 14, 2025 at 4:56 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >>
> >> +       init_llist_head(&free_pages);
> >> +       /* clear ptes and collect struct pages */
> >> +       apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
> >> +                                    apply_range_clear_cb, &free_pages);
> >> +
> >> +       /* drop the lock to do the tlb flush and zap pages */
> >> +       raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> >> +
> >> +       /* ensure no stale TLB entries */
> >> +       flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
> >> +
> >>         if (page_cnt > 1)
> >>                 /* bulk zap if multiple pages being freed */
> >>                 zap_pages(arena, full_uaddr, page_cnt);
> >>
> >> -       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> >> -       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> >> -               page = vmalloc_to_page((void *)kaddr);
> >> -               if (!page)
> >> -                       continue;
> >> +       llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
> >
> > llist_del_all() ?! Why? it's a variable on stack. There is no race.
>
> Yeah, I should have used __llist_del_all() which doesn't do an xchg() or
> in this case I can just use free_pages.first

Either one works. Slight preference for __llist_del_all() to avoid
peeking into llist details.

^ permalink raw reply	[flat|nested] 22+ messages in thread
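Since free_pages is a stack-local llist_head with a single consumer,
detaching its contents needs no atomics. A minimal sketch of the change
agreed above (__llist_del_all() instead of llist_del_all(); reading
free_pages.first directly would also work), reusing the pcp_llist field
the patch already relies on:

        struct llist_head free_pages;
        struct llist_node *pos, *t;
        struct page *page;

        init_llist_head(&free_pages);
        /* ... apply_to_existing_page_range() collects pages here ... */

        /* __llist_del_all() just takes head->first without the xchg()
         * that llist_del_all() does; safe because free_pages is on the
         * stack and nothing else can touch it concurrently.
         */
        llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
                page = llist_entry(pos, struct page, pcp_llist);
                __free_page(page);
        }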
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan 2025-11-14 11:47 ` bot+bpf-ci 2025-11-14 21:27 ` Alexei Starovoitov @ 2025-11-15 8:18 ` kernel test robot 2025-11-16 1:15 ` kernel test robot 3 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-11-15 8:18 UTC (permalink / raw) To: Puranjay Mohan, bpf Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team Hi Puranjay, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/master] url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251114-192509 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master patch link: https://lore.kernel.org/r/20251114111700.43292-4-puranjay%40kernel.org patch subject: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe config: xtensa-randconfig-r132-20251115 (https://download.01.org/0day-ci/archive/20251115/202511151534.L0gsQeTi-lkp@intel.com/config) compiler: xtensa-linux-gcc (GCC) 8.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511151534.L0gsQeTi-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202511151534.L0gsQeTi-lkp@intel.com/ All errors (new ones prefixed by >>): xtensa-linux-ld: kernel/bpf/verifier.o: in function `convert_ctx_accesses': >> kernel/bpf/verifier.c:21986: undefined reference to `bpf_arena_free_pages_non_sleepable' vim +21986 kernel/bpf/verifier.c a4b1d3c1ddf6cb Jiong Wang 2019-05-24 21682 c64b7983288e63 Joe Stringer 2018-10-02 21683 /* convert load instructions that access fields of a context type into a c64b7983288e63 Joe Stringer 2018-10-02 21684 * sequence of instructions that access fields of the underlying structure: c64b7983288e63 Joe Stringer 2018-10-02 21685 * struct __sk_buff -> struct sk_buff c64b7983288e63 Joe Stringer 2018-10-02 21686 * struct bpf_sock_ops -> struct sock 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21687 */ 58e2af8b3a6b58 Jakub Kicinski 2016-09-21 21688 static int convert_ctx_accesses(struct bpf_verifier_env *env) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21689 { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21690 struct bpf_subprog_info *subprogs = env->subprog_info; 00176a34d9e27a Jakub Kicinski 2017-10-16 21691 const struct bpf_verifier_ops *ops = env->ops; d519594ee2445d Amery Hung 2025-02-25 21692 int i, cnt, size, ctx_field_size, ret, delta = 0, epilogue_cnt = 0; 3df126f35f88dc Jakub Kicinski 2016-09-21 21693 const int insn_cnt = env->prog->len; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21694 struct bpf_insn *epilogue_buf = env->epilogue_buf; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21695 struct bpf_insn *insn_buf = env->insn_buf; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21696 struct bpf_insn *insn; 46f53a65d2de3e Andrey Ignatov 2018-11-10 21697 u32 target_size, size_default, off; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21698 struct bpf_prog *new_prog; d691f9e8d4405c Alexei Starovoitov 2015-06-04 21699 enum bpf_access_type type; f96da09473b52c Daniel Borkmann 2017-07-02 21700 bool is_narrower_load; 
169c31761c8d7f Martin KaFai Lau 2024-08-29 21701 int epilogue_idx = 0; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21702 169c31761c8d7f Martin KaFai Lau 2024-08-29 21703 if (ops->gen_epilogue) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21704 epilogue_cnt = ops->gen_epilogue(epilogue_buf, env->prog, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21705 -(subprogs[0].stack_depth + 8)); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21706 if (epilogue_cnt >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21707 verifier_bug(env, "epilogue is too long"); fd508bde5d646f Luis Gerhorst 2025-06-03 21708 return -EFAULT; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21709 } else if (epilogue_cnt) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21710 /* Save the ARG_PTR_TO_CTX for the epilogue to use */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21711 cnt = 0; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21712 subprogs[0].stack_depth += 8; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21713 insn_buf[cnt++] = BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_1, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21714 -subprogs[0].stack_depth); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21715 insn_buf[cnt++] = env->prog->insnsi[0]; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21716 new_prog = bpf_patch_insn_data(env, 0, insn_buf, cnt); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21717 if (!new_prog) 169c31761c8d7f Martin KaFai Lau 2024-08-29 21718 return -ENOMEM; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21719 env->prog = new_prog; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21720 delta += cnt - 1; d519594ee2445d Amery Hung 2025-02-25 21721 d519594ee2445d Amery Hung 2025-02-25 21722 ret = add_kfunc_in_insns(env, epilogue_buf, epilogue_cnt - 1); d519594ee2445d Amery Hung 2025-02-25 21723 if (ret < 0) d519594ee2445d Amery Hung 2025-02-25 21724 return ret; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21725 } 169c31761c8d7f Martin KaFai Lau 2024-08-29 21726 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21727 b09928b976280d Daniel Borkmann 2018-10-24 21728 if (ops->gen_prologue || env->seen_direct_write) { b09928b976280d Daniel Borkmann 2018-10-24 21729 if (!ops->gen_prologue) { 0df1a55afa832f Paul Chaignon 2025-07-01 21730 verifier_bug(env, "gen_prologue is null"); fd508bde5d646f Luis Gerhorst 2025-06-03 21731 return -EFAULT; b09928b976280d Daniel Borkmann 2018-10-24 21732 } 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21733 cnt = ops->gen_prologue(insn_buf, env->seen_direct_write, 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21734 env->prog); 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21735 if (cnt >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21736 verifier_bug(env, "prologue is too long"); fd508bde5d646f Luis Gerhorst 2025-06-03 21737 return -EFAULT; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21738 } else if (cnt) { 8041902dae5299 Alexei Starovoitov 2017-03-15 21739 new_prog = bpf_patch_insn_data(env, 0, insn_buf, cnt); 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21740 if (!new_prog) 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21741 return -ENOMEM; 8041902dae5299 Alexei Starovoitov 2017-03-15 21742 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21743 env->prog = new_prog; 3df126f35f88dc Jakub Kicinski 2016-09-21 21744 delta += cnt - 1; d519594ee2445d Amery Hung 2025-02-25 21745 d519594ee2445d Amery Hung 2025-02-25 21746 ret = add_kfunc_in_insns(env, insn_buf, cnt - 1); d519594ee2445d Amery Hung 2025-02-25 21747 if (ret < 0) d519594ee2445d Amery Hung 2025-02-25 21748 return ret; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21749 } 
36bbef52c7eb64 Daniel Borkmann 2016-09-20 21750 } 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21751 d5c47719f24438 Martin KaFai Lau 2024-08-29 21752 if (delta) d5c47719f24438 Martin KaFai Lau 2024-08-29 21753 WARN_ON(adjust_jmp_off(env->prog, 0, delta)); d5c47719f24438 Martin KaFai Lau 2024-08-29 21754 9d03ebc71a027c Stanislav Fomichev 2023-01-19 21755 if (bpf_prog_is_offloaded(env->prog->aux)) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21756 return 0; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21757 3df126f35f88dc Jakub Kicinski 2016-09-21 21758 insn = env->prog->insnsi + delta; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21759 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21760 for (i = 0; i < insn_cnt; i++, insn++) { c64b7983288e63 Joe Stringer 2018-10-02 21761 bpf_convert_ctx_access_t convert_ctx_access; 1f1e864b65554e Yonghong Song 2023-07-27 21762 u8 mode; c64b7983288e63 Joe Stringer 2018-10-02 21763 d6f1c85f22534d Luis Gerhorst 2025-06-03 21764 if (env->insn_aux_data[i + delta].nospec) { d6f1c85f22534d Luis Gerhorst 2025-06-03 21765 WARN_ON_ONCE(env->insn_aux_data[i + delta].alu_state); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21766 struct bpf_insn *patch = insn_buf; d6f1c85f22534d Luis Gerhorst 2025-06-03 21767 45e9cd38aa8df9 Yonghong Song 2025-07-03 21768 *patch++ = BPF_ST_NOSPEC(); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21769 *patch++ = *insn; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21770 cnt = patch - insn_buf; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21771 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); d6f1c85f22534d Luis Gerhorst 2025-06-03 21772 if (!new_prog) d6f1c85f22534d Luis Gerhorst 2025-06-03 21773 return -ENOMEM; d6f1c85f22534d Luis Gerhorst 2025-06-03 21774 d6f1c85f22534d Luis Gerhorst 2025-06-03 21775 delta += cnt - 1; d6f1c85f22534d Luis Gerhorst 2025-06-03 21776 env->prog = new_prog; d6f1c85f22534d Luis Gerhorst 2025-06-03 21777 insn = new_prog->insnsi + i + delta; d6f1c85f22534d Luis Gerhorst 2025-06-03 21778 /* This can not be easily merged with the d6f1c85f22534d Luis Gerhorst 2025-06-03 21779 * nospec_result-case, because an insn may require a d6f1c85f22534d Luis Gerhorst 2025-06-03 21780 * nospec before and after itself. Therefore also do not d6f1c85f22534d Luis Gerhorst 2025-06-03 21781 * 'continue' here but potentially apply further d6f1c85f22534d Luis Gerhorst 2025-06-03 21782 * patching to insn. *insn should equal patch[1] now. 
d6f1c85f22534d Luis Gerhorst 2025-06-03 21783 */ d6f1c85f22534d Luis Gerhorst 2025-06-03 21784 } d6f1c85f22534d Luis Gerhorst 2025-06-03 21785 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21786 if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21787 insn->code == (BPF_LDX | BPF_MEM | BPF_H) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21788 insn->code == (BPF_LDX | BPF_MEM | BPF_W) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21789 insn->code == (BPF_LDX | BPF_MEM | BPF_DW) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21790 insn->code == (BPF_LDX | BPF_MEMSX | BPF_B) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21791 insn->code == (BPF_LDX | BPF_MEMSX | BPF_H) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21792 insn->code == (BPF_LDX | BPF_MEMSX | BPF_W)) { d691f9e8d4405c Alexei Starovoitov 2015-06-04 21793 type = BPF_READ; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21794 } else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21795 insn->code == (BPF_STX | BPF_MEM | BPF_H) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21796 insn->code == (BPF_STX | BPF_MEM | BPF_W) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21797 insn->code == (BPF_STX | BPF_MEM | BPF_DW) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21798 insn->code == (BPF_ST | BPF_MEM | BPF_B) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21799 insn->code == (BPF_ST | BPF_MEM | BPF_H) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21800 insn->code == (BPF_ST | BPF_MEM | BPF_W) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21801 insn->code == (BPF_ST | BPF_MEM | BPF_DW)) { d691f9e8d4405c Alexei Starovoitov 2015-06-04 21802 type = BPF_WRITE; 880442305a3908 Peilin Ye 2025-03-04 21803 } else if ((insn->code == (BPF_STX | BPF_ATOMIC | BPF_B) || 880442305a3908 Peilin Ye 2025-03-04 21804 insn->code == (BPF_STX | BPF_ATOMIC | BPF_H) || 880442305a3908 Peilin Ye 2025-03-04 21805 insn->code == (BPF_STX | BPF_ATOMIC | BPF_W) || d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21806 insn->code == (BPF_STX | BPF_ATOMIC | BPF_DW)) && d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21807 env->insn_aux_data[i + delta].ptr_type == PTR_TO_ARENA) { d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21808 insn->code = BPF_STX | BPF_PROBE_ATOMIC | BPF_SIZE(insn->code); d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21809 env->prog->aux->num_exentries++; d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21810 continue; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21811 } else if (insn->code == (BPF_JMP | BPF_EXIT) && 169c31761c8d7f Martin KaFai Lau 2024-08-29 21812 epilogue_cnt && 169c31761c8d7f Martin KaFai Lau 2024-08-29 21813 i + delta < subprogs[1].start) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21814 /* Generate epilogue for the main prog */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21815 if (epilogue_idx) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21816 /* jump back to the earlier generated epilogue */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21817 insn_buf[0] = BPF_JMP32_A(epilogue_idx - i - delta - 1); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21818 cnt = 1; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21819 } else { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21820 memcpy(insn_buf, epilogue_buf, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21821 epilogue_cnt * sizeof(*epilogue_buf)); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21822 cnt = epilogue_cnt; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21823 /* epilogue_idx cannot be 0. 
It must have at 169c31761c8d7f Martin KaFai Lau 2024-08-29 21824 * least one ctx ptr saving insn before the 169c31761c8d7f Martin KaFai Lau 2024-08-29 21825 * epilogue. 169c31761c8d7f Martin KaFai Lau 2024-08-29 21826 */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21827 epilogue_idx = i + delta; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21828 } 169c31761c8d7f Martin KaFai Lau 2024-08-29 21829 goto patch_insn_buf; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21830 } else { 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21831 continue; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21832 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21833 af86ca4e3088fe Alexei Starovoitov 2018-05-15 21834 if (type == BPF_WRITE && 9124a4508007f1 Luis Gerhorst 2025-06-03 21835 env->insn_aux_data[i + delta].nospec_result) { d6f1c85f22534d Luis Gerhorst 2025-06-03 21836 /* nospec_result is only used to mitigate Spectre v4 and d6f1c85f22534d Luis Gerhorst 2025-06-03 21837 * to limit verification-time for Spectre v1. d6f1c85f22534d Luis Gerhorst 2025-06-03 21838 */ 45e9cd38aa8df9 Yonghong Song 2025-07-03 21839 struct bpf_insn *patch = insn_buf; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21840 45e9cd38aa8df9 Yonghong Song 2025-07-03 21841 *patch++ = *insn; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21842 *patch++ = BPF_ST_NOSPEC(); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21843 cnt = patch - insn_buf; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21844 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); af86ca4e3088fe Alexei Starovoitov 2018-05-15 21845 if (!new_prog) af86ca4e3088fe Alexei Starovoitov 2018-05-15 21846 return -ENOMEM; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21847 af86ca4e3088fe Alexei Starovoitov 2018-05-15 21848 delta += cnt - 1; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21849 env->prog = new_prog; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21850 insn = new_prog->insnsi + i + delta; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21851 continue; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21852 } af86ca4e3088fe Alexei Starovoitov 2018-05-15 21853 6efe152d4061a8 Kumar Kartikeya Dwivedi 2022-04-25 21854 switch ((int)env->insn_aux_data[i + delta].ptr_type) { c64b7983288e63 Joe Stringer 2018-10-02 21855 case PTR_TO_CTX: c64b7983288e63 Joe Stringer 2018-10-02 21856 if (!ops->convert_ctx_access) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21857 continue; c64b7983288e63 Joe Stringer 2018-10-02 21858 convert_ctx_access = ops->convert_ctx_access; c64b7983288e63 Joe Stringer 2018-10-02 21859 break; c64b7983288e63 Joe Stringer 2018-10-02 21860 case PTR_TO_SOCKET: 46f8bc92758c62 Martin KaFai Lau 2019-02-09 21861 case PTR_TO_SOCK_COMMON: c64b7983288e63 Joe Stringer 2018-10-02 21862 convert_ctx_access = bpf_sock_convert_ctx_access; c64b7983288e63 Joe Stringer 2018-10-02 21863 break; 655a51e536c09d Martin KaFai Lau 2019-02-09 21864 case PTR_TO_TCP_SOCK: 655a51e536c09d Martin KaFai Lau 2019-02-09 21865 convert_ctx_access = bpf_tcp_sock_convert_ctx_access; 655a51e536c09d Martin KaFai Lau 2019-02-09 21866 break; fada7fdc83c0bf Jonathan Lemon 2019-06-06 21867 case PTR_TO_XDP_SOCK: fada7fdc83c0bf Jonathan Lemon 2019-06-06 21868 convert_ctx_access = bpf_xdp_sock_convert_ctx_access; fada7fdc83c0bf Jonathan Lemon 2019-06-06 21869 break; 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21870 case PTR_TO_BTF_ID: 6efe152d4061a8 Kumar Kartikeya Dwivedi 2022-04-25 21871 case PTR_TO_BTF_ID | PTR_UNTRUSTED: 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21872 /* PTR_TO_BTF_ID | MEM_ALLOC always has a valid lifetime, unlike 
282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21873 * PTR_TO_BTF_ID, and an active ref_obj_id, but the same cannot 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21874 * be said once it is marked PTR_UNTRUSTED, hence we must handle 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21875 * any faults for loads into such types. BPF_WRITE is disallowed 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21876 * for this case. 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21877 */ 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21878 case PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED: f2362a57aefff5 Eduard Zingerman 2025-06-25 21879 case PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED: 27ae7997a66174 Martin KaFai Lau 2020-01-08 21880 if (type == BPF_READ) { 1f9a1ea821ff25 Yonghong Song 2023-07-27 21881 if (BPF_MODE(insn->code) == BPF_MEM) 27ae7997a66174 Martin KaFai Lau 2020-01-08 21882 insn->code = BPF_LDX | BPF_PROBE_MEM | 27ae7997a66174 Martin KaFai Lau 2020-01-08 21883 BPF_SIZE((insn)->code); 1f9a1ea821ff25 Yonghong Song 2023-07-27 21884 else 1f9a1ea821ff25 Yonghong Song 2023-07-27 21885 insn->code = BPF_LDX | BPF_PROBE_MEMSX | 1f9a1ea821ff25 Yonghong Song 2023-07-27 21886 BPF_SIZE((insn)->code); 27ae7997a66174 Martin KaFai Lau 2020-01-08 21887 env->prog->aux->num_exentries++; 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21888 } 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21889 continue; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21890 case PTR_TO_ARENA: 6082b6c328b548 Alexei Starovoitov 2024-03-07 21891 if (BPF_MODE(insn->code) == BPF_MEMSX) { a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21892 if (!bpf_jit_supports_insn(insn, true)) { 6082b6c328b548 Alexei Starovoitov 2024-03-07 21893 verbose(env, "sign extending loads from arena are not supported yet\n"); 6082b6c328b548 Alexei Starovoitov 2024-03-07 21894 return -EOPNOTSUPP; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21895 } a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21896 insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32SX | BPF_SIZE(insn->code); a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21897 } else { 6082b6c328b548 Alexei Starovoitov 2024-03-07 21898 insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32 | BPF_SIZE(insn->code); a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21899 } 6082b6c328b548 Alexei Starovoitov 2024-03-07 21900 env->prog->aux->num_exentries++; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21901 continue; c64b7983288e63 Joe Stringer 2018-10-02 21902 default: c64b7983288e63 Joe Stringer 2018-10-02 21903 continue; c64b7983288e63 Joe Stringer 2018-10-02 21904 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21905 31fd85816dbe3a Yonghong Song 2017-06-13 21906 ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size; f96da09473b52c Daniel Borkmann 2017-07-02 21907 size = BPF_LDST_BYTES(insn); 1f1e864b65554e Yonghong Song 2023-07-27 21908 mode = BPF_MODE(insn->code); 31fd85816dbe3a Yonghong Song 2017-06-13 21909 31fd85816dbe3a Yonghong Song 2017-06-13 21910 /* If the read access is a narrower load of the field, 31fd85816dbe3a Yonghong Song 2017-06-13 21911 * convert to a 4/8-byte load, to minimum program type specific 31fd85816dbe3a Yonghong Song 2017-06-13 21912 * convert_ctx_access changes. If conversion is successful, 31fd85816dbe3a Yonghong Song 2017-06-13 21913 * we will apply proper mask to the result. 
31fd85816dbe3a Yonghong Song 2017-06-13 21914 */ f96da09473b52c Daniel Borkmann 2017-07-02 21915 is_narrower_load = size < ctx_field_size; 46f53a65d2de3e Andrey Ignatov 2018-11-10 21916 size_default = bpf_ctx_off_adjust_machine(ctx_field_size); 46f53a65d2de3e Andrey Ignatov 2018-11-10 21917 off = insn->off; 31fd85816dbe3a Yonghong Song 2017-06-13 21918 if (is_narrower_load) { f96da09473b52c Daniel Borkmann 2017-07-02 21919 u8 size_code; 31fd85816dbe3a Yonghong Song 2017-06-13 21920 f96da09473b52c Daniel Borkmann 2017-07-02 21921 if (type == BPF_WRITE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21922 verifier_bug(env, "narrow ctx access misconfigured"); fd508bde5d646f Luis Gerhorst 2025-06-03 21923 return -EFAULT; f96da09473b52c Daniel Borkmann 2017-07-02 21924 } f96da09473b52c Daniel Borkmann 2017-07-02 21925 f96da09473b52c Daniel Borkmann 2017-07-02 21926 size_code = BPF_H; 31fd85816dbe3a Yonghong Song 2017-06-13 21927 if (ctx_field_size == 4) 31fd85816dbe3a Yonghong Song 2017-06-13 21928 size_code = BPF_W; 31fd85816dbe3a Yonghong Song 2017-06-13 21929 else if (ctx_field_size == 8) 31fd85816dbe3a Yonghong Song 2017-06-13 21930 size_code = BPF_DW; f96da09473b52c Daniel Borkmann 2017-07-02 21931 bc23105ca0abde Daniel Borkmann 2018-06-02 21932 insn->off = off & ~(size_default - 1); 31fd85816dbe3a Yonghong Song 2017-06-13 21933 insn->code = BPF_LDX | BPF_MEM | size_code; 31fd85816dbe3a Yonghong Song 2017-06-13 21934 } f96da09473b52c Daniel Borkmann 2017-07-02 21935 f96da09473b52c Daniel Borkmann 2017-07-02 21936 target_size = 0; c64b7983288e63 Joe Stringer 2018-10-02 21937 cnt = convert_ctx_access(type, insn, insn_buf, env->prog, f96da09473b52c Daniel Borkmann 2017-07-02 21938 &target_size); 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21939 if (cnt == 0 || cnt >= INSN_BUF_SIZE || f96da09473b52c Daniel Borkmann 2017-07-02 21940 (ctx_field_size && !target_size)) { f914876eec9e72 Paul Chaignon 2025-08-01 21941 verifier_bug(env, "error during ctx access conversion (%d)", cnt); fd508bde5d646f Luis Gerhorst 2025-06-03 21942 return -EFAULT; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21943 } f96da09473b52c Daniel Borkmann 2017-07-02 21944 f96da09473b52c Daniel Borkmann 2017-07-02 21945 if (is_narrower_load && size < target_size) { d895a0f16fadb2 Ilya Leoshkevich 2019-08-16 21946 u8 shift = bpf_ctx_narrow_access_offset( d895a0f16fadb2 Ilya Leoshkevich 2019-08-16 21947 off, size, size_default) * 8; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21948 if (shift && cnt + 1 >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21949 verifier_bug(env, "narrow ctx load misconfigured"); fd508bde5d646f Luis Gerhorst 2025-06-03 21950 return -EFAULT; d7af7e497f0308 Andrey Ignatov 2021-08-20 21951 } 46f53a65d2de3e Andrey Ignatov 2018-11-10 21952 if (ctx_field_size <= 4) { 46f53a65d2de3e Andrey Ignatov 2018-11-10 21953 if (shift) 46f53a65d2de3e Andrey Ignatov 2018-11-10 21954 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21955 insn->dst_reg, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21956 shift); 31fd85816dbe3a Yonghong Song 2017-06-13 21957 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, 31fd85816dbe3a Yonghong Song 2017-06-13 21958 (1 << size * 8) - 1); 46f53a65d2de3e Andrey Ignatov 2018-11-10 21959 } else { 46f53a65d2de3e Andrey Ignatov 2018-11-10 21960 if (shift) 46f53a65d2de3e Andrey Ignatov 2018-11-10 21961 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21962 insn->dst_reg, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21963 shift); 
0613d8ca9ab382 Will Deacon 2023-05-18 21964 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, e2f7fc0ac6957c Krzesimir Nowak 2019-05-08 21965 (1ULL << size * 8) - 1); 31fd85816dbe3a Yonghong Song 2017-06-13 21966 } 46f53a65d2de3e Andrey Ignatov 2018-11-10 21967 } 1f1e864b65554e Yonghong Song 2023-07-27 21968 if (mode == BPF_MEMSX) 1f1e864b65554e Yonghong Song 2023-07-27 21969 insn_buf[cnt++] = BPF_RAW_INSN(BPF_ALU64 | BPF_MOV | BPF_X, 1f1e864b65554e Yonghong Song 2023-07-27 21970 insn->dst_reg, insn->dst_reg, 1f1e864b65554e Yonghong Song 2023-07-27 21971 size * 8, 0); 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21972 169c31761c8d7f Martin KaFai Lau 2024-08-29 21973 patch_insn_buf: 8041902dae5299 Alexei Starovoitov 2017-03-15 21974 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21975 if (!new_prog) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21976 return -ENOMEM; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21977 3df126f35f88dc Jakub Kicinski 2016-09-21 21978 delta += cnt - 1; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21979 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21980 /* keep walking new program and skip insns we just inserted */ 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21981 env->prog = new_prog; 3df126f35f88dc Jakub Kicinski 2016-09-21 21982 insn = new_prog->insnsi + i + delta; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21983 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21984 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21985 return 0; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 @21986 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21987 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan ` (2 preceding siblings ...) 2025-11-15 8:18 ` kernel test robot @ 2025-11-16 1:15 ` kernel test robot 3 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-11-16 1:15 UTC (permalink / raw) To: Puranjay Mohan, bpf Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team Hi Puranjay, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/master] url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251114-192509 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master patch link: https://lore.kernel.org/r/20251114111700.43292-4-puranjay%40kernel.org patch subject: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe config: sh-randconfig-r071-20251115 (https://download.01.org/0day-ci/archive/20251116/202511160836.5Ca6PimB-lkp@intel.com/config) compiler: sh4-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251116/202511160836.5Ca6PimB-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202511160836.5Ca6PimB-lkp@intel.com/ All errors (new ones prefixed by >>): sh4-linux-ld: kernel/bpf/verifier.o: in function `fixup_kfunc_call': >> kernel/bpf/verifier.c:22428:(.text+0x7748): undefined reference to `bpf_arena_free_pages_non_sleepable' sh4-linux-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_resume': drivers/net/phy/air_en8811h.c:1178:(.text+0x544): undefined reference to `clk_restore_context' sh4-linux-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_suspend': drivers/net/phy/air_en8811h.c:1185:(.text+0x56c): undefined reference to `clk_save_context' sh4-linux-ld: drivers/media/i2c/tc358746.o: in function `tc358746_probe': drivers/media/i2c/tc358746.c:1585:(.text+0x1408): undefined reference to `devm_clk_hw_register' Kconfig warnings: (for reference only) WARNING: unmet direct dependencies detected for OF_GPIO Depends on [n]: GPIOLIB [=y] && OF [=n] && HAS_IOMEM [=y] Selected by [y]: - GPIO_TB10X [=y] && GPIOLIB [=y] && HAS_IOMEM [=y] && (ARC_PLAT_TB10X || COMPILE_TEST [=y]) WARNING: unmet direct dependencies detected for GPIO_SYSCON Depends on [n]: GPIOLIB [=y] && HAS_IOMEM [=y] && MFD_SYSCON [=y] && OF [=n] Selected by [y]: - GPIO_SAMA5D2_PIOBU [=y] && GPIOLIB [=y] && HAS_IOMEM [=y] && MFD_SYSCON [=y] && OF_GPIO [=y] && (ARCH_AT91 || COMPILE_TEST [=y]) WARNING: unmet direct dependencies detected for I2C_K1 Depends on [n]: I2C [=y] && HAS_IOMEM [=y] && (ARCH_SPACEMIT || COMPILE_TEST [=y]) && OF [=n] Selected by [y]: - MFD_SPACEMIT_P1 [=y] && HAS_IOMEM [=y] && (ARCH_SPACEMIT || COMPILE_TEST [=y]) && I2C [=y] vim +22428 kernel/bpf/verifier.c d2dcc67df910dd Dave Marchevsky 2023-04-15 22392 958cf2e273f092 Kumar Kartikeya Dwivedi 2022-11-18 22393 static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, 958cf2e273f092 Kumar Kartikeya Dwivedi 2022-11-18 22394 struct bpf_insn *insn_buf, int insn_idx, int *cnt) e6ac2450d6dee3 Martin KaFai Lau 2021-03-24 22395 { d869d56ca84841 
	struct bpf_kfunc_desc *desc;
	int err;

	if (!insn->imm) {
		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
		return -EINVAL;
	}

	*cnt = 0;

	/* insn->imm has the btf func_id. Replace it with an offset relative to
	 * __bpf_call_base, unless the JIT needs to call functions that are
	 * further than 32 bits away (bpf_jit_supports_far_kfunc_call()).
	 */
	desc = find_kfunc_desc(env->prog, insn->imm, insn->off);
	if (!desc) {
		verifier_bug(env, "kernel function descriptor not found for func_id %u",
			     insn->imm);
		return -EFAULT;
	}

	err = specialize_kfunc(env, desc, insn_idx);
	if (err)
		return err;

	if (!bpf_jit_supports_far_kfunc_call())
		insn->imm = BPF_CALL_IMM(desc->addr);
	if (insn->off)
		return 0;
	if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
	    desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
		u64 obj_new_size = env->insn_aux_data[insn_idx].obj_new_size;

		if (desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl] && kptr_struct_meta) {
			verifier_bug(env, "NULL kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, obj_new_size);
		insn_buf[1] = addr[0];
		insn_buf[2] = addr[1];
		insn_buf[3] = *insn;
		*cnt = 4;
	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_drop_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_drop_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };

		if (desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_drop_impl] && kptr_struct_meta) {
			verifier_bug(env, "NULL kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		if (desc->func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl] &&
		    !kptr_struct_meta) {
			verifier_bug(env, "kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		insn_buf[0] = addr[0];
		insn_buf[1] = addr[1];
		insn_buf[2] = *insn;
		*cnt = 3;
	} else if (desc->func_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		int struct_meta_reg = BPF_REG_3;
		int node_offset_reg = BPF_REG_4;

		/* rbtree_add has extra 'less' arg, so args-to-fixup are in diff regs */
		if (desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
			struct_meta_reg = BPF_REG_4;
			node_offset_reg = BPF_REG_5;
		}

		if (!kptr_struct_meta) {
			verifier_bug(env, "kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
						node_offset_reg, insn, insn_buf, cnt);
	} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
		*cnt = 1;
	}

	if (env->insn_aux_data[insn_idx].arg_prog) {
		u32 regno = env->insn_aux_data[insn_idx].arg_prog;
		struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(regno, (long)env->prog->aux) };
		int idx = *cnt;

		insn_buf[idx++] = ld_addrs[0];
		insn_buf[idx++] = ld_addrs[1];
		insn_buf[idx++] = *insn;
		*cnt = idx;
	}
	return 0;
}

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
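To illustrate the specialize_kfunc() mechanism this series relies on for the
polymorphic bpf_arena_free_pages(), here is a standalone, simplified model
(not the kernel implementation; the struct layout, function names, and the
sleepable check are all illustrative). The idea is that the verifier's fixup
pass rewrites the call descriptor's target address based on properties of
the program, so one kfunc name can resolve to different implementations:

#include <stdio.h>
#include <stdbool.h>

typedef void (*kfunc_impl_t)(void);

/* toy stand-in for struct bpf_kfunc_desc */
struct kfunc_desc {
	kfunc_impl_t addr;
};

/* hypothetical sleepable and non-sleepable variants of a kfunc */
static void arena_free_sleepable(void) { puts("sleepable free path"); }
static void arena_free_nosleep(void)   { puts("deferred free path"); }

/* model of specialize_kfunc(): pick the implementation the JIT will call */
static void specialize_kfunc(struct kfunc_desc *desc, bool prog_sleepable)
{
	desc->addr = prog_sleepable ? arena_free_sleepable
				    : arena_free_nosleep;
}

int main(void)
{
	struct kfunc_desc desc;

	specialize_kfunc(&desc, false);
	desc.addr();	/* emulates the fixed-up kfunc call */
	return 0;
}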
* [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 11:16 [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
                   ` (2 preceding siblings ...)
  2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-11-14 11:16 ` Puranjay Mohan
  2025-11-14 22:18   ` Alexei Starovoitov
  3 siblings, 1 reply; 22+ messages in thread
From: Puranjay Mohan @ 2025-11-14 11:16 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Arena kfuncs can now be called from non-sleepable contexts. Test this by
adding non-sleepable copies of the tests in verifier_arena; these use a
socket program instead of a syscall program.

Add a new test case in verifier_arena_large to check that
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock(), but
bpf_arena_alloc_pages() should still succeed because it reuses the pages
array in a loop.

Augment the arena_list selftest to also run in a non-sleepable context by
taking an RCU read lock.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 ++
 .../selftests/bpf/progs/verifier_arena.c      | 185 ++++++++++++++++++
 .../bpf/progs/verifier_arena_large.c          |  24 +++
 4 files changed, 235 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
index d15867cddde0..4f2866a615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head)
 	return sum;
 }
 
-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
 {
 	LIBBPF_OPTS(bpf_test_run_opts, opts);
 	struct arena_list *skel;
 	int expected_sum = (u64)cnt * (cnt - 1) / 2;
 	int ret, sum;
 
-	skel = arena_list__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+	skel = arena_list__open();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open"))
 		return;
 
+	skel->rodata->nonsleepable = nonsleepable;
+
+	ret = arena_list__load(skel);
+	if (!ASSERT_OK(ret, "arena_list__load"))
+		goto out;
+
 	skel->bss->cnt = cnt;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
 	ASSERT_OK(ret, "ret_add");
@@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt)
 void test_arena_list(void)
 {
 	if (test__start_subtest("arena_list_1"))
-		test_arena_list_add_del(1);
+		test_arena_list_add_del(1, false);
 	if (test__start_subtest("arena_list_1000"))
-		test_arena_list_add_del(1000);
+		test_arena_list_add_del(1000, false);
+	if (test__start_subtest("arena_list_1_nonsleepable"))
+		test_arena_list_add_del(1, true);
+	if (test__start_subtest("arena_list_1000_nonsleepable"))
+		test_arena_list_add_del(1000, true);
 }
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
index 3a2ddcacbea6..235d8cc95bdd 100644
--- a/tools/testing/selftests/bpf/progs/arena_list.c
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head;
 int list_sum;
 int cnt;
 bool skip = false;
+const volatile bool nonsleepable = false;
 
 #ifdef __BPF_FEATURE_ADDR_SPACE_CAST
 long __arena arena_sum;
@@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1");
 
 int zero;
 
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
 SEC("syscall")
 int arena_list_add(void *ctx)
 {
@@ -71,6 +75,10 @@ int arena_list_del(void *ctx)
 	struct elem __arena *n;
 	int sum = 0;
 
+	/* Take rcu_read_lock to test non-sleepable context */
+	if (nonsleepable)
+		bpf_rcu_read_lock();
+
 	arena_sum = 0;
 	list_for_each_entry(n, list_head, node) {
 		sum += n->value;
@@ -79,6 +87,9 @@ int arena_list_del(void *ctx)
 		bpf_free(n);
 	}
 	list_sum = sum;
+
+	if (nonsleepable)
+		bpf_rcu_read_unlock();
 #else
 	skip = true;
 #endif
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
index 7f4827eede3c..4a9d96344813 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -21,6 +21,37 @@ struct {
 #endif
 } arena SEC(".maps");
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile int __arena *page1, *page2, *no_page;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+		return 7;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc1(void *ctx)
@@ -60,6 +91,44 @@ int basic_alloc1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0 && *page1 != 1)
+		return 5;
+	if (*page2 != 0 && *page2 != 2)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc2(void *ctx)
@@ -102,6 +171,19 @@ struct bpf_arena___l {
 	struct bpf_map map;
 } __attribute__((preserve_access_index));
 
+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+	struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+	volatile char __arena *pages;
+
+	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+	if (!pages)
+		return 1;
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0) __log_level(2)
 int basic_alloc3(void *ctx)
@@ -115,6 +197,38 @@ int basic_alloc3(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page)
+		return 1;
+
+	page += __PAGE_SIZE;
+
+	/* Reserve the second page */
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 2;
+
+	/* Try to explicitly allocate the reserved page. */
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 3;
+
+	/* Try to implicitly allocate the page (since there's only 2 of them). */
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve1(void *ctx)
@@ -147,6 +261,26 @@ int basic_reserve1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if ((u64)page)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve2(void *ctx)
@@ -168,6 +302,27 @@ int basic_reserve2(void *ctx)
 }
 
 /* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret != -EBUSY)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_twice(void *ctx)
@@ -190,6 +345,36 @@ int reserve_twice(void *ctx)
 }
 
 /* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	/* Try a NULL pointer. */
+	ret = bpf_arena_reserve_pages(&arena, NULL, 3);
+	if (ret != -EINVAL)
+		return 1;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 3);
+	if (ret != -EINVAL)
+		return 2;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 4096);
+	if (ret != -EINVAL)
+		return 3;
+
+	ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1);
+	if (ret != -EINVAL)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_invalid_region(void *ctx)
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
index f19e15400b3e..507cd489e3e2 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -270,5 +270,29 @@ int big_alloc2(void *ctx)
 		return 9;
 	return 0;
 }
+
+SEC("socket")
+__success __retval(0)
+int big_alloc3(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *pages;
+	u64 i;
+
+	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
+	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
+	if (!pages)
+		return -1;
+
+	bpf_for(i, 0, 2051)
+		pages[i * PAGE_SIZE] = 123;
+	bpf_for(i, 0, 2051)
+		if (pages[i * PAGE_SIZE] != 123)
+			return i;
+
+	bpf_arena_free_pages(&arena, pages, 1025);
+#endif
+	return 0;
+}
 #endif
 
 char _license[] SEC("license") = "GPL";
-- 
2.47.1
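As context for what big_alloc3() exercises: patch 2 caps the temporary page
pointer array allocated with kmalloc_nolock() and reuses it across loop
iterations, so a request larger than the cap is served in chunks. The
following is a standalone, user-space sketch of that pattern, not the actual
arena.c code; ARRAY_CAP, alloc_in_chunks(), and the use of malloc() in place
of kmalloc_nolock() are all stand-ins:

#include <stdio.h>
#include <stdlib.h>

#define ARRAY_CAP 1024 /* stand-in for KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *) */

static long alloc_in_chunks(long nr_pages)
{
	long cap = nr_pages < ARRAY_CAP ? nr_pages : ARRAY_CAP;
	void **pages = malloc(cap * sizeof(*pages)); /* stand-in for kmalloc_nolock() */
	long done = 0;

	if (!pages)
		return -1;
	while (done < nr_pages) {
		long left = nr_pages - done;
		long chunk = left < cap ? left : cap;

		/* here the kernel would fill pages[0..chunk-1] via
		 * alloc_pages_nolock() and insert them into the arena;
		 * the same array is simply reused on the next iteration
		 */
		printf("chunk of %ld pages\n", chunk);
		done += chunk;
	}
	free(pages);
	return done;
}

int main(void)
{
	/* a 2051-page request splits into 1024 + 1024 + 3 */
	return alloc_in_chunks(2051) == 2051 ? 0 : 1;
}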
* Re: [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 11:16 ` [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
@ 2025-11-14 22:18   ` Alexei Starovoitov
  2025-11-15  0:58     ` Puranjay Mohan
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2025-11-14 22:18 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Kernel Team

On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> +
> +	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
> +	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);

Please explain the choice of 2051 a bit better.
I think you wanted to do 3 steps, with the last one not aligned to 1024?

> +	if (!pages)
> +		return -1;
> +
> +	bpf_for(i, 0, 2051)
> +		pages[i * PAGE_SIZE] = 123;
> +	bpf_for(i, 0, 2051)
> +		if (pages[i * PAGE_SIZE] != 123)
> +			return i;
> +
> +	bpf_arena_free_pages(&arena, pages, 1025);

Free less on purpose?
* Re: [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 22:18   ` Alexei Starovoitov
@ 2025-11-15  0:58     ` Puranjay Mohan
  0 siblings, 0 replies; 22+ messages in thread
From: Puranjay Mohan @ 2025-11-15  0:58 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> +
>> +	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
>> +	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
>
> Please explain the choice of 2051 a bit better.
> I think you wanted to do 3 steps, with the last one not aligned to 1024?

Yes, I wanted to exercise the loop a couple of times and also do a final
iteration that is not aligned, to cover all edge cases. Will add a better
comment.

>> +	if (!pages)
>> +		return -1;
>> +
>> +	bpf_for(i, 0, 2051)
>> +		pages[i * PAGE_SIZE] = 123;
>> +	bpf_for(i, 0, 2051)
>> +		if (pages[i * PAGE_SIZE] != 123)
>> +			return i;
>> +
>> +	bpf_arena_free_pages(&arena, pages, 1025);
>
> Free less on purpose?

This should be 2051 too; I missed updating it here.
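For reference, how 2051 splits under a 1024-entry array cap (the cap value
is the assumption from patch 2) can be checked directly:

#include <stdio.h>

int main(void)
{
	int nr_pages = 2051, cap = 1024;

	/* 2051 = 2 * 1024 + 3: two full reuses of the array plus a short,
	 * unaligned tail, so the loop runs three times and the remainder
	 * handling is exercised as well.
	 */
	printf("%d full chunks + tail of %d pages\n",
	       nr_pages / cap, nr_pages % cap);
	return 0;
}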
end of thread, other threads:[~2025-11-16  1:16 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-14 11:16 [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-11-14 11:16 ` [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-11-14 11:47   ` bot+bpf-ci
2025-11-14 14:57     ` Puranjay Mohan
2025-11-14 21:21       ` Alexei Starovoitov
2025-11-15  0:52         ` Puranjay Mohan
2025-11-15  1:26           ` Alexei Starovoitov
2025-11-14 11:16 ` [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-11-14 11:39   ` bot+bpf-ci
2025-11-14 15:13     ` Puranjay Mohan
2025-11-14 21:25       ` Alexei Starovoitov
2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-11-14 11:47   ` bot+bpf-ci
2025-11-14 15:28     ` Puranjay Mohan
2025-11-14 21:27       ` Alexei Starovoitov
2025-11-15  0:56         ` Puranjay Mohan
2025-11-15  1:28           ` Alexei Starovoitov
2025-11-15  8:18   ` kernel test robot
2025-11-16  1:15   ` kernel test robot
2025-11-14 11:16 ` [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
2025-11-14 22:18   ` Alexei Starovoitov
2025-11-15  0:58     ` Puranjay Mohan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox