* [PATCH bpf-next v3 0/4] Remove KF_SLEEPABLE from arena kfuncs
@ 2025-11-17 16:01 Puranjay Mohan
2025-11-17 16:01 ` [PATCH bpf-next v3 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
0 siblings, 2 replies; 9+ messages in thread
From: Puranjay Mohan @ 2025-11-17 16:01 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
- Call range_tree_destroy() in error path of
populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
- Fix double mutex_unlock() in the error path of
arena_alloc_pages() (AI)
- Fix coding style issues (Alexei)
Patch 3:
- Unlock spinlock before returning from arena_vm_fault() in case
BPF_F_SEGV_ON_FAULT is set by user. (AI)
- Use __llist_del_all() in place of llist_del_all for on-stack
llist (free_pages) (Alexei)
- Fix build issues on 32-bit systems where arena.c is not compiled.
(kernel test robot)
- Make bpf_arena_alloc_pages() polymorphic so it knows if it has
been called in sleepable or non-sleepable context. This
information is passed to arena_free_pages() in the error path.
Patch 4:
- Add a better comment for the big_alloc3() test, which triggers
kmalloc_nolock()'s limit and checks whether bpf_arena_alloc_pages()
works correctly above this limit.
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Import tlbflush.h to fix build issue in loongarch. (kernel
test robot)
- Fix unused variable error in apply_range_clear_cb() (kernel
test robot)
- Call bpf_map_area_free() on error path of
populate_pgtable_except_pte() (AI)
- Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
- Cap allocation made by kmalloc_nolock() for pages array to
KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1); under the spinlock to mitigate a
race (AI)
Patch 4:
- Add a new testcase big_alloc3() in verifier_arena_large.c that
tries to allocate a large number of pages at once; this triggers
the kmalloc_nolock() limit from Patch 2 and verifies that the
loop logic works correctly.
This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:
The range_tree is now protected by a rqspinlock instead of a mutex;
this change alone is enough to make bpf_arena_reserve_pages() safe in
any context.
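The reservation semantics can be sketched with a small user-space model (an illustrative stand-in, not the kernel code: the array replaces the range_tree, the names model_reserve/allocated are hypothetical, and the single lock acquisition is implied by doing all checks and updates in one call):

```c
#include <assert.h>

/* Model of bpf_arena_reserve_pages() semantics: under one lock
 * acquisition, fail with -EBUSY if any page in [pgoff, pgoff+cnt) is
 * already allocated, otherwise mark the whole range as allocated so
 * later allocations cannot claim it.
 */
#define MODEL_NR_PAGES 64
static unsigned char allocated[MODEL_NR_PAGES]; /* range_tree stand-in */

static int model_reserve(int pgoff, int cnt)
{
	int i;

	if (pgoff < 0 || cnt <= 0 || pgoff + cnt > MODEL_NR_PAGES)
		return -22; /* -EINVAL: range outside the arena */
	for (i = pgoff; i < pgoff + cnt; i++)
		if (allocated[i])
			return -16; /* -EBUSY: cannot reserve allocated pages */
	for (i = pgoff; i < pgoff + cnt; i++)
		allocated[i] = 1;
	return 0;
}
```

Because the check and the update happen under one lock hold, a reservation either claims the whole range or changes nothing.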
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into
the bpf arena's vm_area. Now, all page table levels except the last
are populated at arena creation time; when new pages need to be
inserted, apply_to_page_range() is called again, which only does
set_pte_at() for those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any context safe.
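Point 2 above implies a batching loop, since kmalloc_nolock() cannot return buffers larger than KMALLOC_MAX_CACHE_SIZE (8 KiB with 4 KiB pages, i.e. 1024 page pointers). A minimal user-space sketch of that loop's arithmetic (the constants and the name model_batches are illustrative, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Model of the Patch 2 batching: the pages array is capped at
 * KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *) entries and reused,
 * so an arbitrarily large page_cnt is covered in fixed-size batches.
 */
#define MODEL_KMALLOC_MAX_CACHE_SIZE 8192ul /* assumes 4 KiB pages */

/* Returns how many alloc+map batches are needed for page_cnt pages. */
static long model_batches(long page_cnt)
{
	long cap = MODEL_KMALLOC_MAX_CACHE_SIZE / sizeof(void *); /* 1024 on 64-bit */
	long batches = 0, remaining = page_cnt;

	while (remaining > 0) {
		long this_batch = remaining < cap ? remaining : cap;

		remaining -= this_batch;
		batches++;
	}
	return batches;
}
```

So a single large request (e.g. the big_alloc3() selftest) simply turns into several capped batches instead of one oversized allocation.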
bpf_arena_free_pages() has to perform the following steps:
1. Update the range_tree.
2. vm_area_unmap_pages(): unmap pages from the kernel vm_area.
3. Flush the TLB: already done as part of step 2.
4. zap_pages(): unmap pages from the user page tables.
5. Free the pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged, except for the following:
1. The rqspinlock is now taken instead of the mutex for the range tree.
2. Instead of vm_area_unmap_pages(), which can free intermediate page
table levels, apply_to_existing_page_range() is used with a callback
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
When arena_free_pages() is called from a non-sleepable context, or fails to
acquire the rqspinlock in the sleepable case, a lock-less list of struct
arena_free_span is used to queue the uaddr and page count. kmalloc_nolock()
is used to allocate the arena_free_span; this allocation can fail, but that
is a trade-off we have to make for frees done from non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddrs and page counts.
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used to link them.
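The queue-and-drain shape of this deferred-free path can be modeled in a few lines of single-threaded user-space C (an illustrative sketch: malloc/free stand in for kmalloc_nolock()/kfree_nolock(), the names free_span/defer_free/drain_free_spans are hypothetical, and the real llist is a lock-less multi-producer list, not this plain LIFO):

```c
#include <stdlib.h>

/* Model of the non-sleepable free path in Patch 3: the caller pushes a
 * (uaddr, page_cnt) span onto a list and returns immediately; a worker
 * later drains the list and does the actual cleanup.
 */
struct free_span {
	struct free_span *next;
	unsigned long uaddr;
	unsigned int page_cnt;
};

static struct free_span *free_spans; /* models arena->free_spans */

/* Models arena_free_pages() in the non-sleepable case: queue and defer. */
static int defer_free(unsigned long uaddr, unsigned int page_cnt)
{
	struct free_span *s = malloc(sizeof(*s)); /* kmalloc_nolock() stand-in */

	if (!s)
		return -1; /* the allocation may fail; the free is then lost */
	s->uaddr = uaddr;
	s->page_cnt = page_cnt;
	s->next = free_spans;
	free_spans = s;
	return 0;
}

/* Models arena_free_worker(): drain the list; returns total pages freed.
 * The real worker clears PTEs, flushes the TLB, and zaps user mappings.
 */
static unsigned long drain_free_spans(void)
{
	unsigned long freed = 0;
	struct free_span *s = free_spans;

	free_spans = NULL; /* llist_del_all() stand-in */
	while (s) {
		struct free_span *next = s->next;

		freed += s->page_cnt;
		free(s);
		s = next;
	}
	return freed;
}
```

In the kernel the hand-off goes through irq_work and a workqueue because the expensive steps (TLB flush, zap_pages()) must run in sleepable context.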
Puranjay Mohan (4):
bpf: arena: populate vm_area without allocating memory
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
bpf: arena: make arena kfuncs any context safe
selftests: bpf: test non-sleepable arena allocations
include/linux/bpf.h | 15 +
kernel/bpf/arena.c | 371 +++++++++++++++---
kernel/bpf/verifier.c | 10 +
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 +
.../selftests/bpf/progs/verifier_arena.c | 185 +++++++++
.../bpf/progs/verifier_arena_large.c | 29 ++
7 files changed, 582 insertions(+), 59 deletions(-)
--
2.47.1
^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH bpf-next v3 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-17 16:01 [PATCH bpf-next v3 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
@ 2025-11-17 16:01 ` Puranjay Mohan
  2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
  1 sibling, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2025-11-17 16:01 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

vm_area_map_pages() may allocate memory while inserting pages into the
bpf arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc
non-sleepable, change bpf arena to populate pages without allocating
memory:

- At arena creation time, populate all page table levels except the
  last level.
- When new pages need to be inserted, call apply_to_page_range() again
  with apply_range_set_cb(), which will only set_pte_at() those pages
  and will not allocate memory.
- When freeing pages, call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the PTE for the page being removed.
  This doesn't free intermediate page table levels.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 77 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 71 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 1074ac4459f2..214a4da54162 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -7,6 +7,7 @@
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <asm/tlbflush.h>
 #include "range_tree.h"
 
 /*
@@ -92,6 +93,62 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
 }
 
+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i++];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	return 0;
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	/* get page and free it */
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(&init_mm, addr, pte);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	__free_page(page);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 {
 	struct vm_struct *kern_vm;
@@ -144,6 +201,12 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	err = populate_pgtable_except_pte(arena);
+	if (err) {
+		range_tree_destroy(&arena->rt);
+		bpf_map_area_free(arena);
+		goto err;
+	}
 
 	return &arena->map;
 err:
@@ -286,6 +349,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return VM_FAULT_SIGSEGV;
 
+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
@@ -293,7 +357,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 	}
 
-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		__free_page(page);
@@ -428,7 +492,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct page **pages = NULL;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -465,6 +529,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;
 
+	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
@@ -477,8 +542,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
 	 * lower 32-bit and it's ok.
 	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
+	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
 	if (ret) {
 		for (i = 0; i < page_cnt; i++)
 			__free_page(pages[i]);
@@ -545,8 +610,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 		 * page_cnt is big it's faster to do the batched zap.
 		 */
 			zap_pages(arena, full_uaddr, 1);
-		vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
-		__free_page(page);
+		apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
+					     NULL);
 	}
 }
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-11-17 16:01 [PATCH bpf-next v3 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
  2025-11-17 16:01 ` [PATCH bpf-next v3 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
@ 2025-11-17 16:06 ` Puranjay Mohan
  2025-11-17 16:06 ` [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
  ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Puranjay Mohan @ 2025-11-17 16:06 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock(), which doesn't sleep or take any
locks.

kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, cap the allocation done by kmalloc_nolock() at
KMALLOC_MAX_CACHE_SIZE (1024 entries of 8 bytes each) and reuse the
array in a loop.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 83 +++++++++++++++++++++++++++++++---------------
 1 file changed, 57 insertions(+), 26 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 214a4da54162..1d0b49a39ad0 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -43,6 +43,8 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)
 
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+
 struct bpf_arena {
 	struct bpf_map map;
 	u64 user_vm_start;
@@ -492,7 +494,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	struct apply_range_data data;
 	struct page **pages = NULL;
+	long remaining, mapped = 0;
+	long alloc_pages;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -509,17 +514,21 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	}
 
-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	/*
+	 * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed.
+	 */
+	alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+	pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
 	if (!pages)
 		return 0;
+	data.pages = pages;
 
-	guard(mutex)(&arena->lock);
+	mutex_lock(&arena->lock);
 
 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -527,34 +536,56 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
-
-	struct apply_range_data data = { .pages = pages, .i = 0 };
-	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
-	if (ret)
-		goto out;
+		goto out_unlock_free_pages;
 
+	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
-	/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
-	 * will not overflow 32-bit. Lower 32-bit need to represent
-	 * contiguous user address range.
-	 * Map these pages at kern_vm_start base.
-	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
-	 * lower 32-bit and it's ok.
-	 */
-	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
-				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
-	if (ret) {
-		for (i = 0; i < page_cnt; i++)
-			__free_page(pages[i]);
-		goto out;
+
+	while (remaining) {
+		long this_batch = min(remaining, alloc_pages);
+
+		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+		memset(pages, 0, this_batch * sizeof(struct page *));
+		data.i = 0;
+
+		ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+		if (ret)
+			goto out;
+
+		/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+		 * will not overflow 32-bit. Lower 32-bit need to represent
+		 * contiguous user address range.
+		 * Map these pages at kern_vm_start base.
+		 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+		 * lower 32-bit and it's ok.
+		 */
+		ret = apply_to_page_range(&init_mm,
+					  kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+					  this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+		if (ret) {
+			/* data.i pages were mapped, account them and free the remaining */
+			mapped += data.i;
+			for (i = data.i; i < this_batch; i++)
+				__free_page(pages[i]);
+			goto out;
+		}
+
+		mapped += this_batch;
+		remaining -= this_batch;
 	}
-	kvfree(pages);
+	mutex_unlock(&arena->lock);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
-	range_tree_set(&arena->rt, pgoff, page_cnt);
+	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
+	mutex_unlock(&arena->lock);
+	if (mapped)
+		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
+	goto out_free_pages;
+out_unlock_free_pages:
+	mutex_unlock(&arena->lock);
 out_free_pages:
-	kvfree(pages);
+	kfree_nolock(pages);
 	return 0;
 }
-- 
2.47.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread
* [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
@ 2025-11-17 16:06 ` Puranjay Mohan
  2025-11-21 21:25   ` Alexei Starovoitov
  2025-11-17 16:06 ` [PATCH bpf-next v3 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
  2025-11-21 21:15 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Alexei Starovoitov
  2 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2025-11-17 16:06 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Make arena related kfuncs any context safe with the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for the range tree and
use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free
pages from any context. apply_range_set/clear_cb() with
apply_to_page_range() has already made populating the vm_area in
bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages():
Defer the main logic to a workqueue if it is called from a
non-sleepable context. specialize_kfunc() is used to replace the
sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable()
when the verifier detects that the call is from a non-sleepable
context.

In the non-sleepable case, arena_free_pages() queues the address and
the page count to be freed to a lock-less list of struct
arena_free_span and raises an irq_work. The irq_work handler calls
schedule_work(), as that is safe to do from irq context.
arena_free_worker() (the workqueue handler) iterates these spans and
clears PTEs, flushes the TLB, zaps pages, and calls __free_page().
Signed-off-by: Puranjay Mohan <puranjay@kernel.org> --- include/linux/bpf.h | 15 +++ kernel/bpf/arena.c | 249 +++++++++++++++++++++++++++++++++++------- kernel/bpf/verifier.c | 10 ++ 3 files changed, 233 insertions(+), 41 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 09d5dc541d1c..8339b3bd8295 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -673,6 +673,21 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj); int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags, struct bpf_dynptr *ptr__uninit); +#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) +void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id, + u64 flags); +void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt); +#else +static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, + int node_id, u64 flags) +{ +} + +static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt) +{ +} +#endif + extern const struct bpf_map_ops bpf_map_offload_ops; /* bpf_type_flag contains a set of flags that are applicable to the values of diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c index 1d0b49a39ad0..8134d907b8e2 100644 --- a/kernel/bpf/arena.c +++ b/kernel/bpf/arena.c @@ -3,7 +3,9 @@ #include <linux/bpf.h> #include <linux/btf.h> #include <linux/err.h> +#include <linux/irq_work.h> #include "linux/filter.h" +#include <linux/llist.h> #include <linux/btf_ids.h> #include <linux/vmalloc.h> #include <linux/pagemap.h> @@ -43,7 +45,7 @@ #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1) #define KERN_VM_SZ (SZ_4G + GUARD_SZ) -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt); +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable); struct bpf_arena { struct bpf_map map; @@ -51,8 +53,23 @@ struct 
bpf_arena { u64 user_vm_end; struct vm_struct *kern_vm; struct range_tree rt; + /* protects rt */ + rqspinlock_t spinlock; struct list_head vma_list; + /* protects vma_list */ struct mutex lock; + struct irq_work free_irq; + struct work_struct free_work; + struct llist_head free_spans; +}; + +static void arena_free_worker(struct work_struct *work); +static void arena_free_irq(struct irq_work *iw); + +struct arena_free_span { + struct llist_node node; + unsigned long uaddr; + u32 page_cnt; }; u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena) @@ -120,7 +137,7 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data) return 0; } -static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data) +static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages) { pte_t old_pte; struct page *page; @@ -130,17 +147,16 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data) if (pte_none(old_pte) || !pte_present(old_pte)) return 0; /* nothing to do */ - /* get page and free it */ + /* get page and clear pte */ page = pte_page(old_pte); if (WARN_ON_ONCE(!page)) return -EINVAL; pte_clear(&init_mm, addr, pte); - /* ensure no stale TLB entries */ - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); - - __free_page(page); + /* Add page to the list so it is freed later */ + if (free_pages) + __llist_add(&page->pcp_llist, free_pages); return 0; } @@ -195,6 +211,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr) arena->user_vm_end = arena->user_vm_start + vm_range; INIT_LIST_HEAD(&arena->vma_list); + init_llist_head(&arena->free_spans); + init_irq_work(&arena->free_irq, arena_free_irq); + INIT_WORK(&arena->free_work, arena_free_worker); bpf_map_init_from_attr(&arena->map, attr); range_tree_init(&arena->rt); err = range_tree_set(&arena->rt, 0, attr->max_entries); @@ -203,6 +222,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr) goto err; } mutex_init(&arena->lock); + 
raw_res_spin_lock_init(&arena->spinlock); err = populate_pgtable_except_pte(arena); if (err) { range_tree_destroy(&arena->rt); @@ -249,6 +269,10 @@ static void arena_map_free(struct bpf_map *map) if (WARN_ON_ONCE(!list_empty(&arena->vma_list))) return; + /* Ensure no pending deferred frees */ + irq_work_sync(&arena->free_irq); + flush_work(&arena->free_work); + /* * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area(). * It unmaps everything from vmalloc area and clears pgtables. @@ -332,12 +356,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) struct bpf_arena *arena = container_of(map, struct bpf_arena, map); struct page *page; long kbase, kaddr; + unsigned long flags; int ret; kbase = bpf_arena_get_kern_vm_start(arena); kaddr = kbase + (u32)(vmf->address); - guard(mutex)(&arena->lock); + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) + /* + * This is an impossible case and would only trigger if res_spin_lock is buggy or + * due to another kernel bug. 
+ */ + return VM_FAULT_RETRY; + page = vmalloc_to_page((void *)kaddr); if (page) /* already have a page vmap-ed */ @@ -345,30 +376,34 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) /* User space requested to segfault when page is not allocated by bpf prog */ - return VM_FAULT_SIGSEGV; + goto out_unlock_sigsegv; ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); if (ret) - return VM_FAULT_SIGSEGV; + goto out_unlock_sigsegv; struct apply_range_data data = { .pages = &page, .i = 0 }; /* Account into memcg of the process that created bpf_arena */ ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); if (ret) { range_tree_set(&arena->rt, vmf->pgoff, 1); - return VM_FAULT_SIGSEGV; + goto out_unlock_sigsegv; } ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); if (ret) { range_tree_set(&arena->rt, vmf->pgoff, 1); - __free_page(page); - return VM_FAULT_SIGSEGV; + free_pages_nolock(page, 0); + goto out_unlock_sigsegv; } out: page_ref_add(page, 1); + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); vmf->page = page; return 0; +out_unlock_sigsegv: + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); + return VM_FAULT_SIGSEGV; } static const struct vm_operations_struct arena_vm_ops = { @@ -489,7 +524,8 @@ static u64 clear_lo32(u64 val) * Allocate pages and vmap them into kernel vmalloc area. * Later the pages will be mmaped into user space vma. 
*/ -static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id) +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id, + bool sleepable) { /* user_vm_end/start are fixed before bpf prog runs */ long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; @@ -498,6 +534,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt struct page **pages = NULL; long remaining, mapped = 0; long alloc_pages; + unsigned long flags; long pgoff = 0; u32 uaddr32; int ret, i; @@ -523,7 +560,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt return 0; data.pages = pages; - mutex_lock(&arena->lock); + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) + goto out_free_pages; if (uaddr) { ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); @@ -566,24 +604,25 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt /* data.i pages were mapped, account them and free the remaining */ mapped += data.i; for (i = data.i; i < this_batch; i++) - __free_page(pages[i]); + free_pages_nolock(pages[i], 0); goto out; } mapped += this_batch; remaining -= this_batch; } - mutex_unlock(&arena->lock); + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); kfree_nolock(pages); return clear_lo32(arena->user_vm_start) + uaddr32; out: range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped); - mutex_unlock(&arena->lock); + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); if (mapped) - arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped); + arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped, + sleepable); goto out_free_pages; out_unlock_free_pages: - mutex_unlock(&arena->lock); + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); out_free_pages: kfree_nolock(pages); return 0; @@ -598,42 +637,65 @@ static void zap_pages(struct bpf_arena *arena, long 
uaddr, long page_cnt) { struct vma_list *vml; + guard(mutex)(&arena->lock); + /* iterate link list under lock */ list_for_each_entry(vml, &arena->vma_list, head) zap_page_range_single(vml->vma, uaddr, PAGE_SIZE * page_cnt, NULL); } -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable) { u64 full_uaddr, uaddr_end; - long kaddr, pgoff, i; + long kaddr, pgoff; struct page *page; + struct llist_head free_pages; + struct llist_node *pos, *t; + struct arena_free_span *s; + unsigned long flags; + int ret = 0; /* only aligned lower 32-bit are relevant */ uaddr = (u32)uaddr; uaddr &= PAGE_MASK; + kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT)); if (full_uaddr >= uaddr_end) return; page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT; + pgoff = compute_pgoff(arena, uaddr); - guard(mutex)(&arena->lock); + if (!sleepable) + goto defer; + + ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags); + /* + * Can't proceed without holding the spinlock so defer the free + */ + if (ret) + goto defer; - pgoff = compute_pgoff(arena, uaddr); - /* clear range */ range_tree_set(&arena->rt, pgoff, page_cnt); + init_llist_head(&free_pages); + /* clear ptes and collect struct pages */ + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, + apply_range_clear_cb, &free_pages); + + /* drop the lock to do the tlb flush and zap pages */ + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); + + /* ensure no stale TLB entries */ + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); + if (page_cnt > 1) /* bulk zap if multiple pages being freed */ zap_pages(arena, full_uaddr, page_cnt); - kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; - for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += 
PAGE_SIZE) { - page = vmalloc_to_page((void *)kaddr); - if (!page) - continue; + llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) { + page = llist_entry(pos, struct page, pcp_llist); if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */ /* Optimization for the common case of page_cnt==1: * If page wasn't mapped into some user vma there @@ -641,9 +703,20 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) * page_cnt is big it's faster to do the batched zap. */ zap_pages(arena, full_uaddr, 1); - apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb, - NULL); + __free_page(page); } + + return; + +defer: + s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1); + if (!s) + return; + + s->page_cnt = page_cnt; + s->uaddr = uaddr; + llist_add(&s->node, &arena->free_spans); + irq_work_queue(&arena->free_irq); } /* @@ -653,6 +726,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt) { long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; + unsigned long flags; long pgoff; int ret; @@ -663,15 +737,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt if (pgoff + page_cnt > page_cnt_max) return -EINVAL; - guard(mutex)(&arena->lock); + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) + return -EBUSY; /* Cannot guard already allocated pages. */ ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); - if (ret) - return -EBUSY; + if (ret) { + ret = -EBUSY; + goto out; + } /* "Allocate" the region to prevent it from being allocated. 
*/ - return range_tree_clear(&arena->rt, pgoff, page_cnt); + ret = range_tree_clear(&arena->rt, pgoff, page_cnt); +out: + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); + return ret; +} + +static void arena_free_worker(struct work_struct *work) +{ + struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work); + struct llist_node *list, *pos, *t; + struct arena_free_span *s; + u64 arena_vm_start, user_vm_start; + struct llist_head free_pages; + struct page *page; + unsigned long full_uaddr; + long kaddr, page_cnt, pgoff; + unsigned long flags; + + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) { + schedule_work(work); + return; + } + + init_llist_head(&free_pages); + arena_vm_start = bpf_arena_get_kern_vm_start(arena); + user_vm_start = bpf_arena_get_user_vm_start(arena); + + list = llist_del_all(&arena->free_spans); + llist_for_each(pos, list) { + s = llist_entry(pos, struct arena_free_span, node); + page_cnt = s->page_cnt; + kaddr = arena_vm_start + s->uaddr; + pgoff = compute_pgoff(arena, s->uaddr); + + /* clear ptes and collect pages in free_pages llist */ + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, + apply_range_clear_cb, &free_pages); + + range_tree_set(&arena->rt, pgoff, page_cnt); + } + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); + + /* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */ + llist_for_each_safe(pos, t, list) { + s = llist_entry(pos, struct arena_free_span, node); + page_cnt = s->page_cnt; + full_uaddr = user_vm_start + s->uaddr; + kaddr = arena_vm_start + s->uaddr; + + /* ensure no stale TLB entries */ + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); + + /* remove pages from user vmas */ + zap_pages(arena, full_uaddr, page_cnt); + + kfree_nolock(s); + } + + /* free all pages collected by apply_to_existing_page_range() in the first loop */ + llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) { + page = 
llist_entry(pos, struct page, pcp_llist); + __free_page(page); + } +} + +static void arena_free_irq(struct irq_work *iw) +{ + struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq); + + schedule_work(&arena->free_work); } __bpf_kfunc_start_defs(); @@ -685,9 +831,20 @@ __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_ if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt) return NULL; - return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id); + return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, true); } +void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, + int node_id, u64 flags) +{ + struct bpf_map *map = p__map; + struct bpf_arena *arena = container_of(map, struct bpf_arena, map); + + if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt) + return NULL; + + return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false); +} __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt) { struct bpf_map *map = p__map; @@ -695,7 +852,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) return; - arena_free_pages(arena, (long)ptr__ign, page_cnt); + arena_free_pages(arena, (long)ptr__ign, page_cnt, true); +} + +void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt) +{ + struct bpf_map *map = p__map; + struct bpf_arena *arena = container_of(map, struct bpf_arena, map); + + if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) + return; + arena_free_pages(arena, (long)ptr__ign, page_cnt, false); } __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt) @@ -714,9 +881,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c __bpf_kfunc_end_defs(); BTF_KFUNCS_START(arena_kfuncs) 
-BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) BTF_KFUNCS_END(arena_kfuncs) static const struct btf_kfunc_id_set common_kfunc_set = { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 098dd7f21c89..14839f178a7c 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -12349,6 +12349,8 @@ enum special_kfunc_type { KF___bpf_trap, KF_bpf_task_work_schedule_signal_impl, KF_bpf_task_work_schedule_resume_impl, + KF_bpf_arena_alloc_pages, + KF_bpf_arena_free_pages, }; BTF_ID_LIST(special_kfunc_list) @@ -12423,6 +12425,8 @@ BTF_ID(func, bpf_dynptr_file_discard) BTF_ID(func, __bpf_trap) BTF_ID(func, bpf_task_work_schedule_signal_impl) BTF_ID(func, bpf_task_work_schedule_resume_impl) +BTF_ID(func, bpf_arena_alloc_pages) +BTF_ID(func, bpf_arena_free_pages) static bool is_task_work_add_kfunc(u32 func_id) { @@ -22380,6 +22384,12 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc } else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) { if (!env->insn_aux_data[insn_idx].non_sleepable) addr = (unsigned long)bpf_dynptr_from_file_sleepable; + } else if (func_id == special_kfunc_list[KF_bpf_arena_alloc_pages]) { + if (env->insn_aux_data[insn_idx].non_sleepable) + addr = (unsigned long)bpf_arena_alloc_pages_non_sleepable; + } else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) { + if (env->insn_aux_data[insn_idx].non_sleepable) + addr = (unsigned long)bpf_arena_free_pages_non_sleepable; } desc->addr = addr; return 0; -- 2.47.1 ^ 
* Re: [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-17 16:06 ` [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan @ 2025-11-21 21:25 ` Alexei Starovoitov 0 siblings, 0 replies; 9+ messages in thread From: Alexei Starovoitov @ 2025-11-21 21:25 UTC (permalink / raw) To: Puranjay Mohan Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team On Mon, Nov 17, 2025 at 8:06 AM Puranjay Mohan <puranjay@kernel.org> wrote: > > Make arena related kfuncs any context safe by the following changes: > > bpf_arena_alloc_pages() and bpf_arena_reserve_pages(): > Replace the usage of the mutex with a rqspinlock for range tree and use > kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages > from any context. > apply_range_set/clear_cb() with apply_to_page_range() has already made > populating the vm_area in bpf_arena_alloc_pages() any context safe. > > bpf_arena_free_pages(): defer the main logic to a workqueue if it is > called from a non-sleepable context. > > specialize_kfunc() is used to replace the sleepable arena_free_pages() > with bpf_arena_free_pages_non_sleepable() when the verifier detects the > call is from a non-sleepable context. > > In the non-sleepable case, arena_free_pages() queues the address and the > page count to be freed to a lock-less list of struct arena_free_spans > and raises an irq_work. The irq_work handler calls schedule_work() as > it is safe to be called from irq context. arena_free_worker() (the work > queue handler) iterates these spans and clears ptes, flushes tlb, zaps > pages, and calls __free_page(). 
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org> > --- > include/linux/bpf.h | 15 +++ > kernel/bpf/arena.c | 249 +++++++++++++++++++++++++++++++++++------- > kernel/bpf/verifier.c | 10 ++ > 3 files changed, 233 insertions(+), 41 deletions(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 09d5dc541d1c..8339b3bd8295 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -673,6 +673,21 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj); > int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags, > struct bpf_dynptr *ptr__uninit); > > +#if defined(CONFIG_MMU) && defined(CONFIG_64BIT) > +void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id, > + u64 flags); > +void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt); > +#else > +static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, > + int node_id, u64 flags) > +{ > +} > + > +static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt) > +{ > +} > +#endif > + > extern const struct bpf_map_ops bpf_map_offload_ops; > > /* bpf_type_flag contains a set of flags that are applicable to the values of > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index 1d0b49a39ad0..8134d907b8e2 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c > @@ -3,7 +3,9 @@ > #include <linux/bpf.h> > #include <linux/btf.h> > #include <linux/err.h> > +#include <linux/irq_work.h> > #include "linux/filter.h" > +#include <linux/llist.h> > #include <linux/btf_ids.h> > #include <linux/vmalloc.h> > #include <linux/pagemap.h> > @@ -43,7 +45,7 @@ > #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1) > #define KERN_VM_SZ (SZ_4G + GUARD_SZ) > > -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt); > +static void arena_free_pages(struct bpf_arena *arena, long 
uaddr, long page_cnt, bool sleepable); > > struct bpf_arena { > struct bpf_map map; > @@ -51,8 +53,23 @@ struct bpf_arena { > u64 user_vm_end; > struct vm_struct *kern_vm; > struct range_tree rt; > + /* protects rt */ > + rqspinlock_t spinlock; > struct list_head vma_list; > + /* protects vma_list */ > struct mutex lock; > + struct irq_work free_irq; > + struct work_struct free_work; > + struct llist_head free_spans; > +}; > + > +static void arena_free_worker(struct work_struct *work); > +static void arena_free_irq(struct irq_work *iw); > + > +struct arena_free_span { > + struct llist_node node; > + unsigned long uaddr; > + u32 page_cnt; > }; > > u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena) > @@ -120,7 +137,7 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data) > return 0; > } > > -static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data) > +static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages) > { > pte_t old_pte; > struct page *page; > @@ -130,17 +147,16 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data) > if (pte_none(old_pte) || !pte_present(old_pte)) > return 0; /* nothing to do */ > > - /* get page and free it */ > + /* get page and clear pte */ Just delete the comment. The code is obvious enough. 
> page = pte_page(old_pte); > if (WARN_ON_ONCE(!page)) > return -EINVAL; > > pte_clear(&init_mm, addr, pte); > > - /* ensure no stale TLB entries */ > - flush_tlb_kernel_range(addr, addr + PAGE_SIZE); > - > - __free_page(page); > + /* Add page to the list so it is freed later */ > + if (free_pages) > + __llist_add(&page->pcp_llist, free_pages); > > return 0; > } > @@ -195,6 +211,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr) > arena->user_vm_end = arena->user_vm_start + vm_range; > > INIT_LIST_HEAD(&arena->vma_list); > + init_llist_head(&arena->free_spans); > + init_irq_work(&arena->free_irq, arena_free_irq); > + INIT_WORK(&arena->free_work, arena_free_worker); > bpf_map_init_from_attr(&arena->map, attr); > range_tree_init(&arena->rt); > err = range_tree_set(&arena->rt, 0, attr->max_entries); > @@ -203,6 +222,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr) > goto err; > } > mutex_init(&arena->lock); > + raw_res_spin_lock_init(&arena->spinlock); > err = populate_pgtable_except_pte(arena); > if (err) { > range_tree_destroy(&arena->rt); > @@ -249,6 +269,10 @@ static void arena_map_free(struct bpf_map *map) > if (WARN_ON_ONCE(!list_empty(&arena->vma_list))) > return; > > + /* Ensure no pending deferred frees */ > + irq_work_sync(&arena->free_irq); > + flush_work(&arena->free_work); > + > /* > * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area(). > * It unmaps everything from vmalloc area and clears pgtables. 
> @@ -332,12 +356,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > struct bpf_arena *arena = container_of(map, struct bpf_arena, map); > struct page *page; > long kbase, kaddr; > + unsigned long flags; > int ret; > > kbase = bpf_arena_get_kern_vm_start(arena); > kaddr = kbase + (u32)(vmf->address); > > - guard(mutex)(&arena->lock); > + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) > + /* > + * This is an impossible case and would only trigger if res_spin_lock is buggy or > + * due to another kernel bug. > + */ Looking at this comment again I'm starting to feel that it opens a can of worms, since what does it mean to retry when there is a kernel bug or rqspinlock bug? Should it be BUG_ON? etc. Let's just say /* Make a reasonable effort to address impossible case */ > + return VM_FAULT_RETRY; > + > page = vmalloc_to_page((void *)kaddr); > if (page) > /* already have a page vmap-ed */ > @@ -345,30 +376,34 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > > if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) > /* User space requested to segfault when page is not allocated by bpf prog */ > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > > ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); > if (ret) > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > > struct apply_range_data data = { .pages = &page, .i = 0 }; > /* Account into memcg of the process that created bpf_arena */ > ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > } > > ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - __free_page(page); > - return VM_FAULT_SIGSEGV; > + free_pages_nolock(page, 0); > + goto out_unlock_sigsegv; > } > out: > page_ref_add(page, 1); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > vmf->page = page; > 
return 0; > +out_unlock_sigsegv: > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > + return VM_FAULT_SIGSEGV; > } > > static const struct vm_operations_struct arena_vm_ops = { > @@ -489,7 +524,8 @@ static u64 clear_lo32(u64 val) > * Allocate pages and vmap them into kernel vmalloc area. > * Later the pages will be mmaped into user space vma. > */ > -static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id) > +static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id, > + bool sleepable) > { > /* user_vm_end/start are fixed before bpf prog runs */ > long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; > @@ -498,6 +534,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > struct page **pages = NULL; > long remaining, mapped = 0; > long alloc_pages; > + unsigned long flags; > long pgoff = 0; > u32 uaddr32; > int ret, i; > @@ -523,7 +560,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > return 0; > data.pages = pages; > > - mutex_lock(&arena->lock); > + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) > + goto out_free_pages; > > if (uaddr) { > ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); > @@ -566,24 +604,25 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > /* data.i pages were mapped, account them and free the remaining */ > mapped += data.i; > for (i = data.i; i < this_batch; i++) > - __free_page(pages[i]); > + free_pages_nolock(pages[i], 0); > goto out; > } > > mapped += this_batch; > remaining -= this_batch; > } > - mutex_unlock(&arena->lock); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > kfree_nolock(pages); > return clear_lo32(arena->user_vm_start) + uaddr32; > out: > range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped); > - mutex_unlock(&arena->lock); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, 
flags); > if (mapped) > - arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped); > + arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped, > + sleepable); > goto out_free_pages; > out_unlock_free_pages: > - mutex_unlock(&arena->lock); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > out_free_pages: > kfree_nolock(pages); > return 0; > @@ -598,42 +637,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > { > struct vma_list *vml; > > + guard(mutex)(&arena->lock); > + /* iterate link list under lock */ > list_for_each_entry(vml, &arena->vma_list, head) > zap_page_range_single(vml->vma, uaddr, > PAGE_SIZE * page_cnt, NULL); > } > > -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable) > { > u64 full_uaddr, uaddr_end; > - long kaddr, pgoff, i; > + long kaddr, pgoff; > struct page *page; > + struct llist_head free_pages; > + struct llist_node *pos, *t; > + struct arena_free_span *s; > + unsigned long flags; > + int ret = 0; > > /* only aligned lower 32-bit are relevant */ > uaddr = (u32)uaddr; > uaddr &= PAGE_MASK; > + kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; > full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; > uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT)); > if (full_uaddr >= uaddr_end) > return; > > page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT; > + pgoff = compute_pgoff(arena, uaddr); > > - guard(mutex)(&arena->lock); > + if (!sleepable) > + goto defer; > + > + ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags); > + /* > + * Can't proceed without holding the spinlock so defer the free > + */ This can be a single line comment like /* Can't proceed ... */ ^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH bpf-next v3 4/4] selftests: bpf: test non-sleepable arena allocations 2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan 2025-11-17 16:06 ` [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan @ 2025-11-17 16:06 ` Puranjay Mohan 2025-11-21 21:15 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Alexei Starovoitov 2 siblings, 0 replies; 9+ messages in thread From: Puranjay Mohan @ 2025-11-17 16:06 UTC (permalink / raw) To: bpf Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team As arena kfuncs can now be called from non-sleepable contexts, test this by adding non-sleepable copies of the tests in verifier_arena; these copies use a socket program instead of a syscall program. Add a new test case in verifier_arena_large to check that bpf_arena_alloc_pages() works for more than 1024 pages. 1024 * sizeof(struct page *) is the upper limit of what kmalloc_nolock() can allocate, but bpf_arena_alloc_pages() should still succeed because it reuses this array in a loop. Augment the arena_list selftest to also run in a non-sleepable context by taking rcu_read_lock(). 
Signed-off-by: Puranjay Mohan <puranjay@kernel.org> --- .../selftests/bpf/prog_tests/arena_list.c | 20 +- .../testing/selftests/bpf/progs/arena_list.c | 11 ++ .../selftests/bpf/progs/verifier_arena.c | 185 ++++++++++++++++++ .../bpf/progs/verifier_arena_large.c | 29 +++ 4 files changed, 240 insertions(+), 5 deletions(-) diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c index d15867cddde0..4f2866a615ce 100644 --- a/tools/testing/selftests/bpf/prog_tests/arena_list.c +++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c @@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head) return sum; } -static void test_arena_list_add_del(int cnt) +static void test_arena_list_add_del(int cnt, bool nonsleepable) { LIBBPF_OPTS(bpf_test_run_opts, opts); struct arena_list *skel; int expected_sum = (u64)cnt * (cnt - 1) / 2; int ret, sum; - skel = arena_list__open_and_load(); - if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load")) + skel = arena_list__open(); + if (!ASSERT_OK_PTR(skel, "arena_list__open")) return; + skel->rodata->nonsleepable = nonsleepable; + + ret = arena_list__load(skel); + if (!ASSERT_OK(ret, "arena_list__load")) + goto out; + skel->bss->cnt = cnt; ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts); ASSERT_OK(ret, "ret_add"); @@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt) void test_arena_list(void) { if (test__start_subtest("arena_list_1")) - test_arena_list_add_del(1); + test_arena_list_add_del(1, false); if (test__start_subtest("arena_list_1000")) - test_arena_list_add_del(1000); + test_arena_list_add_del(1000, false); + if (test__start_subtest("arena_list_1_nonsleepable")) + test_arena_list_add_del(1, true); + if (test__start_subtest("arena_list_1000_nonsleepable")) + test_arena_list_add_del(1000, true); } diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c index 
3a2ddcacbea6..235d8cc95bdd 100644 --- a/tools/testing/selftests/bpf/progs/arena_list.c +++ b/tools/testing/selftests/bpf/progs/arena_list.c @@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head; int list_sum; int cnt; bool skip = false; +const volatile bool nonsleepable = false; #ifdef __BPF_FEATURE_ADDR_SPACE_CAST long __arena arena_sum; @@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1"); int zero; +void bpf_rcu_read_lock(void) __ksym; +void bpf_rcu_read_unlock(void) __ksym; + SEC("syscall") int arena_list_add(void *ctx) { @@ -71,6 +75,10 @@ int arena_list_del(void *ctx) struct elem __arena *n; int sum = 0; + /* Take rcu_read_lock to test non-sleepable context */ + if (nonsleepable) + bpf_rcu_read_lock(); + arena_sum = 0; list_for_each_entry(n, list_head, node) { sum += n->value; @@ -79,6 +87,9 @@ int arena_list_del(void *ctx) bpf_free(n); } list_sum = sum; + + if (nonsleepable) + bpf_rcu_read_unlock(); #else skip = true; #endif diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c index 7f4827eede3c..4a9d96344813 100644 --- a/tools/testing/selftests/bpf/progs/verifier_arena.c +++ b/tools/testing/selftests/bpf/progs/verifier_arena.c @@ -21,6 +21,37 @@ struct { #endif } arena SEC(".maps"); +SEC("socket") +__success __retval(0) +int basic_alloc1_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + volatile int __arena *page1, *page2, *no_page; + + page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0); + if (!page1) + return 1; + *page1 = 1; + page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0); + if (!page2) + return 2; + *page2 = 2; + no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0); + if (no_page) + return 3; + if (*page1 != 1) + return 4; + if (*page2 != 2) + return 5; + bpf_arena_free_pages(&arena, (void __arena *)page2, 1); + if (*page1 != 1) + return 6; + if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored 
value */ + return 7; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int basic_alloc1(void *ctx) @@ -60,6 +91,44 @@ int basic_alloc1(void *ctx) return 0; } +SEC("socket") +__success __retval(0) +int basic_alloc2_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + volatile char __arena *page1, *page2, *page3, *page4; + + page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0); + if (!page1) + return 1; + page2 = page1 + __PAGE_SIZE; + page3 = page1 + __PAGE_SIZE * 2; + page4 = page1 - __PAGE_SIZE; + *page1 = 1; + *page2 = 2; + *page3 = 3; + *page4 = 4; + if (*page1 != 1) + return 1; + if (*page2 != 2) + return 2; + if (*page3 != 0) + return 3; + if (*page4 != 0) + return 4; + bpf_arena_free_pages(&arena, (void __arena *)page1, 2); + if (*page1 != 0 && *page1 != 1) + return 5; + if (*page2 != 0 && *page2 != 2) + return 6; + if (*page3 != 0) + return 7; + if (*page4 != 0) + return 8; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int basic_alloc2(void *ctx) @@ -102,6 +171,19 @@ struct bpf_arena___l { struct bpf_map map; } __attribute__((preserve_access_index)); +SEC("socket") +__success __retval(0) __log_level(2) +int basic_alloc3_nosleep(void *ctx) +{ + struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena; + volatile char __arena *pages; + + pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0); + if (!pages) + return 1; + return 0; +} + SEC("syscall") __success __retval(0) __log_level(2) int basic_alloc3(void *ctx) @@ -115,6 +197,38 @@ int basic_alloc3(void *ctx) return 0; } +SEC("socket") +__success __retval(0) +int basic_reserve1_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + char __arena *page; + int ret; + + page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0); + if (!page) + return 1; + + page += __PAGE_SIZE; + + /* Reserve the second page */ + ret = bpf_arena_reserve_pages(&arena, page, 1); + if (ret) + return 2; + + /* Try to explicitly 
allocate the reserved page. */ + page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0); + if (page) + return 3; + + /* Try to implicitly allocate the page (since there's only 2 of them). */ + page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0); + if (page) + return 4; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int basic_reserve1(void *ctx) @@ -147,6 +261,26 @@ int basic_reserve1(void *ctx) return 0; } +SEC("socket") +__success __retval(0) +int basic_reserve2_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + char __arena *page; + int ret; + + page = arena_base(&arena); + ret = bpf_arena_reserve_pages(&arena, page, 1); + if (ret) + return 1; + + page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0); + if ((u64)page) + return 2; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int basic_reserve2(void *ctx) @@ -168,6 +302,27 @@ int basic_reserve2(void *ctx) } /* Reserve the same page twice, should return -EBUSY. */ +SEC("socket") +__success __retval(0) +int reserve_twice_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + char __arena *page; + int ret; + + page = arena_base(&arena); + + ret = bpf_arena_reserve_pages(&arena, page, 1); + if (ret) + return 1; + + ret = bpf_arena_reserve_pages(&arena, page, 1); + if (ret != -EBUSY) + return 2; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int reserve_twice(void *ctx) @@ -190,6 +345,36 @@ int reserve_twice(void *ctx) } /* Try to reserve past the end of the arena. */ +SEC("socket") +__success __retval(0) +int reserve_invalid_region_nosleep(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + char __arena *page; + int ret; + + /* Try a NULL pointer. 
*/ + ret = bpf_arena_reserve_pages(&arena, NULL, 3); + if (ret != -EINVAL) + return 1; + + page = arena_base(&arena); + + ret = bpf_arena_reserve_pages(&arena, page, 3); + if (ret != -EINVAL) + return 2; + + ret = bpf_arena_reserve_pages(&arena, page, 4096); + if (ret != -EINVAL) + return 3; + + ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1); + if (ret != -EINVAL) + return 4; +#endif + return 0; +} + SEC("syscall") __success __retval(0) int reserve_invalid_region(void *ctx) diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c index f19e15400b3e..90bb7c1b2253 100644 --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c @@ -270,5 +270,34 @@ int big_alloc2(void *ctx) return 9; return 0; } + +SEC("socket") +__success __retval(0) +int big_alloc3(void *ctx) +{ +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST) + char __arena *pages; + u64 i; + + /* + * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests. + * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should + * result in three batches: two batches of 1024 pages each, followed by a final batch of 3 + * pages. + */ + pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0); + if (!pages) + return -1; + + bpf_for(i, 0, 2051) + pages[i * PAGE_SIZE] = 123; + bpf_for(i, 0, 2051) + if (pages[i * PAGE_SIZE] != 123) + return i; + + bpf_arena_free_pages(&arena, pages, 2051); +#endif + return 0; +} #endif char _license[] SEC("license") = "GPL"; -- 2.47.1 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() 2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan 2025-11-17 16:06 ` [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan 2025-11-17 16:06 ` [PATCH bpf-next v3 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan @ 2025-11-21 21:15 ` Alexei Starovoitov 2025-12-11 23:45 ` Puranjay Mohan 2 siblings, 1 reply; 9+ messages in thread From: Alexei Starovoitov @ 2025-11-21 21:15 UTC (permalink / raw) To: Puranjay Mohan Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team On Mon, Nov 17, 2025 at 8:06 AM Puranjay Mohan <puranjay@kernel.org> wrote: > > To make arena_alloc_pages() safe to be called from any context, replace > kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any > locks. kmalloc_nolock() returns NULL for allocations larger than > KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with > 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024 > * 8 and reuse the array in a loop. 
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org> > --- > kernel/bpf/arena.c | 83 +++++++++++++++++++++++++++++++--------------- > 1 file changed, 57 insertions(+), 26 deletions(-) > > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index 214a4da54162..1d0b49a39ad0 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c > @@ -43,6 +43,8 @@ > #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1) > #define KERN_VM_SZ (SZ_4G + GUARD_SZ) > > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt); > + > struct bpf_arena { > struct bpf_map map; > u64 user_vm_start; > @@ -492,7 +494,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > /* user_vm_end/start are fixed before bpf prog runs */ > long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT; > u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena); > + struct apply_range_data data; > struct page **pages = NULL; > + long remaining, mapped = 0; > + long alloc_pages; > long pgoff = 0; > u32 uaddr32; > int ret, i; > @@ -509,17 +514,21 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > return 0; > } > > - /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */ > - pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL); > + /* > + * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed. 
> + */ > + alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *)); > + pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE); > if (!pages) > return 0; > + data.pages = pages; > > - guard(mutex)(&arena->lock); > + mutex_lock(&arena->lock); > > if (uaddr) { > ret = is_range_tree_set(&arena->rt, pgoff, page_cnt); > if (ret) > - goto out_free_pages; > + goto out_unlock_free_pages; > ret = range_tree_clear(&arena->rt, pgoff, page_cnt); > } else { > ret = pgoff = range_tree_find(&arena->rt, page_cnt); > @@ -527,34 +536,56 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt > ret = range_tree_clear(&arena->rt, pgoff, page_cnt); > } > if (ret) > - goto out_free_pages; > - > - struct apply_range_data data = { .pages = pages, .i = 0 }; > - ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages); > - if (ret) > - goto out; > + goto out_unlock_free_pages; > > + remaining = page_cnt; > uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE); > - /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1 > - * will not overflow 32-bit. Lower 32-bit need to represent > - * contiguous user address range. > - * Map these pages at kern_vm_start base. > - * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow > - * lower 32-bit and it's ok. > - */ > - ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32, > - page_cnt << PAGE_SHIFT, apply_range_set_cb, &data); > - if (ret) { > - for (i = 0; i < page_cnt; i++) > - __free_page(pages[i]); > - goto out; > + > + while (remaining) { > + long this_batch = min(remaining, alloc_pages); > + > + /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */ > + memset(pages, 0, this_batch * sizeof(struct page *)); > + data.i = 0; Pls move data.i = 0 further down to be done right before apply_to_page_range() since it's one logical operation. Here it's done too early. 
> + > + ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages); > + if (ret) > + goto out; > + > + /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1 Pls reformat the comment as you move them as /* * ... */ > + * will not overflow 32-bit. Lower 32-bit need to represent > + * contiguous user address range. > + * Map these pages at kern_vm_start base. > + * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow > + * lower 32-bit and it's ok. > + */ > + ret = apply_to_page_range(&init_mm, > + kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT), > + this_batch << PAGE_SHIFT, apply_range_set_cb, &data); > + if (ret) { > + /* data.i pages were mapped, account them and free the remaining */ > + mapped += data.i; > + for (i = data.i; i < this_batch; i++) > + __free_page(pages[i]); > + goto out; > + } > + > + mapped += this_batch; > + remaining -= this_batch; > } > - kvfree(pages); > + mutex_unlock(&arena->lock); > + kfree_nolock(pages); > return clear_lo32(arena->user_vm_start) + uaddr32; > out: > - range_tree_set(&arena->rt, pgoff, page_cnt); > + range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped); > + mutex_unlock(&arena->lock); > + if (mapped) > + arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped); This doesn't look right. The first thing arena_free_pages() does is: uaddr = (u32)uaddr; uaddr &= PAGE_MASK; full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; pw-bot: cr ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-11-21 21:15 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Alexei Starovoitov
@ 2025-12-11 23:45   ` Puranjay Mohan
  2025-12-12  0:49     ` Alexei Starovoitov
  0 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2025-12-11 23:45 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Sat, Nov 22, 2025 at 6:15 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Nov 17, 2025 at 8:06 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> > To make arena_alloc_pages() safe to be called from any context, replace
> > kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any
> > locks. kmalloc_nolock() returns NULL for allocations larger than
> > KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
> > 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024
> > * 8 and reuse the array in a loop.
> >
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> >  kernel/bpf/arena.c | 83 +++++++++++++++++++++++++++++++---------------
> >  1 file changed, 57 insertions(+), 26 deletions(-)
> >
> > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> > index 214a4da54162..1d0b49a39ad0 100644
> > --- a/kernel/bpf/arena.c
> > +++ b/kernel/bpf/arena.c
> > @@ -43,6 +43,8 @@
> >  #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
> >  #define KERN_VM_SZ (SZ_4G + GUARD_SZ)
> >
> > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
> > +
> >  struct bpf_arena {
> >         struct bpf_map map;
> >         u64 user_vm_start;
> > @@ -492,7 +494,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> >         /* user_vm_end/start are fixed before bpf prog runs */
> >         long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> >         u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> > +       struct apply_range_data data;
> >         struct page **pages = NULL;
> > +       long remaining, mapped = 0;
> > +       long alloc_pages;
> >         long pgoff = 0;
> >         u32 uaddr32;
> >         int ret, i;
> > @@ -509,17 +514,21 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> >                 return 0;
> >         }
> >
> > -       /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> > -       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> > +       /*
> > +        * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed.
> > +        */
> > +       alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
> > +       pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
> >         if (!pages)
> >                 return 0;
> > +       data.pages = pages;
> >
> > -       guard(mutex)(&arena->lock);
> > +       mutex_lock(&arena->lock);
> >
> >         if (uaddr) {
> >                 ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
> >                 if (ret)
> > -                       goto out_free_pages;
> > +                       goto out_unlock_free_pages;
> >                 ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
> >         } else {
> >                 ret = pgoff = range_tree_find(&arena->rt, page_cnt);
> > @@ -527,34 +536,56 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> >                 ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
> >         }
> >         if (ret)
> > -               goto out_free_pages;
> > -
> > -       struct apply_range_data data = { .pages = pages, .i = 0 };
> > -       ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
> > -       if (ret)
> > -               goto out;
> > +               goto out_unlock_free_pages;
> >
> > +       remaining = page_cnt;
> >         uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> > -       /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
> > -        * will not overflow 32-bit. Lower 32-bit need to represent
> > -        * contiguous user address range.
> > -        * Map these pages at kern_vm_start base.
> > -        * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
> > -        * lower 32-bit and it's ok.
> > -        */
> > -       ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
> > -                                 page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
> > -       if (ret) {
> > -               for (i = 0; i < page_cnt; i++)
> > -                       __free_page(pages[i]);
> > -               goto out;
> > +
> > +       while (remaining) {
> > +               long this_batch = min(remaining, alloc_pages);
> > +
> > +               /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> > +               memset(pages, 0, this_batch * sizeof(struct page *));
> > +               data.i = 0;
>
> Pls move data.i = 0 further down to be done right before
> apply_to_page_range() since it's one logical operation.
> Here it's done too early.
>
> > +
> > +               ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
> > +               if (ret)
> > +                       goto out;
> > +
> > +               /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
>
> Pls reformat the comment as you move them as
> /*
>  * ...
>  */
>
> > +                * will not overflow 32-bit. Lower 32-bit need to represent
> > +                * contiguous user address range.
> > +                * Map these pages at kern_vm_start base.
> > +                * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
> > +                * lower 32-bit and it's ok.
> > +                */
> > +               ret = apply_to_page_range(&init_mm,
> > +                                         kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
> > +                                         this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
> > +               if (ret) {
> > +                       /* data.i pages were mapped, account them and free the remaining */
> > +                       mapped += data.i;
> > +                       for (i = data.i; i < this_batch; i++)
> > +                               __free_page(pages[i]);
> > +                       goto out;
> > +               }
> > +
> > +               mapped += this_batch;
> > +               remaining -= this_batch;
> >         }
> > -       kvfree(pages);
> > +       mutex_unlock(&arena->lock);
> > +       kfree_nolock(pages);
> >         return clear_lo32(arena->user_vm_start) + uaddr32;
> >  out:
> > -       range_tree_set(&arena->rt, pgoff, page_cnt);
> > +       range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
> > +       mutex_unlock(&arena->lock);
> > +       if (mapped)
> > +               arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
>
> This doesn't look right.
> The first thing arena_free_pages() does is:
>         uaddr = (u32)uaddr;
>         uaddr &= PAGE_MASK;
>         full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;

So, arena_free_pages() should be called with what arena_alloc_pages()
returns, we return:

return clear_lo32(arena->user_vm_start) + uaddr32;

from arena_alloc_pages(); few lines above.

This usage looks correct to me.

Thanks,
Puranjay
* Re: [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-12-11 23:45   ` Puranjay Mohan
@ 2025-12-12  0:49     ` Alexei Starovoitov
  0 siblings, 0 replies; 9+ messages in thread
From: Alexei Starovoitov @ 2025-12-12  0:49 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Fri, Dec 12, 2025 at 8:45 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> On Sat, Nov 22, 2025 at 6:15 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Nov 17, 2025 at 8:06 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> > >
> > > To make arena_alloc_pages() safe to be called from any context, replace
> > > kvcalloc() with kmalloc_nolock() so as it doesn't sleep or take any
> > > locks. kmalloc_nolock() returns NULL for allocations larger than
> > > KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
> > > 4KB pages. So, round down the allocation done by kmalloc_nolock to 1024
> > > * 8 and reuse the array in a loop.
> > >
> > > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > > ---
> > >  kernel/bpf/arena.c | 83 +++++++++++++++++++++++++++++++---------------
> > >  1 file changed, 57 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> > > index 214a4da54162..1d0b49a39ad0 100644
> > > --- a/kernel/bpf/arena.c
> > > +++ b/kernel/bpf/arena.c
> > > @@ -43,6 +43,8 @@
> > >  #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
> > >  #define KERN_VM_SZ (SZ_4G + GUARD_SZ)
> > >
> > > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
> > > +
> > >  struct bpf_arena {
> > >         struct bpf_map map;
> > >         u64 user_vm_start;
> > > @@ -492,7 +494,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> > >         /* user_vm_end/start are fixed before bpf prog runs */
> > >         long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
> > >         u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> > > +       struct apply_range_data data;
> > >         struct page **pages = NULL;
> > > +       long remaining, mapped = 0;
> > > +       long alloc_pages;
> > >         long pgoff = 0;
> > >         u32 uaddr32;
> > >         int ret, i;
> > > @@ -509,17 +514,21 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> > >                 return 0;
> > >         }
> > >
> > > -       /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> > > -       pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> > > +       /*
> > > +        * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed.
> > > +        */
> > > +       alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
> > > +       pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
> > >         if (!pages)
> > >                 return 0;
> > > +       data.pages = pages;
> > >
> > > -       guard(mutex)(&arena->lock);
> > > +       mutex_lock(&arena->lock);
> > >
> > >         if (uaddr) {
> > >                 ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
> > >                 if (ret)
> > > -                       goto out_free_pages;
> > > +                       goto out_unlock_free_pages;
> > >                 ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
> > >         } else {
> > >                 ret = pgoff = range_tree_find(&arena->rt, page_cnt);
> > > @@ -527,34 +536,56 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> > >                 ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
> > >         }
> > >         if (ret)
> > > -               goto out_free_pages;
> > > -
> > > -       struct apply_range_data data = { .pages = pages, .i = 0 };
> > > -       ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
> > > -       if (ret)
> > > -               goto out;
> > > +               goto out_unlock_free_pages;
> > >
> > > +       remaining = page_cnt;
> > >         uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
> > > -       /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
> > > -        * will not overflow 32-bit. Lower 32-bit need to represent
> > > -        * contiguous user address range.
> > > -        * Map these pages at kern_vm_start base.
> > > -        * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
> > > -        * lower 32-bit and it's ok.
> > > -        */
> > > -       ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
> > > -                                 page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
> > > -       if (ret) {
> > > -               for (i = 0; i < page_cnt; i++)
> > > -                       __free_page(pages[i]);
> > > -               goto out;
> > > +
> > > +       while (remaining) {
> > > +               long this_batch = min(remaining, alloc_pages);
> > > +
> > > +               /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> > > +               memset(pages, 0, this_batch * sizeof(struct page *));
> > > +               data.i = 0;
> >
> > Pls move data.i = 0 further down to be done right before
> > apply_to_page_range() since it's one logical operation.
> > Here it's done too early.
> >
> > > +
> > > +               ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
> > > +               if (ret)
> > > +                       goto out;
> > > +
> > > +               /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
> >
> > Pls reformat the comment as you move them as
> > /*
> >  * ...
> >  */
> >
> > > +                * will not overflow 32-bit. Lower 32-bit need to represent
> > > +                * contiguous user address range.
> > > +                * Map these pages at kern_vm_start base.
> > > +                * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
> > > +                * lower 32-bit and it's ok.
> > > +                */
> > > +               ret = apply_to_page_range(&init_mm,
> > > +                                         kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
> > > +                                         this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
> > > +               if (ret) {
> > > +                       /* data.i pages were mapped, account them and free the remaining */
> > > +                       mapped += data.i;
> > > +                       for (i = data.i; i < this_batch; i++)
> > > +                               __free_page(pages[i]);
> > > +                       goto out;
> > > +               }
> > > +
> > > +               mapped += this_batch;
> > > +               remaining -= this_batch;
> > >         }
> > > -       kvfree(pages);
> > > +       mutex_unlock(&arena->lock);
> > > +       kfree_nolock(pages);
> > >         return clear_lo32(arena->user_vm_start) + uaddr32;
> > >  out:
> > > -       range_tree_set(&arena->rt, pgoff, page_cnt);
> > > +       range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
> > > +       mutex_unlock(&arena->lock);
> > > +       if (mapped)
> > > +               arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
> >
> > This doesn't look right.
> > The first thing arena_free_pages() does is:
> >         uaddr = (u32)uaddr;
> >         uaddr &= PAGE_MASK;
> >         full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
>
> So, arena_free_pages() should be called with what arena_alloc_pages()
> returns, we return:
>
> return clear_lo32(arena->user_vm_start) + uaddr32;
>
> from arena_alloc_pages(); few lines above.

yes, that's what this kfunc returns to bpf prog,
but we shouldn't do pointless math to call arena_free_pages().
Just arena_free_pages(..., uaddr32, ...) will do.
end of thread, other threads:[~2025-12-12  0:50 UTC | newest]

Thread overview: 9+ messages
2025-11-17 16:01 [PATCH bpf-next v3 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-11-17 16:01 ` [PATCH bpf-next v3 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-11-17 16:06 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-11-17 16:06 ` [PATCH bpf-next v3 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-11-21 21:25   ` Alexei Starovoitov
2025-11-17 16:06 ` [PATCH bpf-next v3 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
2025-11-21 21:15 ` [PATCH bpf-next v3 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Alexei Starovoitov
2025-12-11 23:45   ` Puranjay Mohan
2025-12-12  0:49     ` Alexei Starovoitov