* [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs
@ 2025-12-22 19:50 Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
v7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in v7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3
v6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1 that was previously being fixed in patch 2; move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings as it is
required by some architectures.
v5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
- Add a missing ; to make sure this patch builds individually. (AI)
v4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
- Fix a memory leak in arena_alloc_pages(); it was being fixed in
Patch 3, but every patch should be complete in itself. (AI)
Patch 3:
- Don't do useless addition in arena_alloc_pages() (Alexei)
- Add a comment about kmalloc_nolock() failure and expectations.
v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
- Coding style changes related to comments in Patch 2/3 (Alexei)
v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
- Call range_tree_destroy() in error path of
populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
- Fix double mutex_unlock() in the error path of
arena_alloc_pages() (AI)
- Fix coding style issues (Alexei)
Patch 3:
- Unlock spinlock before returning from arena_vm_fault() in case
BPF_F_SEGV_ON_FAULT is set by user. (AI)
- Use __llist_del_all() in place of llist_del_all() for the on-stack
llist (free_pages) (Alexei)
- Fix build issues on 32-bit systems where arena.c is not compiled.
(kernel test robot)
- Make bpf_arena_alloc_pages() polymorphic so it knows if it has
been called in sleepable or non-sleepable context. This
information is passed to arena_free_pages() in the error path.
Patch 4:
- Add a better comment for the big_alloc3() test that triggers
kmalloc_nolock()'s limit and checks whether bpf_arena_alloc_pages()
works correctly above this limit.
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Include tlbflush.h to fix a build issue on loongarch. (kernel
test robot)
- Fix unused variable error in apply_range_clear_cb() (kernel
test robot)
- Call bpf_map_area_free() on error path of
populate_pgtable_except_pte() (AI)
- Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
- Cap allocation made by kmalloc_nolock() for pages array to
KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1) under the spinlock to mitigate a
race (AI)
Patch 4:
- Add a new testcase, big_alloc3(), in verifier_arena_large.c that
tries to allocate a large number of pages at once; this triggers
the kmalloc_nolock() limit in Patch 2 and checks that the loop
logic works correctly.
This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:
The range_tree is now protected with an rqspinlock instead of a mutex;
this change alone is enough to make bpf_arena_reserve_pages() any-context
safe.
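Unlike a mutex, an rqspinlock acquisition can itself fail, so every caller
needs an error path. A minimal sketch of the pattern (arena_reserve_sketch()
is a made-up name for illustration; the real code is in patch 3 below):

static int arena_reserve_sketch(struct bpf_arena *arena, long pgoff, u32 page_cnt)
{
	unsigned long flags;
	int ret;

	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
		return -EBUSY;	/* could not take the lock; bail instead of sleeping */
	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
	return ret;
}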
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into the
bpf arena's vm_area. Now, at arena creation time, all page table levels
except the last are populated; when new pages need to be inserted,
apply_to_page_range() is called again, which only does set_pte_at() for
those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any-context safe.
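The fourth change is the least obvious one; a condensed sketch of the
mechanism from patch 1 below (error handling trimmed, names as in the
patch):

static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
{
	struct apply_range_data *d = data;

	if (!d)		/* creation-time pass: only build intermediate levels */
		return 0;
	if (unlikely(!pte_none(ptep_get(pte))))
		return -EBUSY;
	set_pte_at(&init_mm, addr, pte, mk_pte(d->pages[d->i++], PAGE_KERNEL));
	return 0;
}

/* at arena creation: allocate every level except the PTE entries themselves */
apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
		    KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);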
bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. flush the TLB: already done in step 2.
4. zap_pages(): to unmap pages from user page tables
5. free pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except for the following:
1. the rqspinlock is now taken instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, a lock-less list of
struct arena_free_span is used to queue the uaddr and page count.
kmalloc_nolock() is used to allocate this arena_free_span; this can fail,
but we accept this trade-off for frees done from non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages,
and frees pages for the queued uaddrs and page counts.
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used for this.
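Condensed, that deferral path looks roughly like this; defer_free() is a
name invented for this sketch (in the patch the logic lives under the
defer: label of arena_free_pages()), and locking, TLB flushes, and
range-tree updates are omitted:

struct arena_free_span {
	struct llist_node node;
	unsigned long uaddr;
	u32 page_cnt;
};

static void defer_free(struct bpf_arena *arena, unsigned long uaddr, u32 page_cnt)
{
	struct arena_free_span *s = kmalloc_nolock(sizeof(*s), 0, -1);

	if (!s)		/* pages stay inaccessible until the arena is destroyed */
		return;
	s->uaddr = uaddr;
	s->page_cnt = page_cnt;
	llist_add(&s->node, &arena->free_spans);
	irq_work_queue(&arena->free_irq);	/* handler just calls schedule_work() */
}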
Puranjay Mohan (4):
bpf: arena: populate vm_area without allocating memory
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
bpf: arena: make arena kfuncs any context safe
selftests: bpf: test non-sleepable arena allocations
include/linux/bpf.h | 16 +
kernel/bpf/arena.c | 380 +++++++++++++++---
kernel/bpf/verifier.c | 10 +
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 +
.../selftests/bpf/progs/verifier_arena.c | 185 +++++++++
.../bpf/progs/verifier_arena_large.c | 29 ++
7 files changed, 592 insertions(+), 59 deletions(-)
base-commit: f785a31395d9cafb8b2c42c7358fad72a6463142
--
2.47.3
* [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
` (2 subsequent siblings)
3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
vm_area_map_pages() may allocate memory while inserting pages into the bpf
arena's vm_area. To make the bpf_arena_alloc_pages() kfunc callable from
non-sleepable contexts, change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
the last level
- when new pages need to be inserted call apply_to_page_range() again
with apply_range_set_cb(), which will only call set_pte_at() for those
pages and will not allocate memory.
- when freeing pages call apply_to_existing_page_range() with
apply_range_clear_cb() to clear the pte for the page to be removed. This
doesn't free intermediate page table levels.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
kernel/bpf/arena.c | 100 ++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 90 insertions(+), 10 deletions(-)
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 872dc0e41c65..55b198b9f1a3 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -2,11 +2,13 @@
/* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
#include <linux/bpf.h>
#include <linux/btf.h>
+#include <linux/cacheflush.h>
#include <linux/err.h>
#include "linux/filter.h"
#include <linux/btf_ids.h>
#include <linux/vmalloc.h>
#include <linux/pagemap.h>
+#include <asm/tlbflush.h>
#include "range_tree.h"
/*
@@ -92,6 +94,68 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
}
+struct apply_range_data {
+ struct page **pages;
+ int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+ struct apply_range_data *d = data;
+ struct page *page;
+
+ if (!data)
+ return 0;
+ /* sanity check */
+ if (unlikely(!pte_none(ptep_get(pte))))
+ return -EBUSY;
+
+ page = d->pages[d->i];
+ /* paranoia, similar to vmap_pages_pte_range() */
+ if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+ return -EINVAL;
+
+ set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+ d->i++;
+ return 0;
+}
+
+static void flush_vmap_cache(unsigned long start, unsigned long size)
+{
+ flush_cache_vmap(start, start + size);
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+ pte_t old_pte;
+ struct page *page;
+
+ /* sanity check */
+ old_pte = ptep_get(pte);
+ if (pte_none(old_pte) || !pte_present(old_pte))
+ return 0; /* nothing to do */
+
+ /* get page and free it */
+ page = pte_page(old_pte);
+ if (WARN_ON_ONCE(!page))
+ return -EINVAL;
+
+ pte_clear(&init_mm, addr, pte);
+
+ /* ensure no stale TLB entries */
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+ __free_page(page);
+
+ return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+ return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+ KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
{
struct vm_struct *kern_vm;
@@ -144,6 +208,12 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
goto err;
}
mutex_init(&arena->lock);
+ err = populate_pgtable_except_pte(arena);
+ if (err) {
+ range_tree_destroy(&arena->rt);
+ bpf_map_area_free(arena);
+ goto err;
+ }
return &arena->map;
err:
@@ -286,6 +356,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
if (ret)
return VM_FAULT_SIGSEGV;
+ struct apply_range_data data = { .pages = &page, .i = 0 };
/* Account into memcg of the process that created bpf_arena */
ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
if (ret) {
@@ -293,12 +364,13 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
return VM_FAULT_SIGSEGV;
}
- ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+ ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
if (ret) {
range_tree_set(&arena->rt, vmf->pgoff, 1);
__free_page(page);
return VM_FAULT_SIGSEGV;
}
+ flush_vmap_cache(kaddr, PAGE_SIZE);
out:
page_ref_add(page, 1);
vmf->page = page;
@@ -428,7 +500,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
/* user_vm_end/start are fixed before bpf prog runs */
long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
- struct page **pages;
+ struct page **pages = NULL;
+ long mapped = 0;
long pgoff = 0;
u32 uaddr32;
int ret, i;
@@ -450,7 +523,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
if (!pages)
return 0;
- guard(mutex)(&arena->lock);
+ mutex_lock(&arena->lock);
if (uaddr) {
ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -465,6 +538,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
if (ret)
goto out_free_pages;
+ struct apply_range_data data = { .pages = pages, .i = 0 };
ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
if (ret)
goto out;
@@ -477,18 +551,24 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
* kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
* lower 32-bit and it's ok.
*/
- ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
- kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
- if (ret) {
- for (i = 0; i < page_cnt; i++)
+ apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+ page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
+ mapped = data.i;
+ flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
+ if (mapped < page_cnt) {
+ for (i = mapped; i < page_cnt; i++)
__free_page(pages[i]);
goto out;
}
+ mutex_unlock(&arena->lock);
kvfree(pages);
return clear_lo32(arena->user_vm_start) + uaddr32;
out:
- range_tree_set(&arena->rt, pgoff, page_cnt);
+ range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
out_free_pages:
+ mutex_unlock(&arena->lock);
+ if (mapped)
+ arena_free_pages(arena, uaddr32, mapped);
kvfree(pages);
return 0;
}
@@ -545,8 +625,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
* page_cnt is big it's faster to do the batched zap.
*/
zap_pages(arena, full_uaddr, 1);
- vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
- __free_page(page);
+ apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
+ NULL);
}
}
--
2.47.3
* [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
To make arena_alloc_pages() safe to call from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, cap the allocation done by kmalloc_nolock() at 1024 * 8
bytes (1024 page pointers) and reuse the array in a loop.
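For concreteness, a standalone userspace sketch of the batching arithmetic
(it assumes 4KB pages and 8-byte pointers; the real loop is in the diff
below):

#include <stdio.h>

#define KMALLOC_MAX_CACHE_SIZE	8192	/* PAGE_SIZE * 2 on 4KB-page systems */

int main(void)
{
	long page_cnt = 2051;	/* matches the big_alloc3 selftest in patch 4 */
	long cap = KMALLOC_MAX_CACHE_SIZE / sizeof(void *);	/* 1024 entries */
	long remaining = page_cnt;

	while (remaining) {
		long batch = remaining < cap ? remaining : cap;

		printf("batch of %ld pages\n", batch);	/* prints 1024, 1024, 3 */
		remaining -= batch;
	}
	return 0;
}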
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
kernel/bpf/arena.c | 84 ++++++++++++++++++++++++++++++----------------
1 file changed, 55 insertions(+), 29 deletions(-)
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 55b198b9f1a3..128efb68d47b 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -44,6 +44,8 @@
#define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
#define KERN_VM_SZ (SZ_4G + GUARD_SZ)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+
struct bpf_arena {
struct bpf_map map;
u64 user_vm_start;
@@ -500,8 +502,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
/* user_vm_end/start are fixed before bpf prog runs */
long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+ struct apply_range_data data;
struct page **pages = NULL;
- long mapped = 0;
+ long remaining, mapped = 0;
+ long alloc_pages;
long pgoff = 0;
u32 uaddr32;
int ret, i;
@@ -518,17 +522,19 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
return 0;
}
- /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
- pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+ /* Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed. */
+ alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+ pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
if (!pages)
return 0;
+ data.pages = pages;
mutex_lock(&arena->lock);
if (uaddr) {
ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
if (ret)
- goto out_free_pages;
+ goto out_unlock_free_pages;
ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
} else {
ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -536,40 +542,60 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
}
if (ret)
- goto out_free_pages;
-
- struct apply_range_data data = { .pages = pages, .i = 0 };
- ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
- if (ret)
- goto out;
+ goto out_unlock_free_pages;
+ remaining = page_cnt;
uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
- /* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
- * will not overflow 32-bit. Lower 32-bit need to represent
- * contiguous user address range.
- * Map these pages at kern_vm_start base.
- * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
- * lower 32-bit and it's ok.
- */
- apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
- page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
- mapped = data.i;
- flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
- if (mapped < page_cnt) {
- for (i = mapped; i < page_cnt; i++)
- __free_page(pages[i]);
- goto out;
+
+ while (remaining) {
+ long this_batch = min(remaining, alloc_pages);
+
+ /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+ memset(pages, 0, this_batch * sizeof(struct page *));
+
+ ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+ if (ret)
+ goto out;
+
+ /*
+ * Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+ * will not overflow 32-bit. Lower 32-bit need to represent
+ * contiguous user address range.
+ * Map these pages at kern_vm_start base.
+ * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+ * lower 32-bit and it's ok.
+ */
+ data.i = 0;
+ ret = apply_to_page_range(&init_mm,
+ kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+ this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+ if (ret) {
+ /* data.i pages were mapped, account them and free the remaining */
+ mapped += data.i;
+ for (i = data.i; i < this_batch; i++)
+ __free_page(pages[i]);
+ goto out;
+ }
+
+ mapped += this_batch;
+ remaining -= this_batch;
}
+ flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
mutex_unlock(&arena->lock);
- kvfree(pages);
+ kfree_nolock(pages);
return clear_lo32(arena->user_vm_start) + uaddr32;
out:
range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
-out_free_pages:
mutex_unlock(&arena->lock);
- if (mapped)
+ if (mapped) {
+ flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
arena_free_pages(arena, uaddr32, mapped);
- kvfree(pages);
+ }
+ goto out_free_pages;
+out_unlock_free_pages:
+ mutex_unlock(&arena->lock);
+out_free_pages:
+ kfree_nolock(pages);
return 0;
}
--
2.47.3
* [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
Make arena-related kfuncs any-context safe with the following changes:
bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the mutex with an rqspinlock for the range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any-context safe.
bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.
specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.
In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_span
and raises an irq_work. The irq_work handler calls schedule_work() as
it is safe to call from irq context. arena_free_worker() (the workqueue
handler) iterates these spans and clears PTEs, flushes the TLB, zaps
pages, and calls __free_page().
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
include/linux/bpf.h | 16 +++
kernel/bpf/arena.c | 248 +++++++++++++++++++++++++++++++++++-------
kernel/bpf/verifier.c | 10 ++
3 files changed, 233 insertions(+), 41 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index da6a00dd313f..4e7d72dfbcd4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,22 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
struct bpf_dynptr *ptr__uninit);
+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id,
+ u64 flags);
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+#else
+static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+ int node_id, u64 flags)
+{
+ return NULL;
+}
+
+static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+}
+#endif
+
extern const struct bpf_map_ops bpf_map_offload_ops;
/* bpf_type_flag contains a set of flags that are applicable to the values of
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 128efb68d47b..456ac989269d 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -4,7 +4,9 @@
#include <linux/btf.h>
#include <linux/cacheflush.h>
#include <linux/err.h>
+#include <linux/irq_work.h>
#include "linux/filter.h"
+#include <linux/llist.h>
#include <linux/btf_ids.h>
#include <linux/vmalloc.h>
#include <linux/pagemap.h>
@@ -44,7 +46,7 @@
#define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
#define KERN_VM_SZ (SZ_4G + GUARD_SZ)
-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable);
struct bpf_arena {
struct bpf_map map;
@@ -52,8 +54,23 @@ struct bpf_arena {
u64 user_vm_end;
struct vm_struct *kern_vm;
struct range_tree rt;
+ /* protects rt */
+ rqspinlock_t spinlock;
struct list_head vma_list;
+ /* protects vma_list */
struct mutex lock;
+ struct irq_work free_irq;
+ struct work_struct free_work;
+ struct llist_head free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+ struct llist_node node;
+ unsigned long uaddr;
+ u32 page_cnt;
};
u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
@@ -127,7 +144,7 @@ static void flush_vmap_cache(unsigned long start, unsigned long size)
flush_cache_vmap(start, start + size);
}
-static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
{
pte_t old_pte;
struct page *page;
@@ -137,17 +154,15 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
if (pte_none(old_pte) || !pte_present(old_pte))
return 0; /* nothing to do */
- /* get page and free it */
page = pte_page(old_pte);
if (WARN_ON_ONCE(!page))
return -EINVAL;
pte_clear(&init_mm, addr, pte);
- /* ensure no stale TLB entries */
- flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
- __free_page(page);
+ /* Add page to the list so it is freed later */
+ if (free_pages)
+ __llist_add(&page->pcp_llist, free_pages);
return 0;
}
@@ -202,6 +217,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
arena->user_vm_end = arena->user_vm_start + vm_range;
INIT_LIST_HEAD(&arena->vma_list);
+ init_llist_head(&arena->free_spans);
+ init_irq_work(&arena->free_irq, arena_free_irq);
+ INIT_WORK(&arena->free_work, arena_free_worker);
bpf_map_init_from_attr(&arena->map, attr);
range_tree_init(&arena->rt);
err = range_tree_set(&arena->rt, 0, attr->max_entries);
@@ -210,6 +228,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
goto err;
}
mutex_init(&arena->lock);
+ raw_res_spin_lock_init(&arena->spinlock);
err = populate_pgtable_except_pte(arena);
if (err) {
range_tree_destroy(&arena->rt);
@@ -256,6 +275,10 @@ static void arena_map_free(struct bpf_map *map)
if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
return;
+ /* Ensure no pending deferred frees */
+ irq_work_sync(&arena->free_irq);
+ flush_work(&arena->free_work);
+
/*
* free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
* It unmaps everything from vmalloc area and clears pgtables.
@@ -339,12 +362,16 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
struct page *page;
long kbase, kaddr;
+ unsigned long flags;
int ret;
kbase = bpf_arena_get_kern_vm_start(arena);
kaddr = kbase + (u32)(vmf->address);
- guard(mutex)(&arena->lock);
+ if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+ /* Make a reasonable effort to address impossible case */
+ return VM_FAULT_RETRY;
+
page = vmalloc_to_page((void *)kaddr);
if (page)
/* already have a page vmap-ed */
@@ -352,31 +379,35 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
/* User space requested to segfault when page is not allocated by bpf prog */
- return VM_FAULT_SIGSEGV;
+ goto out_unlock_sigsegv;
ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
if (ret)
- return VM_FAULT_SIGSEGV;
+ goto out_unlock_sigsegv;
struct apply_range_data data = { .pages = &page, .i = 0 };
/* Account into memcg of the process that created bpf_arena */
ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
if (ret) {
range_tree_set(&arena->rt, vmf->pgoff, 1);
- return VM_FAULT_SIGSEGV;
+ goto out_unlock_sigsegv;
}
ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
if (ret) {
range_tree_set(&arena->rt, vmf->pgoff, 1);
- __free_page(page);
- return VM_FAULT_SIGSEGV;
+ free_pages_nolock(page, 0);
+ goto out_unlock_sigsegv;
}
flush_vmap_cache(kaddr, PAGE_SIZE);
out:
page_ref_add(page, 1);
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
vmf->page = page;
return 0;
+out_unlock_sigsegv:
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+ return VM_FAULT_SIGSEGV;
}
static const struct vm_operations_struct arena_vm_ops = {
@@ -497,7 +528,8 @@ static u64 clear_lo32(u64 val)
* Allocate pages and vmap them into kernel vmalloc area.
* Later the pages will be mmaped into user space vma.
*/
-static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
+static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id,
+ bool sleepable)
{
/* user_vm_end/start are fixed before bpf prog runs */
long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
@@ -506,6 +538,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
struct page **pages = NULL;
long remaining, mapped = 0;
long alloc_pages;
+ unsigned long flags;
long pgoff = 0;
u32 uaddr32;
int ret, i;
@@ -529,7 +562,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
return 0;
data.pages = pages;
- mutex_lock(&arena->lock);
+ if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+ goto out_free_pages;
if (uaddr) {
ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -573,7 +607,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
/* data.i pages were mapped, account them and free the remaining */
mapped += data.i;
for (i = data.i; i < this_batch; i++)
- __free_page(pages[i]);
+ free_pages_nolock(pages[i], 0);
goto out;
}
@@ -581,19 +615,19 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
remaining -= this_batch;
}
flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
- mutex_unlock(&arena->lock);
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
kfree_nolock(pages);
return clear_lo32(arena->user_vm_start) + uaddr32;
out:
range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
- mutex_unlock(&arena->lock);
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
if (mapped) {
flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
- arena_free_pages(arena, uaddr32, mapped);
+ arena_free_pages(arena, uaddr32, mapped, sleepable);
}
goto out_free_pages;
out_unlock_free_pages:
- mutex_unlock(&arena->lock);
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
out_free_pages:
kfree_nolock(pages);
return 0;
@@ -608,42 +642,64 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
{
struct vma_list *vml;
+ guard(mutex)(&arena->lock);
+ /* iterate link list under lock */
list_for_each_entry(vml, &arena->vma_list, head)
zap_page_range_single(vml->vma, uaddr,
PAGE_SIZE * page_cnt, NULL);
}
-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
{
u64 full_uaddr, uaddr_end;
- long kaddr, pgoff, i;
+ long kaddr, pgoff;
struct page *page;
+ struct llist_head free_pages;
+ struct llist_node *pos, *t;
+ struct arena_free_span *s;
+ unsigned long flags;
+ int ret = 0;
/* only aligned lower 32-bit are relevant */
uaddr = (u32)uaddr;
uaddr &= PAGE_MASK;
+ kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
if (full_uaddr >= uaddr_end)
return;
page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+ pgoff = compute_pgoff(arena, uaddr);
- guard(mutex)(&arena->lock);
+ if (!sleepable)
+ goto defer;
+
+ ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+
+ /* Can't proceed without holding the spinlock so defer the free */
+ if (ret)
+ goto defer;
- pgoff = compute_pgoff(arena, uaddr);
- /* clear range */
range_tree_set(&arena->rt, pgoff, page_cnt);
+ init_llist_head(&free_pages);
+ /* clear ptes and collect struct pages */
+ apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+ apply_range_clear_cb, &free_pages);
+
+ /* drop the lock to do the tlb flush and zap pages */
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+ /* ensure no stale TLB entries */
+ flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
if (page_cnt > 1)
/* bulk zap if multiple pages being freed */
zap_pages(arena, full_uaddr, page_cnt);
- kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
- for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
- page = vmalloc_to_page((void *)kaddr);
- if (!page)
- continue;
+ llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+ page = llist_entry(pos, struct page, pcp_llist);
if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
/* Optimization for the common case of page_cnt==1:
* If page wasn't mapped into some user vma there
@@ -651,9 +707,25 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
* page_cnt is big it's faster to do the batched zap.
*/
zap_pages(arena, full_uaddr, 1);
- apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
- NULL);
+ __free_page(page);
}
+
+ return;
+
+defer:
+ s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+ if (!s)
+ /*
+ * If allocation fails in non-sleepable context, pages are intentionally left
+ * inaccessible (leaked) until the arena is destroyed. Cleanup or retries are not
+ * possible here, so we intentionally omit them for safety.
+ */
+ return;
+
+ s->page_cnt = page_cnt;
+ s->uaddr = uaddr;
+ llist_add(&s->node, &arena->free_spans);
+ irq_work_queue(&arena->free_irq);
}
/*
@@ -663,6 +735,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
{
long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+ unsigned long flags;
long pgoff;
int ret;
@@ -673,15 +746,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt
if (pgoff + page_cnt > page_cnt_max)
return -EINVAL;
- guard(mutex)(&arena->lock);
+ if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+ return -EBUSY;
/* Cannot guard already allocated pages. */
ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
- if (ret)
- return -EBUSY;
+ if (ret) {
+ ret = -EBUSY;
+ goto out;
+ }
/* "Allocate" the region to prevent it from being allocated. */
- return range_tree_clear(&arena->rt, pgoff, page_cnt);
+ ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+ return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+ struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+ struct llist_node *list, *pos, *t;
+ struct arena_free_span *s;
+ u64 arena_vm_start, user_vm_start;
+ struct llist_head free_pages;
+ struct page *page;
+ unsigned long full_uaddr;
+ long kaddr, page_cnt, pgoff;
+ unsigned long flags;
+
+ if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+ schedule_work(work);
+ return;
+ }
+
+ init_llist_head(&free_pages);
+ arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+ user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+ list = llist_del_all(&arena->free_spans);
+ llist_for_each(pos, list) {
+ s = llist_entry(pos, struct arena_free_span, node);
+ page_cnt = s->page_cnt;
+ kaddr = arena_vm_start + s->uaddr;
+ pgoff = compute_pgoff(arena, s->uaddr);
+
+ /* clear ptes and collect pages in free_pages llist */
+ apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+ apply_range_clear_cb, &free_pages);
+
+ range_tree_set(&arena->rt, pgoff, page_cnt);
+ }
+ raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+ /* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+ llist_for_each_safe(pos, t, list) {
+ s = llist_entry(pos, struct arena_free_span, node);
+ page_cnt = s->page_cnt;
+ full_uaddr = clear_lo32(user_vm_start) + s->uaddr;
+ kaddr = arena_vm_start + s->uaddr;
+
+ /* ensure no stale TLB entries */
+ flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+ /* remove pages from user vmas */
+ zap_pages(arena, full_uaddr, page_cnt);
+
+ kfree_nolock(s);
+ }
+
+ /* free all pages collected by apply_to_existing_page_range() in the first loop */
+ llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+ page = llist_entry(pos, struct page, pcp_llist);
+ __free_page(page);
+ }
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+ struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+ schedule_work(&arena->free_work);
}
__bpf_kfunc_start_defs();
@@ -695,9 +840,20 @@ __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_
if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
return NULL;
- return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
+ return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, true);
}
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+ int node_id, u64 flags)
+{
+ struct bpf_map *map = p__map;
+ struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+ if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
+ return NULL;
+
+ return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false);
+}
__bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
{
struct bpf_map *map = p__map;
@@ -705,7 +861,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt
if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
return;
- arena_free_pages(arena, (long)ptr__ign, page_cnt);
+ arena_free_pages(arena, (long)ptr__ign, page_cnt, true);
+}
+
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+ struct bpf_map *map = p__map;
+ struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+ if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
+ return;
+ arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
}
__bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
@@ -724,9 +890,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(arena_kfuncs)
-BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
BTF_KFUNCS_END(arena_kfuncs)
static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6b8a77fbe3b..2de1a736ef69 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12380,6 +12380,8 @@ enum special_kfunc_type {
KF___bpf_trap,
KF_bpf_task_work_schedule_signal_impl,
KF_bpf_task_work_schedule_resume_impl,
+ KF_bpf_arena_alloc_pages,
+ KF_bpf_arena_free_pages,
};
BTF_ID_LIST(special_kfunc_list)
@@ -12454,6 +12456,8 @@ BTF_ID(func, bpf_dynptr_file_discard)
BTF_ID(func, __bpf_trap)
BTF_ID(func, bpf_task_work_schedule_signal_impl)
BTF_ID(func, bpf_task_work_schedule_resume_impl)
+BTF_ID(func, bpf_arena_alloc_pages)
+BTF_ID(func, bpf_arena_free_pages)
static bool is_task_work_add_kfunc(u32 func_id)
{
@@ -22432,6 +22436,12 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc
} else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) {
if (!env->insn_aux_data[insn_idx].non_sleepable)
addr = (unsigned long)bpf_dynptr_from_file_sleepable;
+ } else if (func_id == special_kfunc_list[KF_bpf_arena_alloc_pages]) {
+ if (env->insn_aux_data[insn_idx].non_sleepable)
+ addr = (unsigned long)bpf_arena_alloc_pages_non_sleepable;
+ } else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) {
+ if (env->insn_aux_data[insn_idx].non_sleepable)
+ addr = (unsigned long)bpf_arena_free_pages_non_sleepable;
}
desc->addr = addr;
return 0;
--
2.47.3
* [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
` (2 preceding siblings ...)
2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
2025-12-23 5:03 ` Alexei Starovoitov
3 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
As arena kfuncs can now be called from non-sleepable contexts, test this
by adding non-sleepable copies of the tests in verifier_arena; this is
done by using a socket program instead of a syscall program.
Add a new test case in verifier_arena_large to check that
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock(), but
bpf_arena_alloc_pages() should still succeed because it reuses this
array in a loop.
Augment the arena_list selftest to also run in a non-sleepable context by
taking an RCU read lock.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 ++
.../selftests/bpf/progs/verifier_arena.c | 185 ++++++++++++++++++
.../bpf/progs/verifier_arena_large.c | 29 +++
4 files changed, 240 insertions(+), 5 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
index d15867cddde0..4f2866a615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head)
return sum;
}
-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
{
LIBBPF_OPTS(bpf_test_run_opts, opts);
struct arena_list *skel;
int expected_sum = (u64)cnt * (cnt - 1) / 2;
int ret, sum;
- skel = arena_list__open_and_load();
- if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+ skel = arena_list__open();
+ if (!ASSERT_OK_PTR(skel, "arena_list__open"))
return;
+ skel->rodata->nonsleepable = nonsleepable;
+
+ ret = arena_list__load(skel);
+ if (!ASSERT_OK(ret, "arena_list__load"))
+ goto out;
+
skel->bss->cnt = cnt;
ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
ASSERT_OK(ret, "ret_add");
@@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt)
void test_arena_list(void)
{
if (test__start_subtest("arena_list_1"))
- test_arena_list_add_del(1);
+ test_arena_list_add_del(1, false);
if (test__start_subtest("arena_list_1000"))
- test_arena_list_add_del(1000);
+ test_arena_list_add_del(1000, false);
+ if (test__start_subtest("arena_list_1_nonsleepable"))
+ test_arena_list_add_del(1, true);
+ if (test__start_subtest("arena_list_1000_nonsleepable"))
+ test_arena_list_add_del(1000, true);
}
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
index 3a2ddcacbea6..235d8cc95bdd 100644
--- a/tools/testing/selftests/bpf/progs/arena_list.c
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head;
int list_sum;
int cnt;
bool skip = false;
+const volatile bool nonsleepable = false;
#ifdef __BPF_FEATURE_ADDR_SPACE_CAST
long __arena arena_sum;
@@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1");
int zero;
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
SEC("syscall")
int arena_list_add(void *ctx)
{
@@ -71,6 +75,10 @@ int arena_list_del(void *ctx)
struct elem __arena *n;
int sum = 0;
+ /* Take rcu_read_lock to test non-sleepable context */
+ if (nonsleepable)
+ bpf_rcu_read_lock();
+
arena_sum = 0;
list_for_each_entry(n, list_head, node) {
sum += n->value;
@@ -79,6 +87,9 @@ int arena_list_del(void *ctx)
bpf_free(n);
}
list_sum = sum;
+
+ if (nonsleepable)
+ bpf_rcu_read_unlock();
#else
skip = true;
#endif
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
index 7f4827eede3c..4a9d96344813 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -21,6 +21,37 @@ struct {
#endif
} arena SEC(".maps");
+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ volatile int __arena *page1, *page2, *no_page;
+
+ page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (!page1)
+ return 1;
+ *page1 = 1;
+ page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (!page2)
+ return 2;
+ *page2 = 2;
+ no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (no_page)
+ return 3;
+ if (*page1 != 1)
+ return 4;
+ if (*page2 != 2)
+ return 5;
+ bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+ if (*page1 != 1)
+ return 6;
+ if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+ return 7;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int basic_alloc1(void *ctx)
@@ -60,6 +91,44 @@ int basic_alloc1(void *ctx)
return 0;
}
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ volatile char __arena *page1, *page2, *page3, *page4;
+
+ page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+ if (!page1)
+ return 1;
+ page2 = page1 + __PAGE_SIZE;
+ page3 = page1 + __PAGE_SIZE * 2;
+ page4 = page1 - __PAGE_SIZE;
+ *page1 = 1;
+ *page2 = 2;
+ *page3 = 3;
+ *page4 = 4;
+ if (*page1 != 1)
+ return 1;
+ if (*page2 != 2)
+ return 2;
+ if (*page3 != 0)
+ return 3;
+ if (*page4 != 0)
+ return 4;
+ bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+ if (*page1 != 0 && *page1 != 1)
+ return 5;
+ if (*page2 != 0 && *page2 != 2)
+ return 6;
+ if (*page3 != 0)
+ return 7;
+ if (*page4 != 0)
+ return 8;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int basic_alloc2(void *ctx)
@@ -102,6 +171,19 @@ struct bpf_arena___l {
struct bpf_map map;
} __attribute__((preserve_access_index));
+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+ struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+ volatile char __arena *pages;
+
+ pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+ if (!pages)
+ return 1;
+ return 0;
+}
+
SEC("syscall")
__success __retval(0) __log_level(2)
int basic_alloc3(void *ctx)
@@ -115,6 +197,38 @@ int basic_alloc3(void *ctx)
return 0;
}
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ char __arena *page;
+ int ret;
+
+ page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (!page)
+ return 1;
+
+ page += __PAGE_SIZE;
+
+ /* Reserve the second page */
+ ret = bpf_arena_reserve_pages(&arena, page, 1);
+ if (ret)
+ return 2;
+
+ /* Try to explicitly allocate the reserved page. */
+ page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+ if (page)
+ return 3;
+
+ /* Try to implicitly allocate the page (since there's only 2 of them). */
+ page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+ if (page)
+ return 4;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int basic_reserve1(void *ctx)
@@ -147,6 +261,26 @@ int basic_reserve1(void *ctx)
return 0;
}
+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ char __arena *page;
+ int ret;
+
+ page = arena_base(&arena);
+ ret = bpf_arena_reserve_pages(&arena, page, 1);
+ if (ret)
+ return 1;
+
+ page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+ if ((u64)page)
+ return 2;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int basic_reserve2(void *ctx)
@@ -168,6 +302,27 @@ int basic_reserve2(void *ctx)
}
/* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ char __arena *page;
+ int ret;
+
+ page = arena_base(&arena);
+
+ ret = bpf_arena_reserve_pages(&arena, page, 1);
+ if (ret)
+ return 1;
+
+ ret = bpf_arena_reserve_pages(&arena, page, 1);
+ if (ret != -EBUSY)
+ return 2;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int reserve_twice(void *ctx)
@@ -190,6 +345,36 @@ int reserve_twice(void *ctx)
}
/* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ char __arena *page;
+ int ret;
+
+ /* Try a NULL pointer. */
+ ret = bpf_arena_reserve_pages(&arena, NULL, 3);
+ if (ret != -EINVAL)
+ return 1;
+
+ page = arena_base(&arena);
+
+ ret = bpf_arena_reserve_pages(&arena, page, 3);
+ if (ret != -EINVAL)
+ return 2;
+
+ ret = bpf_arena_reserve_pages(&arena, page, 4096);
+ if (ret != -EINVAL)
+ return 3;
+
+ ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1);
+ if (ret != -EINVAL)
+ return 4;
+#endif
+ return 0;
+}
+
SEC("syscall")
__success __retval(0)
int reserve_invalid_region(void *ctx)
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
index 2b8cf2a4d880..4ca491cbe8d1 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
return 9;
return 0;
}
+
+SEC("socket")
+__success __retval(0)
+int big_alloc3(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+ char __arena *pages;
+ u64 i;
+
+ /*
+ * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
+ * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
+ * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
+ * pages.
+ */
+ pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
+ if (!pages)
+ return -1;
+
+ bpf_for(i, 0, 2051)
+ pages[i * PAGE_SIZE] = 123;
+ bpf_for(i, 0, 2051)
+ if (pages[i * PAGE_SIZE] != 123)
+ return i;
+
+ bpf_arena_free_pages(&arena, pages, 2051);
+#endif
+ return 0;
+}
#endif
char _license[] SEC("license") = "GPL";
--
2.47.3
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
@ 2025-12-23 5:03 ` Alexei Starovoitov
2025-12-23 14:51 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-23 5:03 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Kernel Team
On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> int reserve_invalid_region(void *ctx)
> diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> index 2b8cf2a4d880..4ca491cbe8d1 100644
> --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
> return 9;
> return 0;
> }
> +
> +SEC("socket")
> +__success __retval(0)
> +int big_alloc3(void *ctx)
> +{
> +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> + char __arena *pages;
> + u64 i;
> +
> + /*
> + * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> + * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> + * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> + * pages.
> + */
> + pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> + if (!pages)
> + return -1;
> +
> + bpf_for(i, 0, 2051)
> + pages[i * PAGE_SIZE] = 123;
> + bpf_for(i, 0, 2051)
> + if (pages[i * PAGE_SIZE] != 123)
> + return i;
> +
> + bpf_arena_free_pages(&arena, pages, 2051);
> +#endif
> + return 0;
> +}
CI says that it's failing on arm64.
Error: #511/6 verifier_arena_large/big_alloc3
run_subtest:FAIL:1299 Unexpected retval: -1 != 0
cannot quite tell whether it's sporadic or caused by this patch set.
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-23 5:03 ` Alexei Starovoitov
@ 2025-12-23 14:51 ` Puranjay Mohan
2025-12-23 19:35 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-23 14:51 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Tue, Dec 23, 2025 at 5:04 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> > int reserve_invalid_region(void *ctx)
> > diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > index 2b8cf2a4d880..4ca491cbe8d1 100644
> > --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
> > return 9;
> > return 0;
> > }
> > +
> > +SEC("socket")
> > +__success __retval(0)
> > +int big_alloc3(void *ctx)
> > +{
> > +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > + char __arena *pages;
> > + u64 i;
> > +
> > + /*
> > + * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> > + * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> > + * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> > + * pages.
> > + */
> > + pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> > + if (!pages)
> > + return -1;
> > +
> > + bpf_for(i, 0, 2051)
> > + pages[i * PAGE_SIZE] = 123;
> > + bpf_for(i, 0, 2051)
> > + if (pages[i * PAGE_SIZE] != 123)
> > + return i;
> > +
> > + bpf_arena_free_pages(&arena, pages, 2051);
> > +#endif
> > + return 0;
> > +}
>
> CI says that it's failing on arm64.
> Error: #511/6 verifier_arena_large/big_alloc3
> run_subtest:FAIL:1299 Unexpected retval: -1 != 0
>
> cannot quite tell whether it's sporadic or caused by this patch set.
I tried reproducing it locally multiple times and it didn't fail. It
also doesn't fail on manual CI run:
https://github.com/kernel-patches/bpf/actions/runs/20442781110/job/58740000164?pr=10475
I assume it is sporadic.
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-23 14:51 ` Puranjay Mohan
@ 2025-12-23 19:35 ` Alexei Starovoitov
2025-12-23 23:13 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-23 19:35 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Tue, Dec 23, 2025 at 4:51 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> [...]
>
> I tried reproducing it locally multiple times and it didn't fail. It
> also doesn't fail on a manual CI run:
> https://github.com/kernel-patches/bpf/actions/runs/20442781110/job/58740000164?pr=10475
>
> I assume it is sporadic.
Ok. Applied. Let's watch for this. If it's actually flaky,
we need to fix it.
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-23 19:35 ` Alexei Starovoitov
@ 2025-12-23 23:13 ` Puranjay Mohan
2025-12-24 0:02 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-23 23:13 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Tue, Dec 23, 2025 at 7:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> Ok. Applied. Let's watch for this. If it's actually flaky,
> we need to fix it.
I have found out why it fails sometimes:
arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
alloc_pages_nolock(1): the last call is made in a loop and sometimes
fails, as my debug prints show:
__bpf_alloc_page: alloc_pages_nolock failed for nid=-1
bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
already allocated pages
bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
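To spell out the shape of that path: the pages array is filled one page
at a time, and a single opportunistic allocation failure aborts the whole
batch. A paraphrased sketch of the control flow (the function names here
are stand-ins for the kernel's nolock primitives, not real signatures):

/* Paraphrased sketch only, not the actual kernel code. */
static int alloc_batch(struct page **pages, unsigned long nr, int nid)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		/* Opportunistic: may fail even with free memory around. */
		pages[i] = try_alloc_page_nolock(nid);
		if (!pages[i]) {
			while (i--)	/* unwind pages already allocated */
				free_page_nolock(pages[i]);
			return -ENOMEM;	/* the whole batch fails */
		}
	}
	return 0;
}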
The VM runs with 4G of memory; when I changed this to 8G, the test
stopped failing. So, I think we can do the same for the CI.
The CI currently runs through vmtest, which starts a VM with 4G of
memory and 2 CPUs by default. I checked the CI logs and saw:
[ 0.626933] smp: Brought up 1 node, 2 CPUs
[ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
[...]
[ 0.629145] Memory: 3388084K/4193784K available
I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
Here is a PR for this change: https://github.com/libbpf/ci/pull/206
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-23 23:13 ` Puranjay Mohan
@ 2025-12-24 0:02 ` Alexei Starovoitov
2025-12-24 0:28 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-24 0:02 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Tue, Dec 23, 2025 at 1:13 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> [...]
>
> I have found out why it fails sometimes:
>
> arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
> alloc_pages_nolock(1): the last call is made in a loop and sometimes
> fails, as my debug prints show:
>
> __bpf_alloc_page: alloc_pages_nolock failed for nid=-1
> bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
> already allocated pages
> bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
> fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
>
>
> The VM runs with 4G of memory; when I changed this to 8G, the test stopped failing.
That doesn't quite make sense.
The test allocates 2051 pages, which is just 8 MB, nowhere close to
a gigabyte, so 4G should be plenty.
The number of CPUs shouldn't matter either.
> So, I think we can do the same for the CI.
> The CI currently runs through vmtest, which starts a VM with 4G of
> memory and 2 CPUs by default. I checked the CI logs and saw:
>
> [ 0.626933] smp: Brought up 1 node, 2 CPUs
> [ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
> [...]
> [ 0.629145] Memory: 3388084K/4193784K available
>
>
> I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
>
> Here is a PR for this change: https://github.com/libbpf/ci/pull/206
I don't think we should bump it without full understanding.
It's better to make selftest recover on page alloc failure.
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-24 0:02 ` Alexei Starovoitov
@ 2025-12-24 0:28 ` Puranjay Mohan
2025-12-24 0:29 ` Alexei Starovoitov
0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-24 0:28 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Wed, Dec 24, 2025 at 12:02 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> That doesn't quite make sense.
> The test allocates 2051 pages, which is just 8 MB, nowhere close to
> a gigabyte, so 4G should be plenty.
> The number of CPUs shouldn't matter either.
>
> > [...]
> > I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
> >
> > Here is a PR for this change: https://github.com/libbpf/ci/pull/206
>
> I don't think we should bump it without full understanding.
> It's better to make selftest recover on page alloc failure.
Okay, I will debug deeper to find out exactly where it fails in
alloc_pages_nolock().
For now, do we want to let the CI fail, or should I send a patch with the following:
--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -300,7 +300,7 @@ int big_alloc3(void *ctx)
*/
pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
if (!pages)
- return -1;
+ return 0;
bpf_for(i, 0, 2051)
pages[i * PAGE_SIZE] = 123;
This will make this test unconditionally pass.
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-24 0:28 ` Puranjay Mohan
@ 2025-12-24 0:29 ` Alexei Starovoitov
2025-12-24 19:06 ` Puranjay Mohan
0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-24 0:29 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Tue, Dec 23, 2025 at 2:28 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> [...]
> >
> > I don't think we should bump it without full understanding.
> > It's better to make selftest recover on page alloc failure.
>
>
> Okay, I will debug deeper to find out exactly where it fails in
> alloc_pages_nolock().
> For now, do we want to let the CI fail, or should I send a patch with the following:
>
> --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> @@ -300,7 +300,7 @@ int big_alloc3(void *ctx)
> */
> pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> if (!pages)
> - return -1;
> + return 0;
>
> bpf_for(i, 0, 2051)
> pages[i * PAGE_SIZE] = 123;
>
> This will make this test unconditionally pass.
Pls make it skip on failure instead of pass.
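For reference, one pattern already used elsewhere in bpf selftests is a
global flag that the program sets and the userspace runner checks before
calling test__skip(). A hedged sketch of that shape (illustrative only;
the actual follow-up patch may do it differently, and the test_loader-driven
verifier tests may need their own mechanism):

/* BPF program side (sketch): record a skip instead of failing. */
bool test_skipped = false;

SEC("socket")
int big_alloc3(void *ctx)
{
	char __arena *pages;

	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
	if (!pages) {
		test_skipped = true;	/* alloc failed: skip, don't fail */
		return 0;
	}
	/* ... touch and verify the pages as before ... */
	bpf_arena_free_pages(&arena, pages, 2051);
	return 0;
}

/* Userspace runner side (sketch):
 *
 *	if (skel->bss->test_skipped)
 *		test__skip();
 */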
* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
2025-12-24 0:29 ` Alexei Starovoitov
@ 2025-12-24 19:06 ` Puranjay Mohan
0 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-24 19:06 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Kernel Team
On Wed, Dec 24, 2025 at 12:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> [...]
>
> Pls make it skip on failure instead of pass.
I extracted more information using some debug prints (AI generated):
[ 29.946603] [mm/page_alloc.c:3386] PCP list[0] empty
[ 29.946642] Zone: DMA
[ 29.946649] CPU: 1
[ 29.946655] Migratetype: 0
[ 29.946662] Order: 0
[ 29.946668] Total PCP count: 491 (all migratetypes)
[ 29.946681] PCP high: 46754
[ 29.946689] [mm/page_alloc.c:3214] spin_trylock_irqsave(&zone->lock) FAILED
[ 29.946706] Zone: DMA
[ 29.946713] CPU: 1
[ 29.946719] Retry attempts: 3
[ 29.946727] Cycles spent: 384
[ 29.946734] Zone free pages: 221198
[ 29.946743] Zone watermarks: min=9903 low=12378 high=14853
[ 29.946757] Zone managed pages: 751344
[ 29.946767] [mm/page_alloc.c:3977] rmqueue() returned NULL for zone DMA
[ 29.946783] [mm/page_alloc.c:4010] get_page_from_freelist() failed
[ 29.946797] Zones attempted: 1
[ 29.946805] skip_kswapd_nodes: 0
[ 29.946814] skipped_kswapd_nodes: 0
[ 29.946823] ============================================================
[ 29.946838] alloc_pages_nolock() FAILED
[ 29.946847] Order: 0
[ 29.946852] Node: 0
[ 29.946858] Context: preempt_count=514 irqs_disabled=1 in_interrupt=512
[ 29.946874] Architecture: ARM64
[ 29.946881] Page size: 4096 bytes
[ 29.946889] ============================================================
[ 29.946905] bpf_map_alloc_pages() failed: page 670/1024 (nid=-1)
The failure occurs when allocating 1024 pages one by one in softirq
context on ARM64:

1. PCP exhaustion (mm/page_alloc.c:3386): after ~670 pages, the PCP
   list for migratetype 0 (MIGRATE_UNMOVABLE) becomes empty, despite
   491 pages remaining in other migratetype lists.

2. Zone lock contention (mm/page_alloc.c:3214): the fallback to the
   buddy allocator requires zone->lock, but spin_trylock_irqsave()
   fails after 3 attempts (384 cycles), even though 221,198 free
   pages are available.
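The second point is the crux: the nolock allocation path must not spin on
zone->lock, since it can be entered from any context, so under contention
it bails out even though plenty of memory is free. A generic illustration
of that opportunistic pattern (not the actual mm/page_alloc.c code; the
freelist helper is a stand-in):

/* Illustrative pattern only, not mm/page_alloc.c. */
static struct page *opportunistic_rmqueue(struct zone *zone)
{
	unsigned long flags;
	struct page *page;

	/* A nolock caller may run in IRQ/NMI context, so it must not
	 * wait for the lock: if it is contended, fail immediately.
	 */
	if (!spin_trylock_irqsave(&zone->lock, flags))
		return NULL;	/* caller sees this as -ENOMEM */
	page = take_page_from_freelist(zone);	/* stand-in helper */
	spin_unlock_irqrestore(&zone->lock, flags);
	return page;
}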
end of thread

Thread overview: 13+ messages
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
2025-12-23 5:03 ` Alexei Starovoitov
2025-12-23 14:51 ` Puranjay Mohan
2025-12-23 19:35 ` Alexei Starovoitov
2025-12-23 23:13 ` Puranjay Mohan
2025-12-24 0:02 ` Alexei Starovoitov
2025-12-24 0:28 ` Puranjay Mohan
2025-12-24 0:29 ` Alexei Starovoitov
2025-12-24 19:06 ` Puranjay Mohan