BPF List
* [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs
@ 2025-12-22 19:50 Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

V7: https://lore.kernel.org/all/20251222190815.4112944-1-puranjay@kernel.org/
Changes in V7->v8:
- Use clear_lo32(arena->user_vm_start) in place of user_vm_start in patch 3

V6: https://lore.kernel.org/all/20251217184438.3557859-1-puranjay@kernel.org/
Changes in v6->v7:
- Fix a deadlock in patch 1, that was being fixed in patch 2. Move the fix to patch 1.
- Call flush_cache_vmap() after setting up the mappings as it is
  required by some architectures.

V5: https://lore.kernel.org/all/20251212044516.37513-1-puranjay@kernel.org/
Changes in v5->v6:
Patch 1:
	- Add a missing ; to make sure this patch builds individually. (AI)

V4: https://lore.kernel.org/all/20251212004350.6520-1-puranjay@kernel.org/
Changes in v4->v5:
Patch 1:
	- Fix a memory leak in arena_alloc_pages(). It was being fixed in
	  Patch 3, but every patch should be complete in itself. (AI)
Patch 3:
	- Don't do useless addition in arena_alloc_pages() (Alexei)
	- Add a comment about kmalloc_nolock() failure and expectations.

v3: https://lore.kernel.org/all/20251117160150.62183-1-puranjay@kernel.org/
Changes in v3->v4:
	- Coding style changes related to comments in Patch 2/3 (Alexei)

v2: https://lore.kernel.org/all/20251114111700.43292-1-puranjay@kernel.org/
Changes in v2->v3:
Patch 1:
        - Call range_tree_destroy() in error path of
          populate_pgtable_except_pte() in arena_map_alloc() (AI)
Patch 2:
        - Fix double mutex_unlock() in the error path of
          arena_alloc_pages() (AI)
        - Fix coding style issues (Alexei)
Patch 3:
        - Unlock spinlock before returning from arena_vm_fault() in case
          BPF_F_SEGV_ON_FAULT is set by user. (AI)
        - Use __llist_del_all() in place of llist_del_all for on-stack
          llist (free_pages) (Alexei)
        - Fix build issues on 32-bit systems where arena.c is not compiled.
          (kernel test robot)
        - Make bpf_arena_alloc_pages() polymorphic so it knows if it has
          been called in sleepable or non-sleepable context. This
          information is passed to arena_free_pages() in the error path.
Patch 4:
        - Add a better comment for the big_alloc3() test that triggers
          kmalloc_nolock()'s limit and checks that bpf_arena_alloc_pages()
          works correctly above this limit.

v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
        - Import tlbflush.h to fix build issue in loongarch. (kernel
          test robot)
        - Fix unused variable error in apply_range_clear_cb() (kernel
          test robot)
        - Call bpf_map_area_free() on error path of
          populate_pgtable_except_pte() (AI)
        - Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
        - Cap allocation made by kmalloc_nolock() for pages array to
          KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
          to overcome this limit. (AI)
Patch 3:
        - Do page_ref_add(page, 1); under the spinlock to mitigate a
          race (AI)
Patch 4:
        - Add a new testcase big_alloc3() in verifier_arena_large.c that
          tries to allocate a large number of pages at once. This is to
          trigger the kmalloc_nolock() limit in Patch 2 and see if the
          loop logic works correctly.

This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:

The range_tree is now protected by an rqspinlock instead of a mutex;
this change alone is enough to make bpf_arena_reserve_pages() safe to
call from any context.

bpf_arena_alloc_pages() had four points where it could sleep:

1. Mutex to protect range_tree: now replaced with rqspinlock

2. kvcalloc() for allocations: now replaced with kmalloc_nolock()

3. Allocating pages with bpf_map_alloc_pages(): this already calls
   alloc_pages_nolock() in non-sleepable contexts and therefore is safe.

4. Setting up kernel page tables with vm_area_map_pages():
   vm_area_map_pages() may allocate memory while inserting pages into
   bpf arena's vm_area. Now, at arena creation time populate all page
   table levels except the last level and when new pages need to be
   inserted call apply_to_page_range() again which will only do
   set_pte_at() for those pages and will not allocate memory.

The above four changes make bpf_arena_alloc_pages() any context safe.
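
As a rough user-space illustration of item 4 (this is not the kernel code),
the toy two-level table below allocates every leaf table at creation time,
so a later insert only writes a last-level slot and never allocates. All
toy_* names and the table sizes are hypothetical and exist only for this
sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_DIR_ENTRIES  4	/* hypothetical number of intermediate entries */
#define TOY_LEAF_ENTRIES 8	/* hypothetical number of slots per leaf table */

struct toy_pgtable {
	uintptr_t *leaf[TOY_DIR_ENTRIES];	/* intermediate level */
	int nr_table_allocs;			/* counts table allocations */
};

/* Creation time: allocate every leaf table up front. */
static int toy_populate_except_pte(struct toy_pgtable *pt)
{
	int i;

	pt->nr_table_allocs = 0;
	for (i = 0; i < TOY_DIR_ENTRIES; i++)
		pt->leaf[i] = NULL;
	for (i = 0; i < TOY_DIR_ENTRIES; i++) {
		pt->leaf[i] = calloc(TOY_LEAF_ENTRIES, sizeof(uintptr_t));
		if (!pt->leaf[i])
			return -ENOMEM;
		pt->nr_table_allocs++;
	}
	return 0;
}

/* Insert time: only write the last-level slot, no allocation. */
static int toy_set_pte(struct toy_pgtable *pt, unsigned int idx, uintptr_t val)
{
	uintptr_t *pte;

	if (idx >= TOY_DIR_ENTRIES * TOY_LEAF_ENTRIES || !val)
		return -EINVAL;
	/* the intermediate level must already exist */
	assert(pt->leaf[idx / TOY_LEAF_ENTRIES]);
	pte = &pt->leaf[idx / TOY_LEAF_ENTRIES][idx % TOY_LEAF_ENTRIES];
	if (*pte)
		return -EBUSY;	/* slot already populated */
	*pte = val;
	return 0;
}

/* Free time: clear the slot and return its old value; the leaf table stays. */
static uintptr_t toy_clear_pte(struct toy_pgtable *pt, unsigned int idx)
{
	uintptr_t *pte, old;

	if (idx >= TOY_DIR_ENTRIES * TOY_LEAF_ENTRIES)
		return 0;
	pte = &pt->leaf[idx / TOY_LEAF_ENTRIES][idx % TOY_LEAF_ENTRIES];
	old = *pte;
	*pte = 0;
	return old;
}

/* Teardown: release all leaf tables (free(NULL) is a no-op). */
static void toy_destroy(struct toy_pgtable *pt)
{
	int i;

	for (i = 0; i < TOY_DIR_ENTRIES; i++)
		free(pt->leaf[i]);
}
```

Because clearing a slot leaves the leaf table in place, a later insert at the
same index still finds its intermediate level and does not need to allocate.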

bpf_arena_free_pages() has to do the following steps:

1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. flush the tlb: done in step 2, already.
4. zap_pages(): to unmap pages from user page tables
5. free pages.

The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except for the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
   table levels, apply_to_existing_page_range() with a callback is used
   that only does pte_clear() on the last level and leaves the intermediate
   page table levels intact. This is needed to make sure that
   bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
   intermediate page tables.

When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, the uaddr and page
count are queued on a lock-less list of struct arena_free_span.
kmalloc_nolock() is used to allocate the arena_free_span. This allocation
can fail, which is a trade-off we have to accept for frees done from
non-sleepable contexts.

arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and, for each queued uaddr and page count,
clears PTEs, flushes TLBs, zaps pages, and frees pages.

apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used to link them.
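
The deferred-free queue described above can be sketched in user space with
C11 atomics. This is only an analogy for the llist-based scheme, not the
kernel implementation: the toy_* names are hypothetical, malloc() stands in
for kmalloc_nolock(), and the worker merely sums the page counts where the
real handler would clear PTEs, flush TLBs, zap pages, and free them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_free_span {
	struct toy_free_span *next;
	unsigned long uaddr;
	unsigned int page_cnt;
};

struct toy_span_list {
	_Atomic(struct toy_free_span *) first;
};

static void toy_span_list_init(struct toy_span_list *l)
{
	atomic_init(&l->first, (struct toy_free_span *)NULL);
}

/*
 * Producer side: queue one span without taking a lock.
 * Returns -1 if the span cannot be allocated; in that case
 * nothing is queued.
 */
static int toy_defer_free(struct toy_span_list *l, unsigned long uaddr,
			  unsigned int page_cnt)
{
	struct toy_free_span *s = malloc(sizeof(*s));

	assert(l);
	if (!s)
		return -1;
	s->uaddr = uaddr;
	s->page_cnt = page_cnt;
	s->next = atomic_load(&l->first);
	/* on failure, s->next is refreshed with the current head */
	while (!atomic_compare_exchange_weak(&l->first, &s->next, s))
		;
	return 0;
}

/*
 * Worker side: detach the whole list with one atomic exchange, then
 * walk it privately. Returns the number of pages handled.
 */
static unsigned long toy_free_worker(struct toy_span_list *l)
{
	struct toy_free_span *s;
	unsigned long pages = 0;

	s = atomic_exchange(&l->first, (struct toy_free_span *)NULL);
	while (s) {
		struct toy_free_span *next = s->next;

		pages += s->page_cnt;
		free(s);
		s = next;
	}
	return pages;
}
```

Detaching the list in a single exchange means the worker never contends with
producers while it processes spans; anything queued afterwards is picked up
by the next run.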

Puranjay Mohan (4):
  bpf: arena: populate vm_area without allocating memory
  bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  bpf: arena: make arena kfuncs any context safe
  selftests: bpf: test non-sleepable arena allocations

 include/linux/bpf.h                           |  16 +
 kernel/bpf/arena.c                            | 380 +++++++++++++++---
 kernel/bpf/verifier.c                         |  10 +
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 +
 .../selftests/bpf/progs/verifier_arena.c      | 185 +++++++++
 .../bpf/progs/verifier_arena_large.c          |  29 ++
 7 files changed, 592 insertions(+), 59 deletions(-)


base-commit: f785a31395d9cafb8b2c42c7358fad72a6463142
-- 
2.47.3



* [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory
  2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc
non-sleepable, change bpf arena to populate pages without allocating
memory:
- at arena creation time, populate all page table levels except
  the last level
- when new pages need to be inserted, call apply_to_page_range() again
  with apply_range_set_cb(), which will only set_pte_at() those pages and
  will not allocate memory.
- when freeing pages, call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the pte for the page to be removed. This
  doesn't free intermediate page table levels.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 100 ++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 90 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 872dc0e41c65..55b198b9f1a3 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -2,11 +2,13 @@
 /* Copyright (c) 2024 Meta Platforms, Inc. and affiliates. */
 #include <linux/bpf.h>
 #include <linux/btf.h>
+#include <linux/cacheflush.h>
 #include <linux/err.h>
 #include "linux/filter.h"
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <asm/tlbflush.h>
 #include "range_tree.h"
 
 /*
@@ -92,6 +94,68 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
 }
 
+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	d->i++;
+	return 0;
+}
+
+static void flush_vmap_cache(unsigned long start, unsigned long size)
+{
+	flush_cache_vmap(start, start + size);
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	/* get page and free it */
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(&init_mm, addr, pte);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	__free_page(page);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 {
 	struct vm_struct *kern_vm;
@@ -144,6 +208,12 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	err = populate_pgtable_except_pte(arena);
+	if (err) {
+		range_tree_destroy(&arena->rt);
+		bpf_map_area_free(arena);
+		goto err;
+	}
 
 	return &arena->map;
 err:
@@ -286,6 +356,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return VM_FAULT_SIGSEGV;
 
+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
@@ -293,12 +364,13 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 	}
 
-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		__free_page(page);
 		return VM_FAULT_SIGSEGV;
 	}
+	flush_vmap_cache(kaddr, PAGE_SIZE);
 out:
 	page_ref_add(page, 1);
 	vmf->page = page;
@@ -428,7 +500,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct page **pages = NULL;
+	long mapped = 0;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -450,7 +523,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (!pages)
 		return 0;
 
-	guard(mutex)(&arena->lock);
+	mutex_lock(&arena->lock);
 
 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -465,6 +538,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;
 
+	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
@@ -477,18 +551,24 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
 	 * lower 32-bit and it's ok.
 	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
-	if (ret) {
-		for (i = 0; i < page_cnt; i++)
+	apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+			    page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
+	mapped = data.i;
+	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
+	if (mapped < page_cnt) {
+		for (i = mapped; i < page_cnt; i++)
 			__free_page(pages[i]);
 		goto out;
 	}
+	mutex_unlock(&arena->lock);
 	kvfree(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
-	range_tree_set(&arena->rt, pgoff, page_cnt);
+	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
 out_free_pages:
+	mutex_unlock(&arena->lock);
+	if (mapped)
+		arena_free_pages(arena, uaddr32, mapped);
 	kvfree(pages);
 	return 0;
 }
@@ -545,8 +625,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
-		__free_page(page);
+		apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
+					     NULL);
 	}
 }
 
-- 
2.47.3



* [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
  3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks. kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, cap the allocation done by kmalloc_nolock() at 1024 * 8
bytes and reuse the array in a loop.
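
The batching arithmetic can be sketched in user space as follows. The
constants are assumptions for illustration: TOY_SCRATCH_MAX_BYTES stands in
for KMALLOC_MAX_CACHE_SIZE with 4KB pages, and TOY_SLOT_BYTES assumes 8-byte
page pointers. The toy_* names are hypothetical.

```c
#include <assert.h>

#define TOY_SCRATCH_MAX_BYTES 8192L	/* assumed 8KB allocation limit */
#define TOY_SLOT_BYTES        8L	/* assumed size of one page pointer */

/* Number of pointer slots in the capped scratch array. */
static long toy_scratch_slots(long page_cnt)
{
	long cap = TOY_SCRATCH_MAX_BYTES / TOY_SLOT_BYTES;

	assert(page_cnt > 0);
	return page_cnt < cap ? page_cnt : cap;
}

/*
 * Walk page_cnt pages in batches that fit the scratch array and
 * return how many batches were needed.
 */
static long toy_count_batches(long page_cnt)
{
	long slots = toy_scratch_slots(page_cnt);
	long remaining = page_cnt;
	long batches = 0;

	while (remaining) {
		long this_batch = remaining < slots ? remaining : slots;

		/* the patch zeroes the array, allocates and maps this_batch pages here */
		remaining -= this_batch;
		batches++;
	}
	return batches;
}
```

With these assumed constants the array holds 1024 entries, so a request for
2500 pages is served in three passes over the same array.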

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 84 ++++++++++++++++++++++++++++++----------------
 1 file changed, 55 insertions(+), 29 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 55b198b9f1a3..128efb68d47b 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -44,6 +44,8 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)
 
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+
 struct bpf_arena {
 	struct bpf_map map;
 	u64 user_vm_start;
@@ -500,8 +502,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	struct apply_range_data data;
 	struct page **pages = NULL;
-	long mapped = 0;
+	long remaining, mapped = 0;
+	long alloc_pages;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -518,17 +522,19 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			return 0;
 	}
 
-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	/* Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed. */
+	alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+	pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
 	if (!pages)
 		return 0;
+	data.pages = pages;
 
 	mutex_lock(&arena->lock);
 
 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -536,40 +542,60 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
-
-	struct apply_range_data data = { .pages = pages, .i = 0 };
-	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
-	if (ret)
-		goto out;
+		goto out_unlock_free_pages;
 
+	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
-	/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
-	 * will not overflow 32-bit. Lower 32-bit need to represent
-	 * contiguous user address range.
-	 * Map these pages at kern_vm_start base.
-	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
-	 * lower 32-bit and it's ok.
-	 */
-	apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
-			    page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
-	mapped = data.i;
-	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
-	if (mapped < page_cnt) {
-		for (i = mapped; i < page_cnt; i++)
-			__free_page(pages[i]);
-		goto out;
+
+	while (remaining) {
+		long this_batch = min(remaining, alloc_pages);
+
+		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+		memset(pages, 0, this_batch * sizeof(struct page *));
+
+		ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+		if (ret)
+			goto out;
+
+		/*
+		 * Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+		 * will not overflow 32-bit. Lower 32-bit need to represent
+		 * contiguous user address range.
+		 * Map these pages at kern_vm_start base.
+		 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+		 * lower 32-bit and it's ok.
+		 */
+		data.i = 0;
+		ret = apply_to_page_range(&init_mm,
+					  kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+					  this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+		if (ret) {
+			/* data.i pages were mapped, account them and free the remaining */
+			mapped += data.i;
+			for (i = data.i; i < this_batch; i++)
+				__free_page(pages[i]);
+			goto out;
+		}
+
+		mapped += this_batch;
+		remaining -= this_batch;
 	}
+	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
 	mutex_unlock(&arena->lock);
-	kvfree(pages);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
-out_free_pages:
 	mutex_unlock(&arena->lock);
-	if (mapped)
+	if (mapped) {
+		flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
 		arena_free_pages(arena, uaddr32, mapped);
-	kvfree(pages);
+	}
+	goto out_free_pages;
+out_unlock_free_pages:
+	mutex_unlock(&arena->lock);
+out_free_pages:
+	kfree_nolock(pages);
 	return 0;
 }
 
-- 
2.47.3



* [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe
  2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
  2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
  3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the usage of the mutex with a rqspinlock for range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.

specialize_kfunc() is used to replace the sleepable bpf_arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed to a lock-less list of struct arena_free_span
and raises an irq_work. The irq_work handler calls schedule_work() as
it is safe to be called from irq context. arena_free_worker() (the work
queue handler) iterates these spans and clears ptes, flushes tlb, zaps
pages, and calls __free_page().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h   |  16 +++
 kernel/bpf/arena.c    | 248 +++++++++++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c |  10 ++
 3 files changed, 233 insertions(+), 41 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index da6a00dd313f..4e7d72dfbcd4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,22 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
 int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
 				   struct bpf_dynptr *ptr__uninit);
 
+#if defined(CONFIG_MMU) && defined(CONFIG_64BIT)
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt, int node_id,
+					  u64 flags);
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+#else
+static inline void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+							int node_id, u64 flags)
+{
+	return NULL;
+}
+
+static inline void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+}
+#endif
+
 extern const struct bpf_map_ops bpf_map_offload_ops;
 
 /* bpf_type_flag contains a set of flags that are applicable to the values of
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 128efb68d47b..456ac989269d 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -4,7 +4,9 @@
 #include <linux/btf.h>
 #include <linux/cacheflush.h>
 #include <linux/err.h>
+#include <linux/irq_work.h>
 #include "linux/filter.h"
+#include <linux/llist.h>
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
@@ -44,7 +46,7 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)
 
-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable);
 
 struct bpf_arena {
 	struct bpf_map map;
@@ -52,8 +54,23 @@ struct bpf_arena {
 	u64 user_vm_end;
 	struct vm_struct *kern_vm;
 	struct range_tree rt;
+	/* protects rt */
+	rqspinlock_t spinlock;
 	struct list_head vma_list;
+	/* protects vma_list */
 	struct mutex lock;
+	struct irq_work     free_irq;
+	struct work_struct  free_work;
+	struct llist_head   free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+	struct llist_node node;
+	unsigned long uaddr;
+	u32 page_cnt;
 };
 
 u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
@@ -127,7 +144,7 @@ static void flush_vmap_cache(unsigned long start, unsigned long size)
 	flush_cache_vmap(start, start + size);
 }
 
-static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
 {
 	pte_t old_pte;
 	struct page *page;
@@ -137,17 +154,15 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
 	if (pte_none(old_pte) || !pte_present(old_pte))
 		return 0; /* nothing to do */
 
-	/* get page and free it */
 	page = pte_page(old_pte);
 	if (WARN_ON_ONCE(!page))
 		return -EINVAL;
 
 	pte_clear(&init_mm, addr, pte);
 
-	/* ensure no stale TLB entries */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
-	__free_page(page);
+	/* Add page to the list so it is freed later */
+	if (free_pages)
+		__llist_add(&page->pcp_llist, free_pages);
 
 	return 0;
 }
@@ -202,6 +217,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		arena->user_vm_end = arena->user_vm_start + vm_range;
 
 	INIT_LIST_HEAD(&arena->vma_list);
+	init_llist_head(&arena->free_spans);
+	init_irq_work(&arena->free_irq, arena_free_irq);
+	INIT_WORK(&arena->free_work, arena_free_worker);
 	bpf_map_init_from_attr(&arena->map, attr);
 	range_tree_init(&arena->rt);
 	err = range_tree_set(&arena->rt, 0, attr->max_entries);
@@ -210,6 +228,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	raw_res_spin_lock_init(&arena->spinlock);
 	err = populate_pgtable_except_pte(arena);
 	if (err) {
 		range_tree_destroy(&arena->rt);
@@ -256,6 +275,10 @@ static void arena_map_free(struct bpf_map *map)
 	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
 		return;
 
+	/* Ensure no pending deferred frees */
+	irq_work_sync(&arena->free_irq);
+	flush_work(&arena->free_work);
+
 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
 	 * It unmaps everything from vmalloc area and clears pgtables.
@@ -339,12 +362,16 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
 	struct page *page;
 	long kbase, kaddr;
+	unsigned long flags;
 	int ret;
 
 	kbase = bpf_arena_get_kern_vm_start(arena);
 	kaddr = kbase + (u32)(vmf->address);
 
-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		/* Make a reasonable effort to address impossible case */
+		return VM_FAULT_RETRY;
+
 	page = vmalloc_to_page((void *)kaddr);
 	if (page)
 		/* already have a page vmap-ed */
@@ -352,31 +379,35 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 
 	if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
 		/* User space requested to segfault when page is not allocated by bpf prog */
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 
 	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
 	if (ret)
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 
 	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 	}
 
 	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		__free_page(page);
-		return VM_FAULT_SIGSEGV;
+		free_pages_nolock(page, 0);
+		goto out_unlock_sigsegv;
 	}
 	flush_vmap_cache(kaddr, PAGE_SIZE);
 out:
 	page_ref_add(page, 1);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	vmf->page = page;
 	return 0;
+out_unlock_sigsegv:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return VM_FAULT_SIGSEGV;
 }
 
 static const struct vm_operations_struct arena_vm_ops = {
@@ -497,7 +528,8 @@ static u64 clear_lo32(u64 val)
  * Allocate pages and vmap them into kernel vmalloc area.
  * Later the pages will be mmaped into user space vma.
  */
-static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id)
+static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt, int node_id,
+			      bool sleepable)
 {
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
@@ -506,6 +538,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	struct page **pages = NULL;
 	long remaining, mapped = 0;
 	long alloc_pages;
+	unsigned long flags;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -529,7 +562,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	data.pages = pages;
 
-	mutex_lock(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		goto out_free_pages;
 
 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -573,7 +607,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			/* data.i pages were mapped, account them and free the remaining */
 			mapped += data.i;
 			for (i = data.i; i < this_batch; i++)
-				__free_page(pages[i]);
+				free_pages_nolock(pages[i], 0);
 			goto out;
 		}
 
@@ -581,19 +615,19 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		remaining -= this_batch;
 	}
 	flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	if (mapped) {
 		flush_vmap_cache(kern_vm_start + uaddr32, mapped << PAGE_SHIFT);
-		arena_free_pages(arena, uaddr32, mapped);
+		arena_free_pages(arena, uaddr32, mapped, sleepable);
 	}
 	goto out_free_pages;
 out_unlock_free_pages:
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 out_free_pages:
 	kfree_nolock(pages);
 	return 0;
@@ -608,42 +642,64 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 {
 	struct vma_list *vml;
 
+	guard(mutex)(&arena->lock);
+	/* iterate link list under lock */
 	list_for_each_entry(vml, &arena->vma_list, head)
 		zap_page_range_single(vml->vma, uaddr,
 				      PAGE_SIZE * page_cnt, NULL);
 }
 
-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
 {
 	u64 full_uaddr, uaddr_end;
-	long kaddr, pgoff, i;
+	long kaddr, pgoff;
 	struct page *page;
+	struct llist_head free_pages;
+	struct llist_node *pos, *t;
+	struct arena_free_span *s;
+	unsigned long flags;
+	int ret = 0;
 
 	/* only aligned lower 32-bit are relevant */
 	uaddr = (u32)uaddr;
 	uaddr &= PAGE_MASK;
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
 	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
 	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
 	if (full_uaddr >= uaddr_end)
 		return;
 
 	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+	pgoff = compute_pgoff(arena, uaddr);
 
-	guard(mutex)(&arena->lock);
+	if (!sleepable)
+		goto defer;
+
+	ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+
+	/* Can't proceed without holding the spinlock so defer the free */
+	if (ret)
+		goto defer;
 
-	pgoff = compute_pgoff(arena, uaddr);
-	/* clear range */
 	range_tree_set(&arena->rt, pgoff, page_cnt);
 
+	init_llist_head(&free_pages);
+	/* clear ptes and collect struct pages */
+	apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+				     apply_range_clear_cb, &free_pages);
+
+	/* drop the lock to do the tlb flush and zap pages */
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
 	if (page_cnt > 1)
 		/* bulk zap if multiple pages being freed */
 		zap_pages(arena, full_uaddr, page_cnt);
 
-	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
-	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
-		page = vmalloc_to_page((void *)kaddr);
-		if (!page)
-			continue;
+	llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
 		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
 			/* Optimization for the common case of page_cnt==1:
 			 * If page wasn't mapped into some user vma there
@@ -651,9 +707,25 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
-					     NULL);
+		__free_page(page);
 	}
+
+	return;
+
+defer:
+	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+	if (!s)
+		/*
+		 * If allocation fails in non-sleepable context, pages are intentionally left
+		 * inaccessible (leaked) until the arena is destroyed. Cleanup or retries are not
+		 * possible here, so we intentionally omit them for safety.
+		 */
+		return;
+
+	s->page_cnt = page_cnt;
+	s->uaddr = uaddr;
+	llist_add(&s->node, &arena->free_spans);
+	irq_work_queue(&arena->free_irq);
 }
 
 /*
@@ -663,6 +735,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
 {
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	unsigned long flags;
 	long pgoff;
 	int ret;
 
@@ -673,15 +746,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt
 	if (pgoff + page_cnt > page_cnt_max)
 		return -EINVAL;
 
-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		return -EBUSY;
 
 	/* Cannot guard already allocated pages. */
 	ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
-	if (ret)
-		return -EBUSY;
+	if (ret) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	/* "Allocate" the region to prevent it from being allocated. */
-	return range_tree_clear(&arena->rt, pgoff, page_cnt);
+	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+	struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+	struct llist_node *list, *pos, *t;
+	struct arena_free_span *s;
+	u64 arena_vm_start, user_vm_start;
+	struct llist_head free_pages;
+	struct page *page;
+	unsigned long full_uaddr;
+	long kaddr, page_cnt, pgoff;
+	unsigned long flags;
+
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+		schedule_work(work);
+		return;
+	}
+
+	init_llist_head(&free_pages);
+	arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+	user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+	list = llist_del_all(&arena->free_spans);
+	llist_for_each(pos, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		kaddr = arena_vm_start + s->uaddr;
+		pgoff = compute_pgoff(arena, s->uaddr);
+
+		/* clear ptes and collect pages in free_pages llist */
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, &free_pages);
+
+		range_tree_set(&arena->rt, pgoff, page_cnt);
+	}
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+	llist_for_each_safe(pos, t, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		full_uaddr = clear_lo32(user_vm_start) + s->uaddr;
+		kaddr = arena_vm_start + s->uaddr;
+
+		/* ensure no stale TLB entries */
+		flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+		/* remove pages from user vmas */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+		kfree_nolock(s);
+	}
+
+	/* free all pages collected by apply_to_existing_page_range() in the first loop */
+	llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
+		__free_page(page);
+	}
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+	struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+	schedule_work(&arena->free_work);
 }
 
 __bpf_kfunc_start_defs();
@@ -695,9 +840,20 @@ __bpf_kfunc void *bpf_arena_alloc_pages(void *p__map, void *addr__ign, u32 page_
 	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
 		return NULL;
 
-	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id);
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, true);
 }
 
+void *bpf_arena_alloc_pages_non_sleepable(void *p__map, void *addr__ign, u32 page_cnt,
+					  int node_id, u64 flags)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || flags || !page_cnt)
+		return NULL;
+
+	return (void *)arena_alloc_pages(arena, (long)addr__ign, page_cnt, node_id, false);
+}
 __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt)
 {
 	struct bpf_map *map = p__map;
@@ -705,7 +861,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt
 
 	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
 		return;
-	arena_free_pages(arena, (long)ptr__ign, page_cnt);
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, true);
+}
+
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
+		return;
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
 }
 
 __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
@@ -724,9 +890,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(arena_kfuncs)
-BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
 BTF_KFUNCS_END(arena_kfuncs)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d6b8a77fbe3b..2de1a736ef69 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12380,6 +12380,8 @@ enum special_kfunc_type {
 	KF___bpf_trap,
 	KF_bpf_task_work_schedule_signal_impl,
 	KF_bpf_task_work_schedule_resume_impl,
+	KF_bpf_arena_alloc_pages,
+	KF_bpf_arena_free_pages,
 };
 
 BTF_ID_LIST(special_kfunc_list)
@@ -12454,6 +12456,8 @@ BTF_ID(func, bpf_dynptr_file_discard)
 BTF_ID(func, __bpf_trap)
 BTF_ID(func, bpf_task_work_schedule_signal_impl)
 BTF_ID(func, bpf_task_work_schedule_resume_impl)
+BTF_ID(func, bpf_arena_alloc_pages)
+BTF_ID(func, bpf_arena_free_pages)
 
 static bool is_task_work_add_kfunc(u32 func_id)
 {
@@ -22432,6 +22436,12 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc
 	} else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) {
 		if (!env->insn_aux_data[insn_idx].non_sleepable)
 			addr = (unsigned long)bpf_dynptr_from_file_sleepable;
+	} else if (func_id == special_kfunc_list[KF_bpf_arena_alloc_pages]) {
+		if (env->insn_aux_data[insn_idx].non_sleepable)
+			addr = (unsigned long)bpf_arena_alloc_pages_non_sleepable;
+	} else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) {
+		if (env->insn_aux_data[insn_idx].non_sleepable)
+			addr = (unsigned long)bpf_arena_free_pages_non_sleepable;
 	}
 	desc->addr = addr;
 	return 0;
-- 
2.47.3



* [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
                   ` (2 preceding siblings ...)
  2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-12-22 19:50 ` Puranjay Mohan
  2025-12-23  5:03   ` Alexei Starovoitov
  3 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-22 19:50 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Arena kfuncs can now be called from non-sleepable contexts. Test this by
adding non-sleepable copies of the tests in verifier_arena, using a socket
program instead of a syscall program.

Add a new test case in verifier_arena_large to check that
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock(), but
bpf_arena_alloc_pages() should still succeed because it reuses this
array in a loop.
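
The batching this test relies on can be sketched in plain userspace C.
This is only an illustration of the arithmetic, not the kernel code:
BATCH_MAX and split_into_batches() are hypothetical names, and the cap of
1024 is taken from the limit described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cap mirroring the 1024-pointer limit described above. */
#define BATCH_MAX 1024L

/*
 * Split page_cnt into batches of at most BATCH_MAX pages.
 * Stores each batch size in sizes[] (up to cap entries) and returns the
 * total number of batches needed.
 */
static size_t split_into_batches(long page_cnt, long *sizes, size_t cap)
{
	size_t n = 0;
	long remaining = page_cnt;

	assert(page_cnt >= 0);
	while (remaining > 0) {
		long this_batch = remaining < BATCH_MAX ? remaining : BATCH_MAX;

		/* Record the size only if the caller's array has room. */
		if (n < cap)
			sizes[n] = this_batch;
		n++;
		remaining -= this_batch;
	}
	return n;
}
```

For 2051 pages this yields three batches of 1024, 1024 and 3 pages, which
is the split the comment in big_alloc3 expects.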

Augment the arena_list selftest to also run in non-sleepable context by
taking rcu_read_lock.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 ++
 .../selftests/bpf/progs/verifier_arena.c      | 185 ++++++++++++++++++
 .../bpf/progs/verifier_arena_large.c          |  29 +++
 4 files changed, 240 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
index d15867cddde0..4f2866a615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head)
 	return sum;
 }
 
-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
 {
 	LIBBPF_OPTS(bpf_test_run_opts, opts);
 	struct arena_list *skel;
 	int expected_sum = (u64)cnt * (cnt - 1) / 2;
 	int ret, sum;
 
-	skel = arena_list__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+	skel = arena_list__open();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open"))
 		return;
 
+	skel->rodata->nonsleepable = nonsleepable;
+
+	ret = arena_list__load(skel);
+	if (!ASSERT_OK(ret, "arena_list__load"))
+		goto out;
+
 	skel->bss->cnt = cnt;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
 	ASSERT_OK(ret, "ret_add");
@@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt)
 void test_arena_list(void)
 {
 	if (test__start_subtest("arena_list_1"))
-		test_arena_list_add_del(1);
+		test_arena_list_add_del(1, false);
 	if (test__start_subtest("arena_list_1000"))
-		test_arena_list_add_del(1000);
+		test_arena_list_add_del(1000, false);
+	if (test__start_subtest("arena_list_1_nonsleepable"))
+		test_arena_list_add_del(1, true);
+	if (test__start_subtest("arena_list_1000_nonsleepable"))
+		test_arena_list_add_del(1000, true);
 }
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
index 3a2ddcacbea6..235d8cc95bdd 100644
--- a/tools/testing/selftests/bpf/progs/arena_list.c
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head;
 int list_sum;
 int cnt;
 bool skip = false;
+const volatile bool nonsleepable = false;
 
 #ifdef __BPF_FEATURE_ADDR_SPACE_CAST
 long __arena arena_sum;
@@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1");
 
 int zero;
 
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
 SEC("syscall")
 int arena_list_add(void *ctx)
 {
@@ -71,6 +75,10 @@ int arena_list_del(void *ctx)
 	struct elem __arena *n;
 	int sum = 0;
 
+	/* Take rcu_read_lock to test non-sleepable context */
+	if (nonsleepable)
+		bpf_rcu_read_lock();
+
 	arena_sum = 0;
 	list_for_each_entry(n, list_head, node) {
 		sum += n->value;
@@ -79,6 +87,9 @@ int arena_list_del(void *ctx)
 		bpf_free(n);
 	}
 	list_sum = sum;
+
+	if (nonsleepable)
+		bpf_rcu_read_unlock();
 #else
 	skip = true;
 #endif
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
index 7f4827eede3c..4a9d96344813 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -21,6 +21,37 @@ struct {
 #endif
 } arena SEC(".maps");
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile int __arena *page1, *page2, *no_page;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+		return 7;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc1(void *ctx)
@@ -60,6 +91,44 @@ int basic_alloc1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0 && *page1 != 1)
+		return 5;
+	if (*page2 != 0 && *page2 != 2)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc2(void *ctx)
@@ -102,6 +171,19 @@ struct bpf_arena___l {
         struct bpf_map map;
 } __attribute__((preserve_access_index));
 
+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+	struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+	volatile char __arena *pages;
+
+	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+	if (!pages)
+		return 1;
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0) __log_level(2)
 int basic_alloc3(void *ctx)
@@ -115,6 +197,38 @@ int basic_alloc3(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page)
+		return 1;
+
+	page += __PAGE_SIZE;
+
+	/* Reserve the second page */
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 2;
+
+	/* Try to explicitly allocate the reserved page. */
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 3;
+
+	/* Try to implicitly allocate the page (since there's only 2 of them). */
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve1(void *ctx)
@@ -147,6 +261,26 @@ int basic_reserve1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if ((u64)page)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve2(void *ctx)
@@ -168,6 +302,27 @@ int basic_reserve2(void *ctx)
 }
 
 /* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret != -EBUSY)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_twice(void *ctx)
@@ -190,6 +345,36 @@ int reserve_twice(void *ctx)
 }
 
 /* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	/* Try a NULL pointer. */
+	ret = bpf_arena_reserve_pages(&arena, NULL, 3);
+	if (ret != -EINVAL)
+		return 1;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 3);
+	if (ret != -EINVAL)
+		return 2;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 4096);
+	if (ret != -EINVAL)
+		return 3;
+
+	ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1);
+	if (ret != -EINVAL)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_invalid_region(void *ctx)
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
index 2b8cf2a4d880..4ca491cbe8d1 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
 		return 9;
 	return 0;
 }
+
+SEC("socket")
+__success __retval(0)
+int big_alloc3(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *pages;
+	u64 i;
+
+	/*
+	 * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
+	 * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
+	 * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
+	 * pages.
+	 */
+	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
+	if (!pages)
+		return -1;
+
+	bpf_for(i, 0, 2051)
+			pages[i * PAGE_SIZE] = 123;
+	bpf_for(i, 0, 2051)
+			if (pages[i * PAGE_SIZE] != 123)
+				return i;
+
+	bpf_arena_free_pages(&arena, pages, 2051);
+#endif
+	return 0;
+}
 #endif
 char _license[] SEC("license") = "GPL";
-- 
2.47.3



* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
@ 2025-12-23  5:03   ` Alexei Starovoitov
  2025-12-23 14:51     ` Puranjay Mohan
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-23  5:03 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Kernel Team

On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
>  int reserve_invalid_region(void *ctx)
> diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> index 2b8cf2a4d880..4ca491cbe8d1 100644
> --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
>                 return 9;
>         return 0;
>  }
> +
> +SEC("socket")
> +__success __retval(0)
> +int big_alloc3(void *ctx)
> +{
> +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> +       char __arena *pages;
> +       u64 i;
> +
> +       /*
> +        * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> +        * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> +        * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> +        * pages.
> +        */
> +       pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> +       if (!pages)
> +               return -1;
> +
> +       bpf_for(i, 0, 2051)
> +                       pages[i * PAGE_SIZE] = 123;
> +       bpf_for(i, 0, 2051)
> +                       if (pages[i * PAGE_SIZE] != 123)
> +                               return i;
> +
> +       bpf_arena_free_pages(&arena, pages, 2051);
> +#endif
> +       return 0;
> +}

CI says that it's failing on arm64.
Error: #511/6 verifier_arena_large/big_alloc3
run_subtest:FAIL:1299 Unexpected retval: -1 != 0

cannot quite tell whether it's sporadic or caused by this patch set.


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-23  5:03   ` Alexei Starovoitov
@ 2025-12-23 14:51     ` Puranjay Mohan
  2025-12-23 19:35       ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-23 14:51 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Tue, Dec 23, 2025 at 5:04 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> >  int reserve_invalid_region(void *ctx)
> > diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > index 2b8cf2a4d880..4ca491cbe8d1 100644
> > --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
> >                 return 9;
> >         return 0;
> >  }
> > +
> > +SEC("socket")
> > +__success __retval(0)
> > +int big_alloc3(void *ctx)
> > +{
> > +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > +       char __arena *pages;
> > +       u64 i;
> > +
> > +       /*
> > +        * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> > +        * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> > +        * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> > +        * pages.
> > +        */
> > +       pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> > +       if (!pages)
> > +               return -1;
> > +
> > +       bpf_for(i, 0, 2051)
> > +                       pages[i * PAGE_SIZE] = 123;
> > +       bpf_for(i, 0, 2051)
> > +                       if (pages[i * PAGE_SIZE] != 123)
> > +                               return i;
> > +
> > +       bpf_arena_free_pages(&arena, pages, 2051);
> > +#endif
> > +       return 0;
> > +}
>
> CI says that it's failing on arm64.
> Error: #511/6 verifier_arena_large/big_alloc3
> run_subtest:FAIL:1299 Unexpected retval: -1 != 0
>
> cannot quite tell whether it's sporadic or caused by this patch set.

I tried reproducing it locally multiple times and it didn't fail. It
also doesn't fail on manual CI run:
https://github.com/kernel-patches/bpf/actions/runs/20442781110/job/58740000164?pr=10475

I assume it is sporadic.


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-23 14:51     ` Puranjay Mohan
@ 2025-12-23 19:35       ` Alexei Starovoitov
  2025-12-23 23:13         ` Puranjay Mohan
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-23 19:35 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Tue, Dec 23, 2025 at 4:51 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> On Tue, Dec 23, 2025 at 5:04 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> > >
> > >  int reserve_invalid_region(void *ctx)
> > > diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > index 2b8cf2a4d880..4ca491cbe8d1 100644
> > > --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
> > >                 return 9;
> > >         return 0;
> > >  }
> > > +
> > > +SEC("socket")
> > > +__success __retval(0)
> > > +int big_alloc3(void *ctx)
> > > +{
> > > +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > > +       char __arena *pages;
> > > +       u64 i;
> > > +
> > > +       /*
> > > +        * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> > > +        * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> > > +        * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> > > +        * pages.
> > > +        */
> > > +       pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> > > +       if (!pages)
> > > +               return -1;
> > > +
> > > +       bpf_for(i, 0, 2051)
> > > +                       pages[i * PAGE_SIZE] = 123;
> > > +       bpf_for(i, 0, 2051)
> > > +                       if (pages[i * PAGE_SIZE] != 123)
> > > +                               return i;
> > > +
> > > +       bpf_arena_free_pages(&arena, pages, 2051);
> > > +#endif
> > > +       return 0;
> > > +}
> >
> > CI says that it's failing on arm64.
> > Error: #511/6 verifier_arena_large/big_alloc3
> > run_subtest:FAIL:1299 Unexpected retval: -1 != 0
> >
> > cannot quite tell whether it's sporadic or caused by this patch set.
>
> I tried reproducing it locally multiple times and it didn't fail. It
> also doesn't fail on manual CI run:
> https://github.com/kernel-patches/bpf/actions/runs/20442781110/job/58740000164?pr=10475
>
> I assume it is sporadic.

Ok. Applied. Let's watch for this. If it's actually flaky
we need to fix it.


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-23 19:35       ` Alexei Starovoitov
@ 2025-12-23 23:13         ` Puranjay Mohan
  2025-12-24  0:02           ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-23 23:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Tue, Dec 23, 2025 at 7:36 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 23, 2025 at 4:51 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
> >
> > On Tue, Dec 23, 2025 at 5:04 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Mon, Dec 22, 2025 at 9:50 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> > > >
> > > >  int reserve_invalid_region(void *ctx)
> > > > diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > > index 2b8cf2a4d880..4ca491cbe8d1 100644
> > > > --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > > +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > > > @@ -283,5 +283,34 @@ int big_alloc2(void *ctx)
> > > >                 return 9;
> > > >         return 0;
> > > >  }
> > > > +
> > > > +SEC("socket")
> > > > +__success __retval(0)
> > > > +int big_alloc3(void *ctx)
> > > > +{
> > > > +#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
> > > > +       char __arena *pages;
> > > > +       u64 i;
> > > > +
> > > > +       /*
> > > > +        * Allocate 2051 pages in one go to check how kmalloc_nolock() handles large requests.
> > > > +        * Since kmalloc_nolock() can allocate up to 1024 struct page * at a time, this call should
> > > > +        * result in three batches: two batches of 1024 pages each, followed by a final batch of 3
> > > > +        * pages.
> > > > +        */
> > > > +       pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> > > > +       if (!pages)
> > > > +               return -1;
> > > > +
> > > > +       bpf_for(i, 0, 2051)
> > > > +                       pages[i * PAGE_SIZE] = 123;
> > > > +       bpf_for(i, 0, 2051)
> > > > +                       if (pages[i * PAGE_SIZE] != 123)
> > > > +                               return i;
> > > > +
> > > > +       bpf_arena_free_pages(&arena, pages, 2051);
> > > > +#endif
> > > > +       return 0;
> > > > +}
> > >
> > > CI says that it's failing on arm64.
> > > Error: #511/6 verifier_arena_large/big_alloc3
> > > run_subtest:FAIL:1299 Unexpected retval: -1 != 0
> > >
> > > cannot quite tell whether it's sporadic or caused by this patch set.
> >
> > I tried reproducing it locally multiple times and it didn't fail. It
> > also doesn't fail on manual CI run:
> > https://github.com/kernel-patches/bpf/actions/runs/20442781110/job/58740000164?pr=10475
> >
> > I assume it is sporadic.
>
> Ok. Applied. Let's watch for this. If it's actually flaky
> we need to fix it.

I have found out why it fails sometimes:

arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
alloc_pages_nolock(1). The last call is made in a loop and sometimes
fails. From my debug prints:

__bpf_alloc_page: alloc_pages_nolock failed for nid=-1
bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435 already allocated pages
bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
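
The behaviour in the prints above (a per-page loop that frees everything
already allocated and returns -12, i.e. -ENOMEM, on the first failure) can
be modelled in userspace roughly as follows. This is only a sketch, not the
kernel code: try_alloc_page(), release_page(), budget and live_pages are
hypothetical stand-ins for the non-blocking page allocator.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model state: how many more allocations may succeed,
 * and how many pages are currently held. */
static int budget;
static int live_pages;

/* Hypothetical stand-in for a non-blocking single-page allocator:
 * returns NULL once the budget is used up. */
static void *try_alloc_page(void)
{
	void *p;

	if (budget <= 0)
		return NULL;
	p = malloc(4096);
	if (!p)
		return NULL;
	budget--;
	live_pages++;
	return p;
}

static void release_page(void *p)
{
	free(p);
	live_pages--;
}

/* Allocate nr pages one at a time. On the first failure, free the pages
 * allocated so far and return -ENOMEM, so the caller gets all or nothing. */
static int alloc_pages_all_or_nothing(void **pages, size_t nr)
{
	size_t i, j;

	for (i = 0; i < nr; i++) {
		pages[i] = try_alloc_page();
		if (!pages[i]) {
			for (j = 0; j < i; j++) {
				release_page(pages[j]);
				pages[j] = NULL;
			}
			return -ENOMEM;
		}
	}
	return 0;
}
```

With budget set to 435 and a request for 1024 pages, the model fails at
page 435 and leaves no pages held, which mirrors the log.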


The VM runs with 4G of memory. When I changed this to 8G, the test
stopped failing, so I think we can do the same for the CI.
The CI currently runs through vmtest, which by default runs a VM with
4G of memory and 2 CPUs.

I checked the logs of the CI and saw:

[ 0.626933] smp: Brought up 1 node, 2 CPUs
[ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
[...]
[ 0.629145] Memory: 3388084K/4193784K available


I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.

Here is a PR for this change: https://github.com/libbpf/ci/pull/206


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-23 23:13         ` Puranjay Mohan
@ 2025-12-24  0:02           ` Alexei Starovoitov
  2025-12-24  0:28             ` Puranjay Mohan
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-24  0:02 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Tue, Dec 23, 2025 at 1:13 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> I have found out why it fails sometimes:
>
> arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
> alloc_pages_nolock(1) this is called in a loop and fails sometimes,
> from my debug prints:
>
> __bpf_alloc_page: alloc_pages_nolock failed for nid=-1
> bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
> already allocated pages
> bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
> fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
>
>
> The VM runs with 4G of memory, when I changed this to 8G, this stopped failing.

That doesn't quite make sense.
The test allocates 2051 pages, which is only about 8 MB with 4K pages.
That is nowhere close to a GB, so 4 GB should be plenty.
The number of CPUs shouldn't matter either.
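
To spell out the arithmetic, here is a trivial sketch. The helper name
pages_to_bytes() is made up for illustration, and the 4096-byte page size
is an assumption about the test VM.

```c
#include <stdint.h>

/* Bytes needed for nr_pages pages of page_size bytes each. */
static uint64_t pages_to_bytes(uint64_t nr_pages, uint64_t page_size)
{
	return nr_pages * page_size;
}
```

pages_to_bytes(2051, 4096) is 8400896 bytes, i.e. a little over 8 MiB,
which is well under 1% of a 4G guest.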

> So, I think we can do the same for the CI.
> The CI currently runs through vmtest which runs a VM with 4G of memory
> and 2 CPUs by default:
>
> I checked the logs of the CI and saw:
>
> [ 0.626933] smp: Brought up 1 node, 2 CPUs
> [ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
> [...]
> [ 0.629145] Memory: 3388084K/4193784K available
>
>
> I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
>
> Here is a PR for this change: https://github.com/libbpf/ci/pull/206

I don't think we should bump it without fully understanding the failure.
It's better to make the selftest recover from a page allocation failure.


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-24  0:02           ` Alexei Starovoitov
@ 2025-12-24  0:28             ` Puranjay Mohan
  2025-12-24  0:29               ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-24  0:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Wed, Dec 24, 2025 at 12:02 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 23, 2025 at 1:13 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
> >
> > I have found out why it fails sometimes:
> >
> > arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
> > alloc_pages_nolock(1) this is called in a loop and fails sometimes,
> > from my debug prints:
> >
> > __bpf_alloc_page: alloc_pages_nolock failed for nid=-1
> > bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
> > already allocated pages
> > bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
> > fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
> >
> >
> > The VM runs with 4G of memory, when I changed this to 8G, this stopped failing.
>
> That doesn't quite make sense.
> The test allocates 2051 pages, that's just 8 Mbyte. Nowhere
> close to a Gbyte. So 4Gb should be plenty.
> Number of cpus shouldn't matter either.
>
> > So, I think we can do the same for the CI.
> > The CI currently runs through vmtest which runs a VM with 4G of memory
> > and 2 CPUs by default:
> >
> > I checked the logs of the CI and saw:
> >
> > [ 0.626933] smp: Brought up 1 node, 2 CPUs
> > [ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
> > [...]
> > [ 0.629145] Memory: 3388084K/4193784K available
> >
> >
> > I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
> >
> > Here is a PR for this change: https://github.com/libbpf/ci/pull/206
>
> I don't think we should bump it without full understanding.
> It's better to make selftest recover on page alloc failure.


Okay, I will debug deeper to find out exactly where it fails in
alloc_pages_nolock().
For now, do we want to let the CI fail, or should I send a patch with
the following change?

--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -300,7 +300,7 @@ int big_alloc3(void *ctx)
         */
        pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
        if (!pages)
-               return -1;
+               return 0;

        bpf_for(i, 0, 2051)
                        pages[i * PAGE_SIZE] = 123;

This will make the test pass even when the allocation fails.


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-24  0:28             ` Puranjay Mohan
@ 2025-12-24  0:29               ` Alexei Starovoitov
  2025-12-24 19:06                 ` Puranjay Mohan
  0 siblings, 1 reply; 13+ messages in thread
From: Alexei Starovoitov @ 2025-12-24  0:29 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Tue, Dec 23, 2025 at 2:28 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> On Wed, Dec 24, 2025 at 12:02 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Dec 23, 2025 at 1:13 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
> > >
> > > I have found out why it fails sometimes:
> > >
> > > arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
> > > alloc_pages_nolock(1) this is called in a loop and fails sometimes,
> > > from my debug prints:
> > >
> > > __bpf_alloc_page: alloc_pages_nolock failed for nid=-1
> > > bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
> > > already allocated pages
> > > bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
> > > fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
> > >
> > >
> > > The VM runs with 4G of memory, when I changed this to 8G, this stopped failing.
> >
> > That doesn't quite make sense.
> > The test allocates 2051 pages, that's just 8 Mbyte. Nowhere
> > close to a Gbyte. So 4Gb should be plenty.
> > Number of cpus shouldn't matter either.
> >
> > > So, I think we can do the same for the CI.
> > > The CI currently runs through vmtest which runs a VM with 4G of memory
> > > and 2 CPUs by default:
> > >
> > > I checked the logs of the CI and saw:
> > >
> > > [ 0.626933] smp: Brought up 1 node, 2 CPUs
> > > [ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
> > > [...]
> > > [ 0.629145] Memory: 3388084K/4193784K available
> > >
> > >
> > > I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
> > >
> > > Here is a PR for this change: https://github.com/libbpf/ci/pull/206
> >
> > I don't think we should bump it without full understanding.
> > It's better to make selftest recover on page alloc failure.
>
>
> Okay, I will debug deeper to find out exactly where it fails in
> alloc_pages_nolock().
> For now do we want to allow the CI to fail or I can send a patch with following:
>
> --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> @@ -300,7 +300,7 @@ int big_alloc3(void *ctx)
>          */
>         pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
>         if (!pages)
> -               return -1;
> +               return 0;
>
>         bpf_for(i, 0, 2051)
>                         pages[i * PAGE_SIZE] = 123;
>
> This will make this test unconditionally pass.

Please make it skip on failure instead of passing.
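
The distinction between passing and skipping can be sketched as a
three-way classification of the program's return value. This is a
userspace illustration only: the enum, the RETVAL_ALLOC_FAILED sentinel
and classify_retval() are hypothetical names, and this is not how the
selftest harness is actually wired up.

```c
/* Possible outcomes of one test run (hypothetical names). */
enum test_outcome { TEST_PASS, TEST_FAIL, TEST_SKIP };

/* Hypothetical sentinel the program returns when the allocation itself
 * failed, as opposed to the allocated pages misbehaving. */
#define RETVAL_ALLOC_FAILED (-1)

/* Map a program return value to an outcome: an allocation failure is
 * reported as a skip, any other non-zero value as a real failure. */
static enum test_outcome classify_retval(int retval)
{
	if (retval == 0)
		return TEST_PASS;
	if (retval == RETVAL_ALLOC_FAILED)
		return TEST_SKIP;
	return TEST_FAIL;
}
```

A skip keeps the flaky allocation visible in the results, whereas
returning 0 would hide it.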


* Re: [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-12-24  0:29               ` Alexei Starovoitov
@ 2025-12-24 19:06                 ` Puranjay Mohan
  0 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-12-24 19:06 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Wed, Dec 24, 2025 at 12:29 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Dec 23, 2025 at 2:28 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
> >
> > On Wed, Dec 24, 2025 at 12:02 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Dec 23, 2025 at 1:13 PM Puranjay Mohan <puranjay12@gmail.com> wrote:
> > > >
> > > > I have found out why it fails sometimes:
> > > >
> > > > arena_alloc_pages() -> bpf_map_alloc_pages(1024) ->
> > > > alloc_pages_nolock(1) this is called in a loop and fails sometimes,
> > > > from my debug prints:
> > > >
> > > > __bpf_alloc_page: alloc_pages_nolock failed for nid=-1
> > > > bpf_map_alloc_pages: allocation failed at page 435/1024, freeing 435
> > > > already allocated pages
> > > > bpf_map_alloc_pages: returning ret=-12, allocated 435/1024 pages
> > > > fail: bpf_map_alloc_pages failed with ret=-12 for 1024 pages
> > > >
> > > >
> > > > The VM runs with 4G of memory, when I changed this to 8G, this stopped failing.
> > >
> > > That doesn't quite make sense.
> > > The test allocates 2051 pages, that's just 8 Mbyte. Nowhere
> > > close to a Gbyte. So 4Gb should be plenty.
> > > Number of cpus shouldn't matter either.
> > >
> > > > So, I think we can do the same for the CI.
> > > > The CI currently runs through vmtest which runs a VM with 4G of memory
> > > > and 2 CPUs by default:
> > > >
> > > > I checked the logs of the CI and saw:
> > > >
> > > > [ 0.626933] smp: Brought up 1 node, 2 CPUs
> > > > [ 0.628387] smpboot: Total of 2 processors activated (12029.10 BogoMIPS)
> > > > [...]
> > > > [ 0.629145] Memory: 3388084K/4193784K available
> > > >
> > > >
> > > > I think we should change the CI to run vmtest with 8 CPUs and 16G of memory.
> > > >
> > > > Here is a PR for this change: https://github.com/libbpf/ci/pull/206
> > >
> > > I don't think we should bump it without full understanding.
> > > It's better to make selftest recover on page alloc failure.
> >
> >
> > Okay, I will debug deeper to find out exactly where it fails in
> > alloc_pages_nolock().
> > For now do we want to allow the CI to fail or I can send a patch with following:
> >
> > --- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > +++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
> > @@ -300,7 +300,7 @@ int big_alloc3(void *ctx)
> >          */
> >         pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
> >         if (!pages)
> > -               return -1;
> > +               return 0;
> >
> >         bpf_for(i, 0, 2051)
> >                         pages[i * PAGE_SIZE] = 123;
> >
> > This will make this test unconditionally pass.
>
> Pls make it skip on failure instead of pass.


I extracted more information using some debug prints (AI generated):


[   29.946603]   [mm/page_alloc.c:3386] PCP list[0] empty
[   29.946642]     Zone: DMA
[   29.946649]     CPU: 1
[   29.946655]     Migratetype: 0
[   29.946662]     Order: 0
[   29.946668]     Total PCP count: 491 (all migratetypes)
[   29.946681]     PCP high: 46754
[   29.946689]   [mm/page_alloc.c:3214] spin_trylock_irqsave(&zone->lock) FAILED
[   29.946706]     Zone: DMA
[   29.946713]     CPU: 1
[   29.946719]     Retry attempts: 3
[   29.946727]     Cycles spent: 384
[   29.946734]     Zone free pages: 221198
[   29.946743]     Zone watermarks: min=9903 low=12378 high=14853
[   29.946757]     Zone managed pages: 751344
[   29.946767]   [mm/page_alloc.c:3977] rmqueue() returned NULL for zone DMA
[   29.946783]   [mm/page_alloc.c:4010] get_page_from_freelist() failed
[   29.946797]     Zones attempted: 1
[   29.946805]     skip_kswapd_nodes: 0
[   29.946814]     skipped_kswapd_nodes: 0
[   29.946823] ============================================================
[   29.946838] alloc_pages_nolock() FAILED
[   29.946847]   Order: 0
[   29.946852]   Node: 0
[   29.946858]   Context: preempt_count=514 irqs_disabled=1 in_interrupt=512
[   29.946874]   Architecture: ARM64
[   29.946881]   Page size: 4096 bytes
[   29.946889] ============================================================
[   29.946905] bpf_map_alloc_pages() failed: page 670/1024 (nid=-1)
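
For reading the "Context:" line, preempt_count=514 can be split into its
fields. The sketch below assumes the usual layout from
include/linux/preempt.h (8 preempt bits, 8 softirq bits, 4 hardirq bits)
and leaves out the NMI bits; the struct and function names are made up
for illustration.

```c
#include <stdint.h>

/* Field masks, assuming the usual preempt_count layout. */
#define PREEMPT_MASK 0x000000ffu
#define SOFTIRQ_MASK 0x0000ff00u
#define HARDIRQ_MASK 0x000f0000u

struct ctx_bits {
	uint32_t preempt;	/* preempt-disable nesting depth */
	uint32_t softirq;	/* softirq field, still shifted in place */
	uint32_t hardirq;	/* hardirq field, still shifted in place */
};

/* Split a raw preempt_count value into its masked fields. */
static struct ctx_bits decode_preempt_count(uint32_t pc)
{
	struct ctx_bits c = {
		.preempt = pc & PREEMPT_MASK,
		.softirq = pc & SOFTIRQ_MASK,
		.hardirq = pc & HARDIRQ_MASK,
	};

	return c;
}
```

514 is 0x202, so the preempt field is 2, the softirq field is 0x200 (512)
and the hardirq field is 0, which agrees with in_interrupt=512 in the log.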


The failure occurs when allocating 1024 pages one by one in softirq
context on ARM64:

  1. PCP exhaustion (mm/page_alloc.c:3386): after ~670 pages, the PCP
     list for migratetype 0 (MIGRATE_UNMOVABLE) becomes empty, despite
     491 pages remaining in the lists of other migratetypes.
  2. Zone lock contention (mm/page_alloc.c:3214): falling back to the
     buddy allocator requires zone->lock, but spin_trylock_irqsave()
     fails after 3 attempts (384 cycles), even though 221,198 free
     pages are available.
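
The two steps above can be modelled in userspace like this. It is a
minimal sketch, not the mm/page_alloc.c logic: struct zone_model and
alloc_page_nolock() are hypothetical, and a C11 atomic_flag stands in for
zone->lock. The point it illustrates is that a trylock-only slow path can
fail while the shared pool still has plenty of free pages.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of one zone (hypothetical, for illustration only). */
struct zone_model {
	atomic_flag lock;	/* stands in for zone->lock */
	int free_pages;		/* pages in the shared (buddy) pool */
	int pcp_pages;		/* pages in the per-CPU cache */
};

/* Take one page without ever blocking. The per-CPU cache is tried first;
 * the shared pool is only reachable if the lock can be taken within
 * max_tries attempts. Returns false on failure. */
static bool alloc_page_nolock(struct zone_model *z, int max_tries)
{
	int i;

	if (z->pcp_pages > 0) {
		z->pcp_pages--;
		return true;
	}

	for (i = 0; i < max_tries; i++) {
		/* test_and_set returns the old value: false means we got it */
		if (!atomic_flag_test_and_set(&z->lock)) {
			bool ok = z->free_pages > 0;

			if (ok)
				z->free_pages--;
			atomic_flag_clear(&z->lock);
			return ok;
		}
	}
	return false;	/* lock stayed contended; give up */
}
```

If the lock is held by someone else for all three attempts, the call
fails and free_pages is left untouched, as in the debug output.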


Thread overview: 13+ messages
2025-12-22 19:50 [PATCH bpf-next v8 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-12-22 19:50 ` [PATCH bpf-next v8 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
2025-12-23  5:03   ` Alexei Starovoitov
2025-12-23 14:51     ` Puranjay Mohan
2025-12-23 19:35       ` Alexei Starovoitov
2025-12-23 23:13         ` Puranjay Mohan
2025-12-24  0:02           ` Alexei Starovoitov
2025-12-24  0:28             ` Puranjay Mohan
2025-12-24  0:29               ` Alexei Starovoitov
2025-12-24 19:06                 ` Puranjay Mohan
