BPF List
* [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs
@ 2025-11-11 16:34 Puranjay Mohan
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-11-11 16:34 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:

The range_tree is now protected by a rqspinlock instead of a mutex; this
change alone is enough to make bpf_arena_reserve_pages() any context
safe.

bpf_arena_alloc_pages() had four points where it could sleep:

1. Mutex to protect range_tree: now replaced with rqspinlock

2. kvcalloc() for allocations: now replaced with kmalloc_nolock()

3. Allocating pages with bpf_map_alloc_pages(): this already calls
   alloc_pages_nolock() in non-sleepable contexts and therefore is safe.

4. Setting up kernel page tables with vm_area_map_pages():
   vm_area_map_pages() may allocate memory while inserting pages into the
   bpf arena's vm_area. Now, all page table levels except the last are
   populated at arena creation time; when new pages need to be inserted,
   apply_to_page_range() is called again, which only does set_pte_at() for
   those pages and does not allocate memory (see the sketch below).

The above four changes make bpf_arena_alloc_pages() any context safe.
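
A minimal sketch of the resulting two-phase scheme (condensed from patch 1
below; apply_range_set_cb() treats a NULL data pointer as "pre-populate
only"):

	/* Pass 1, at arena creation (sleepable): allocates all intermediate
	 * page table levels; installs no PTEs because data == NULL.
	 */
	apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
			    KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);

	/* Pass 2, at allocation time (any context): the PTE pages already
	 * exist, so the callback only does set_pte_at() and allocates
	 * nothing.
	 */
	struct apply_range_data data = { .pages = pages, .i = 0 };

	apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
			    page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);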

bpf_arena_free_pages() has to do the following steps:

1. Update the range_tree.
2. vm_area_unmap_pages(): to unmap pages from the kernel vm_area.
3. Flush the TLB: already done by step 2.
4. zap_pages(): to unmap pages from user page tables.
5. Free the pages.

The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged, except for the following:
1. The rqspinlock is now taken instead of the mutex for the range tree.
2. Instead of vm_area_unmap_pages(), which can free intermediate page
   table levels, apply_to_existing_page_range() is used with a callback
   that only does pte_clear() on the last level and leaves the
   intermediate page table levels intact. This is needed to make sure that
   bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
   intermediate page tables.

When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, a lock-less list of
struct arena_free_span is used to queue the uaddr and page cnt.
kmalloc_nolock() is used to allocate the arena_free_span; this allocation
can fail, but that is the trade-off we accept for frees done from a
non-sleepable context.

arena_free_pages() then raises an irq_work whose handler in turn schedules
a work item that iterates this list and clears PTEs, flushes TLBs, zaps
pages, and frees pages for the queued uaddrs and page counts.

apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used for this.
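
The deferred-free machinery, condensed from patch 3 (the struct and the
queueing path are taken verbatim from the patch; surrounding error
handling is elided):

	struct arena_free_span {
		struct llist_node node;
		unsigned long uaddr;
		u32 page_cnt;
	};

	/* non-sleepable caller, or rqspinlock acquisition failed: */
	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
	if (!s)
		return;	/* the trade-off mentioned above: this free is dropped */
	s->page_cnt = page_cnt;
	s->uaddr = uaddr;
	llist_add(&s->node, &arena->free_spans);
	irq_work_queue(&arena->free_irq);	/* handler just schedule_work()s */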

The arena selftest fails to load on s390x. This is due to an unrelated bug
in the verifier that is exposed by the selftest added here; I have already
sent a patch[1] to fix it.


[1] https://lore.kernel.org/all/20251111160949.45623-1-puranjay@kernel.org/

Puranjay Mohan (4):
  bpf: arena: populate vm_area without allocating memory
  bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  bpf: arena: make arena kfuncs any context safe
  selftests: bpf: test non-sleepable arena allocations

 include/linux/bpf.h                           |   2 +
 kernel/bpf/arena.c                            | 290 +++++++++++++++---
 kernel/bpf/verifier.c                         |   5 +
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 +
 .../selftests/bpf/progs/verifier_arena.c      | 185 +++++++++++
 6 files changed, 472 insertions(+), 41 deletions(-)

-- 
2.47.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-11 16:34 [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
@ 2025-11-11 16:34 ` Puranjay Mohan
  2025-11-11 17:01   ` bot+bpf-ci
                     ` (3 more replies)
  2025-11-11 16:34 ` [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
                   ` (2 subsequent siblings)
  3 siblings, 4 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-11-11 16:34 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

vm_area_map_pages() may allocate memory while inserting pages into the bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc safe
for non-sleepable contexts, change bpf arena to populate pages without
allocating memory:
- at arena creation time populate all page table levels except
  the last level
- when new pages need to be inserted, call apply_to_page_range() again
  with apply_range_set_cb(), which only does set_pte_at() for those pages
  and does not allocate memory.
- when freeing pages, call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the pte for the page being removed. This
  doesn't free intermediate page table levels.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 74 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 68 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 1074ac4459f2..dd5100a2f93c 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -92,6 +92,63 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
 }
 
+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i++];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	return 0;
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct mm_struct *mm = &init_mm;
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	/* get page and free it */
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(mm, addr, pte);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	__free_page(page);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 {
 	struct vm_struct *kern_vm;
@@ -144,6 +201,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	err = populate_pgtable_except_pte(arena);
+	if (err)
+		goto err;
 
 	return &arena->map;
 err:
@@ -286,6 +346,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return VM_FAULT_SIGSEGV;
 
+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
@@ -293,7 +354,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 	}
 
-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		__free_page(page);
@@ -428,7 +489,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct page **pages = NULL;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -465,6 +526,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;
 
+	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
@@ -477,8 +539,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
 	 * lower 32-bit and it's ok.
 	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
+	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
 	if (ret) {
 		for (i = 0; i < page_cnt; i++)
 			__free_page(pages[i]);
@@ -545,8 +607,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
-		__free_page(page);
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, NULL);
 	}
 }
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-11-11 16:34 [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
@ 2025-11-11 16:34 ` Puranjay Mohan
  2025-11-11 17:01   ` bot+bpf-ci
  2025-11-11 16:34 ` [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
  2025-11-11 16:34 ` [PATCH bpf-next 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
  3 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-11-11 16:34 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

To make arena_alloc_pages() safe to call from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index dd5100a2f93c..9d8a8eb447fe 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -506,8 +506,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			return 0;
 	}
 
-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	pages = kmalloc_nolock(page_cnt * sizeof(struct page *), __GFP_ZERO, -1);
 	if (!pages)
 		return 0;
 
@@ -546,12 +545,12 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			__free_page(pages[i]);
 		goto out;
 	}
-	kvfree(pages);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff, page_cnt);
 out_free_pages:
-	kvfree(pages);
+	kfree_nolock(pages);
 	return 0;
 }
 
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-11 16:34 [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
  2025-11-11 16:34 ` [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
@ 2025-11-11 16:34 ` Puranjay Mohan
  2025-11-11 17:01   ` bot+bpf-ci
  2025-11-11 16:34 ` [PATCH bpf-next 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
  3 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2025-11-11 16:34 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():
Replace the mutex with a rqspinlock for the range tree and use
kmalloc_nolock() wherever needed. Use free_pages_nolock() to free pages
from any context.
apply_range_set/clear_cb() with apply_to_page_range() has already made
populating the vm_area in bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages(): defer the main logic to a workqueue if it is
called from a non-sleepable context.

specialize_kfunc() is used to replace the sleepable arena_free_pages()
with bpf_arena_free_pages_non_sleepable() when the verifier detects the
call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and the
page count to be freed on a lock-less list of struct arena_free_span and
raises an irq_work. The irq_work handler calls schedule_work(), as that is
safe to do from irq context. arena_free_worker() (the workqueue handler)
iterates these spans and clears PTEs, flushes the TLB, zaps pages, and
calls __free_page().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h   |   2 +
 kernel/bpf/arena.c    | 227 +++++++++++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c |   5 +
 3 files changed, 199 insertions(+), 35 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 09d5dc541d1c..5279212694b4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,8 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
 int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
 				   struct bpf_dynptr *ptr__uninit);
 
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+
 extern const struct bpf_map_ops bpf_map_offload_ops;
 
 /* bpf_type_flag contains a set of flags that are applicable to the values of
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 9d8a8eb447fe..f330b51ded7b 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -3,7 +3,9 @@
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/err.h>
+#include <linux/irq_work.h>
 #include "linux/filter.h"
+#include <linux/llist.h>
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
@@ -48,8 +50,23 @@ struct bpf_arena {
 	u64 user_vm_end;
 	struct vm_struct *kern_vm;
 	struct range_tree rt;
+	/* protects rt */
+	rqspinlock_t spinlock;
 	struct list_head vma_list;
+	/* protects vma_list */
 	struct mutex lock;
+	struct irq_work     free_irq;
+	struct work_struct  free_work;
+	struct llist_head   free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+	struct llist_node node;
+	unsigned long uaddr;
+	u32 page_cnt;
 };
 
 u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
@@ -117,7 +134,7 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
 	return 0;
 }
 
-static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
 {
 	struct mm_struct *mm = &init_mm;
 	pte_t old_pte;
@@ -128,17 +145,16 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
 	if (pte_none(old_pte) || !pte_present(old_pte))
 		return 0; /* nothing to do */
 
-	/* get page and free it */
+	/* get page and clear pte */
 	page = pte_page(old_pte);
 	if (WARN_ON_ONCE(!page))
 		return -EINVAL;
 
 	pte_clear(mm, addr, pte);
 
-	/* ensure no stale TLB entries */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
-	__free_page(page);
+	/* Add page to the list so it is freed later */
+	if (free_pages)
+		__llist_add(&page->pcp_llist, free_pages);
 
 	return 0;
 }
@@ -193,6 +209,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		arena->user_vm_end = arena->user_vm_start + vm_range;
 
 	INIT_LIST_HEAD(&arena->vma_list);
+	init_llist_head(&arena->free_spans);
+	init_irq_work(&arena->free_irq, arena_free_irq);
+	INIT_WORK(&arena->free_work, arena_free_worker);
 	bpf_map_init_from_attr(&arena->map, attr);
 	range_tree_init(&arena->rt);
 	err = range_tree_set(&arena->rt, 0, attr->max_entries);
@@ -201,6 +220,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	raw_res_spin_lock_init(&arena->spinlock);
 	err = populate_pgtable_except_pte(arena);
 	if (err)
 		goto err;
@@ -244,6 +264,10 @@ static void arena_map_free(struct bpf_map *map)
 	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
 		return;
 
+	/* Ensure no pending deferred frees */
+	irq_work_sync(&arena->free_irq);
+	flush_work(&arena->free_work);
+
 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
 	 * It unmaps everything from vmalloc area and clears pgtables.
@@ -327,12 +351,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
 	struct page *page;
 	long kbase, kaddr;
+	unsigned long flags;
 	int ret;
 
 	kbase = bpf_arena_get_kern_vm_start(arena);
 	kaddr = kbase + (u32)(vmf->address);
 
-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		/*
+		 * This is an impossible case and would only trigger if res_spin_lock is buggy or
+		 * due to another kernel bug.
+		 */
+		return VM_FAULT_RETRY;
+
 	page = vmalloc_to_page((void *)kaddr);
 	if (page)
 		/* already have a page vmap-ed */
@@ -344,26 +375,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 
 	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
 	if (ret)
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 
 	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 	}
 
 	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		__free_page(page);
-		return VM_FAULT_SIGSEGV;
+		free_pages_nolock(page, 0);
+		goto out_unlock_sigsegv;
 	}
 out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	page_ref_add(page, 1);
 	vmf->page = page;
 	return 0;
+out_unlock_sigsegv:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return VM_FAULT_SIGSEGV;
 }
 
 static const struct vm_operations_struct arena_vm_ops = {
@@ -490,6 +525,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
 	struct page **pages = NULL;
+	unsigned long flags;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -510,12 +546,13 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (!pages)
 		return 0;
 
-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		goto out_free_pages;
 
 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -523,7 +560,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
+		goto out_unlock_free_pages;
 
 	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
@@ -542,13 +579,16 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
 	if (ret) {
 		for (i = 0; i < page_cnt; i++)
-			__free_page(pages[i]);
+			free_pages_nolock(pages[i], 0);
 		goto out;
 	}
 	kfree_nolock(pages);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff, page_cnt);
+out_unlock_free_pages:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 out_free_pages:
 	kfree_nolock(pages);
 	return 0;
@@ -563,42 +603,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 {
 	struct vma_list *vml;
 
+	guard(mutex)(&arena->lock);
+	/* iterate link list under lock */
 	list_for_each_entry(vml, &arena->vma_list, head)
 		zap_page_range_single(vml->vma, uaddr,
 				      PAGE_SIZE * page_cnt, NULL);
 }
 
-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
 {
 	u64 full_uaddr, uaddr_end;
-	long kaddr, pgoff, i;
+	long kaddr, pgoff;
 	struct page *page;
+	struct llist_head free_pages;
+	struct llist_node *pos, *t;
+	struct arena_free_span *s;
+	unsigned long flags;
+	int ret = 0;
 
 	/* only aligned lower 32-bit are relevant */
 	uaddr = (u32)uaddr;
 	uaddr &= PAGE_MASK;
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
 	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
 	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
 	if (full_uaddr >= uaddr_end)
 		return;
 
 	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+	pgoff = compute_pgoff(arena, uaddr);
 
-	guard(mutex)(&arena->lock);
+	if (!sleepable)
+		goto defer;
+
+	ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+	/*
+	 * Can't proceed without holding the spinlock so defer the free
+	 */
+	if (ret)
+		goto defer;
 
-	pgoff = compute_pgoff(arena, uaddr);
-	/* clear range */
 	range_tree_set(&arena->rt, pgoff, page_cnt);
 
+	init_llist_head(&free_pages);
+	/* clear ptes and collect struct pages */
+	apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+				     apply_range_clear_cb, &free_pages);
+
+	/* drop the lock to do the tlb flush and zap pages */
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
 	if (page_cnt > 1)
 		/* bulk zap if multiple pages being freed */
 		zap_pages(arena, full_uaddr, page_cnt);
 
-	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
-	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
-		page = vmalloc_to_page((void *)kaddr);
-		if (!page)
-			continue;
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
 		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
 			/* Optimization for the common case of page_cnt==1:
 			 * If page wasn't mapped into some user vma there
@@ -606,9 +669,20 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
-					     apply_range_clear_cb, NULL);
+		__free_page(page);
 	}
+
+	return;
+
+defer:
+	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+	if (!s)
+		return;
+
+	s->page_cnt = page_cnt;
+	s->uaddr = uaddr;
+	llist_add(&s->node, &arena->free_spans);
+	irq_work_queue(&arena->free_irq);
 }
 
 /*
@@ -618,6 +692,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
 {
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	unsigned long flags;
 	long pgoff;
 	int ret;
 
@@ -628,15 +703,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt
 	if (pgoff + page_cnt > page_cnt_max)
 		return -EINVAL;
 
-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		return -EBUSY;
 
 	/* Cannot guard already allocated pages. */
 	ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
-	if (ret)
-		return -EBUSY;
+	if (ret) {
+		ret = -EBUSY;
+		goto out;
+	}
 
 	/* "Allocate" the region to prevent it from being allocated. */
-	return range_tree_clear(&arena->rt, pgoff, page_cnt);
+	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+	struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+	struct llist_node *list, *pos, *t;
+	struct arena_free_span *s;
+	u64 arena_vm_start, user_vm_start;
+	struct llist_head free_pages;
+	struct page *page;
+	unsigned long full_uaddr;
+	long kaddr, page_cnt, pgoff;
+	unsigned long flags;
+
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+		schedule_work(work);
+		return;
+	}
+
+	init_llist_head(&free_pages);
+	arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+	user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+	list = llist_del_all(&arena->free_spans);
+	llist_for_each(pos, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		kaddr = arena_vm_start + s->uaddr;
+		pgoff = compute_pgoff(arena, s->uaddr);
+
+		/* clear ptes and collect pages in free_pages llist */
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, &free_pages);
+
+		range_tree_set(&arena->rt, pgoff, page_cnt);
+	}
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+	llist_for_each_safe(pos, t, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		full_uaddr = user_vm_start + s->uaddr;
+		kaddr = arena_vm_start + s->uaddr;
+
+		/* ensure no stale TLB entries */
+		flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+		/* remove pages from user vmas */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+		kfree_nolock(s);
+	}
+
+	/* free all pages collected by apply_to_existing_page_range() in the first loop */
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
+		__free_page(page);
+	}
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+	struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+	schedule_work(&arena->free_work);
 }
 
 __bpf_kfunc_start_defs();
@@ -660,7 +807,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt
 
 	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
 		return;
-	arena_free_pages(arena, (long)ptr__ign, page_cnt);
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, true);
+}
+
+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt)
+{
+	struct bpf_map *map = p__map;
+	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
+
+	if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign)
+		return;
+	arena_free_pages(arena, (long)ptr__ign, page_cnt, false);
 }
 
 __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt)
@@ -679,9 +836,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(arena_kfuncs)
-BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
-BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
+BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2)
 BTF_KFUNCS_END(arena_kfuncs)
 
 static const struct btf_kfunc_id_set common_kfunc_set = {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 1268fa075d4c..407f75daa1cb 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -12319,6 +12319,7 @@ enum special_kfunc_type {
 	KF___bpf_trap,
 	KF_bpf_task_work_schedule_signal,
 	KF_bpf_task_work_schedule_resume,
+	KF_bpf_arena_free_pages,
 };
 
 BTF_ID_LIST(special_kfunc_list)
@@ -12393,6 +12394,7 @@ BTF_ID(func, bpf_dynptr_file_discard)
 BTF_ID(func, __bpf_trap)
 BTF_ID(func, bpf_task_work_schedule_signal)
 BTF_ID(func, bpf_task_work_schedule_resume)
+BTF_ID(func, bpf_arena_free_pages)
 
 static bool is_task_work_add_kfunc(u32 func_id)
 {
@@ -22350,6 +22352,9 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc
 	} else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) {
 		if (!env->insn_aux_data[insn_idx].non_sleepable)
 			addr = (unsigned long)bpf_dynptr_from_file_sleepable;
+	} else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) {
+		if (env->insn_aux_data[insn_idx].non_sleepable)
+			addr = (unsigned long)bpf_arena_free_pages_non_sleepable;
 	}
 
 set_imm:
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf-next 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-11 16:34 [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
                   ` (2 preceding siblings ...)
  2025-11-11 16:34 ` [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-11-11 16:34 ` Puranjay Mohan
  3 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2025-11-11 16:34 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

As arena kfuncs can now be called from non-sleepable contexts, test this
by adding non-sleepable copies of the tests in verifier_arena. This is
done by using a socket program instead of a syscall program.

Augment the arena_list selftest to also run in a non-sleepable context by
taking rcu_read_lock().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 ++
 .../selftests/bpf/progs/verifier_arena.c      | 185 ++++++++++++++++++
 3 files changed, 211 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
index d15867cddde0..4f2866a615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head)
 	return sum;
 }
 
-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
 {
 	LIBBPF_OPTS(bpf_test_run_opts, opts);
 	struct arena_list *skel;
 	int expected_sum = (u64)cnt * (cnt - 1) / 2;
 	int ret, sum;
 
-	skel = arena_list__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+	skel = arena_list__open();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open"))
 		return;
 
+	skel->rodata->nonsleepable = nonsleepable;
+
+	ret = arena_list__load(skel);
+	if (!ASSERT_OK(ret, "arena_list__load"))
+		goto out;
+
 	skel->bss->cnt = cnt;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
 	ASSERT_OK(ret, "ret_add");
@@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt)
 void test_arena_list(void)
 {
 	if (test__start_subtest("arena_list_1"))
-		test_arena_list_add_del(1);
+		test_arena_list_add_del(1, false);
 	if (test__start_subtest("arena_list_1000"))
-		test_arena_list_add_del(1000);
+		test_arena_list_add_del(1000, false);
+	if (test__start_subtest("arena_list_1_nonsleepable"))
+		test_arena_list_add_del(1, true);
+	if (test__start_subtest("arena_list_1000_nonsleepable"))
+		test_arena_list_add_del(1000, true);
 }
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
index 3a2ddcacbea6..235d8cc95bdd 100644
--- a/tools/testing/selftests/bpf/progs/arena_list.c
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head;
 int list_sum;
 int cnt;
 bool skip = false;
+const volatile bool nonsleepable = false;
 
 #ifdef __BPF_FEATURE_ADDR_SPACE_CAST
 long __arena arena_sum;
@@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1");
 
 int zero;
 
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
 SEC("syscall")
 int arena_list_add(void *ctx)
 {
@@ -71,6 +75,10 @@ int arena_list_del(void *ctx)
 	struct elem __arena *n;
 	int sum = 0;
 
+	/* Take rcu_read_lock to test non-sleepable context */
+	if (nonsleepable)
+		bpf_rcu_read_lock();
+
 	arena_sum = 0;
 	list_for_each_entry(n, list_head, node) {
 		sum += n->value;
@@ -79,6 +87,9 @@ int arena_list_del(void *ctx)
 		bpf_free(n);
 	}
 	list_sum = sum;
+
+	if (nonsleepable)
+		bpf_rcu_read_unlock();
 #else
 	skip = true;
 #endif
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
index 7f4827eede3c..4a9d96344813 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -21,6 +21,37 @@ struct {
 #endif
 } arena SEC(".maps");
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile int __arena *page1, *page2, *no_page;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+		return 7;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc1(void *ctx)
@@ -60,6 +91,44 @@ int basic_alloc1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0 && *page1 != 1)
+		return 5;
+	if (*page2 != 0 && *page2 != 2)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc2(void *ctx)
@@ -102,6 +171,19 @@ struct bpf_arena___l {
         struct bpf_map map;
 } __attribute__((preserve_access_index));
 
+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+	struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+	volatile char __arena *pages;
+
+	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+	if (!pages)
+		return 1;
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0) __log_level(2)
 int basic_alloc3(void *ctx)
@@ -115,6 +197,38 @@ int basic_alloc3(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page)
+		return 1;
+
+	page += __PAGE_SIZE;
+
+	/* Reserve the second page */
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 2;
+
+	/* Try to explicitly allocate the reserved page. */
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 3;
+
+	/* Try to implicitly allocate the page (since there's only 2 of them). */
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve1(void *ctx)
@@ -147,6 +261,26 @@ int basic_reserve1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if ((u64)page)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve2(void *ctx)
@@ -168,6 +302,27 @@ int basic_reserve2(void *ctx)
 }
 
 /* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret != -EBUSY)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_twice(void *ctx)
@@ -190,6 +345,36 @@ int reserve_twice(void *ctx)
 }
 
 /* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	/* Try a NULL pointer. */
+	ret = bpf_arena_reserve_pages(&arena, NULL, 3);
+	if (ret != -EINVAL)
+		return 1;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 3);
+	if (ret != -EINVAL)
+		return 2;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 4096);
+	if (ret != -EINVAL)
+		return 3;
+
+	ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1);
+	if (ret != -EINVAL)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_invalid_region(void *ctx)
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-11-11 16:34 ` [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
@ 2025-11-11 17:01   ` bot+bpf-ci
  2025-11-11 17:47     ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: bot+bpf-ci @ 2025-11-11 17:01 UTC (permalink / raw)
  To: puranjay, bpf
  Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1911 bytes --]

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index dd5100a2f..9d8a8eb44 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c
> @@ -506,8 +506,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  			return 0;
>  	}
>
> -	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> -	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> +	pages = kmalloc_nolock(page_cnt * sizeof(struct page *), __GFP_ZERO, -1);

Can kmalloc_nolock() handle the same allocation sizes as kvcalloc()?

Looking at arena_alloc_pages(), page_cnt can be up to the full arena
size (4GB / 4KB = 1M pages). This means the allocation size can reach
1M * 8 = 8MB for the pages array.

In mm/slub.c:5686, kmalloc_nolock() returns NULL for allocations larger
than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems
with 4KB pages.

kvcalloc() could handle larger allocations by falling back to vmalloc.
With kmalloc_nolock(), arena allocations requiring more than 1024 pages
will fail where they previously succeeded.

For example, arena_alloc_pages(arena, 0, 1025, node_id) allocates
1025 * 8 = 8200 bytes, which exceeds the 8KB limit and will return NULL.

>  	if (!pages)
>  		return 0;
>
[ ... ]

> @@ -546,12 +545,12 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  			__free_page(pages[i]);
>  		goto out;
>  	}
> -	kvfree(pages);
> +	kfree_nolock(pages);
>  	return clear_lo32(arena->user_vm_start) + uaddr32;
>  out:
>  	range_tree_set(&arena->rt, pgoff, page_cnt);
>  out_free_pages:
> -	kvfree(pages);
> +	kfree_nolock(pages);
>  	return 0;
>  }


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19272481461

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-11 16:34 ` [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-11-11 17:01   ` bot+bpf-ci
  2025-11-11 17:53     ` Alexei Starovoitov
  0 siblings, 1 reply; 13+ messages in thread
From: bot+bpf-ci @ 2025-11-11 17:01 UTC (permalink / raw)
  To: puranjay, bpf
  Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3159 bytes --]

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 09d5dc541..527921269 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h

[ ... ]

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 9d8a8eb44..f330b51de 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -327,12 +351,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
>  	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
>  	struct page *page;
>  	long kbase, kaddr;
> +	unsigned long flags;
>  	int ret;
>
>  	kbase = bpf_arena_get_kern_vm_start(arena);
>  	kaddr = kbase + (u32)(vmf->address);
>
> -	guard(mutex)(&arena->lock);
> +	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
> +		/*
> +		 * This is an impossible case and would only trigger if res_spin_lock is buggy or
> +		 * due to another kernel bug.
> +		 */
> +		return VM_FAULT_RETRY;
> +
>  	page = vmalloc_to_page((void *)kaddr);
>  	if (page)
>  		/* already have a page vmap-ed */
> @@ -344,26 +375,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
>
>  	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
>  	if (ret)
> -		return VM_FAULT_SIGSEGV;
> +		goto out_unlock_sigsegv;
>
>  	struct apply_range_data data = { .pages = &page, .i = 0 };
>  	/* Account into memcg of the process that created bpf_arena */
>  	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
>  	if (ret) {
>  		range_tree_set(&arena->rt, vmf->pgoff, 1);
> -		return VM_FAULT_SIGSEGV;
> +		goto out_unlock_sigsegv;
>  	}
>
>  	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
>  	if (ret) {
>  		range_tree_set(&arena->rt, vmf->pgoff, 1);
> -		__free_page(page);
> -		return VM_FAULT_SIGSEGV;
> +		free_pages_nolock(page, 0);
> +		goto out_unlock_sigsegv;
>  	}
>  out:
> +	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
>  	page_ref_add(page, 1);
>  	vmf->page = page;
>  	return 0;
> +out_unlock_sigsegv:
> +	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> +	return VM_FAULT_SIGSEGV;
>  }

Can the page reference race with arena_free_pages()? The sequence:

  Thread A (arena_vm_fault):
    bpf_map_alloc_pages() -> page refcount=1
    apply_to_page_range() -> maps PTE
    raw_res_spin_unlock_irqrestore()
    page_ref_add(page, 1) -> tries to increment to 2

  Thread B (arena_free_pages):
    raw_res_spin_lock_irqsave()
    apply_to_existing_page_range()->apply_range_clear_cb() -> clears PTE
    raw_res_spin_unlock_irqrestore()
    __free_page(page) -> refcount 1->0, page freed

Between Thread A's unlock and page_ref_add(), Thread B can acquire the
lock, clear the PTE, and free the page. Then Thread A's page_ref_add()
operates on freed memory.

Should page_ref_add() happen before releasing the spinlock, or should
the initial allocation start with refcount=2?

[ ... ]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c

[ ... ]


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19272481461

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
@ 2025-11-11 17:01   ` bot+bpf-ci
  2025-11-13  4:49   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: bot+bpf-ci @ 2025-11-11 17:01 UTC (permalink / raw)
  To: puranjay, bpf
  Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 5772 bytes --]

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1074ac445..dd5100a2f 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -92,6 +92,63 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
>  	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
>  }
>
> +struct apply_range_data {
> +	struct page **pages;
> +	int i;
> +};
> +
> +static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
> +{
> +	struct apply_range_data *d = data;
> +	struct page *page;
> +
> +	if (!data)
> +		return 0;
> +	/* sanity check */
> +	if (unlikely(!pte_none(ptep_get(pte))))
> +		return -EBUSY;
> +
> +	page = d->pages[d->i++];
> +	/* paranoia, similar to vmap_pages_pte_range() */
> +	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
> +		return -EINVAL;
> +
> +	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
> +	return 0;
> +}
> +
> +static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
> +{
> +	struct mm_struct *mm = &init_mm;
> +	pte_t old_pte;
> +	struct page *page;
> +
> +	/* sanity check */
> +	old_pte = ptep_get(pte);
> +	if (pte_none(old_pte) || !pte_present(old_pte))
> +		return 0; /* nothing to do */
> +
> +	/* get page and free it */
> +	page = pte_page(old_pte);
> +	if (WARN_ON_ONCE(!page))
> +		return -EINVAL;
> +
> +	pte_clear(mm, addr, pte);
> +
> +	/* ensure no stale TLB entries */
> +	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> +
> +	__free_page(page);
> +
> +	return 0;
> +}
> +
> +static int populate_pgtable_except_pte(struct bpf_arena *arena)
> +{
> +	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
> +				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
> +}
> +
>  static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>  {
>  	struct vm_struct *kern_vm;

[ ... ]

> @@ -144,6 +201,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>  		goto err;
>  	}
>  	mutex_init(&arena->lock);
> +	err = populate_pgtable_except_pte(arena);
> +	if (err)
> +		goto err;

Does this leak the arena and range_tree on error? The error path at the
range_tree_set() failure (above these lines) calls bpf_map_area_free(arena)
before goto err, but this new error path doesn't clean up arena or call
range_tree_destroy() before jumping to err.

>
>  	return &arena->map;
>  err:

[ ... ]

> @@ -286,6 +346,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
>  	if (ret)
>  		return VM_FAULT_SIGSEGV;
>
> +	struct apply_range_data data = { .pages = &page, .i = 0 };
>  	/* Account into memcg of the process that created bpf_arena */
>  	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
>  	if (ret) {
>  		range_tree_set(&arena->rt, vmf->pgoff, 1);
>  		return VM_FAULT_SIGSEGV;
>  	}
>
> -	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
> +	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
>  	if (ret) {
>  		range_tree_set(&arena->rt, vmf->pgoff, 1);
>  		__free_page(page);

[ ... ]

> @@ -428,7 +489,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  	/* user_vm_end/start are fixed before bpf prog runs */
>  	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
>  	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
> -	struct page **pages;
> +	struct page **pages = NULL;
>  	long pgoff = 0;
>  	u32 uaddr32;
>  	int ret, i;

[ ... ]

> @@ -465,6 +526,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  	if (ret)
>  		goto out_free_pages;
>
> +	struct apply_range_data data = { .pages = pages, .i = 0 };
>  	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
>  	if (ret)
>  		goto out;
> @@ -477,8 +539,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
>  	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
>  	 * lower 32-bit and it's ok.
>  	 */
> -	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
> -				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
> +	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
> +				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
>  	if (ret) {
>  		for (i = 0; i < page_cnt; i++)
>  			__free_page(pages[i]);

[ ... ]

> @@ -545,8 +607,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
>  			 * page_cnt is big it's faster to do the batched zap.
>  			 */
>  			zap_pages(arena, full_uaddr, 1);
> -		vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
> -		__free_page(page);
> +		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
> +					     apply_range_clear_cb, NULL);

Can this free pages multiple times? This is inside a loop that iterates
page_cnt times, with kaddr incrementing by PAGE_SIZE each iteration. But
apply_to_existing_page_range() is called with size page_cnt << PAGE_SHIFT,
which processes page_cnt pages starting from kaddr.

On the first iteration (i=0), it processes pages [0..page_cnt-1].
On the second iteration (i=1, kaddr+=PAGE_SIZE), it processes pages
[1..page_cnt], overlapping with the first iteration.

Since apply_range_clear_cb() calls __free_page() for each page, page 0
gets freed page_cnt times, page 1 gets freed page_cnt-1 times, etc.

The original code called vm_area_unmap_pages() with size PAGE_SIZE,
processing one page per iteration. Should the new code use PAGE_SIZE
instead of page_cnt << PAGE_SHIFT?

>  	}
>  }
>


---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19272481461

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
  2025-11-11 17:01   ` bot+bpf-ci
@ 2025-11-11 17:47     ` Alexei Starovoitov
  0 siblings, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2025-11-11 17:47 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: Puranjay Mohan, bpf, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard,
	Kumar Kartikeya Dwivedi, Kernel Team, Yonghong Song, Chris Mason,
	Ihor Solodrai

On Tue, Nov 11, 2025 at 9:01 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> > index dd5100a2f..9d8a8eb44 100644
> > --- a/kernel/bpf/arena.c
> > +++ b/kernel/bpf/arena.c
> > @@ -506,8 +506,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
> >                       return 0;
> >       }
> >
> > -     /* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> > -     pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
> > +     pages = kmalloc_nolock(page_cnt * sizeof(struct page *), __GFP_ZERO, -1);
>
> Can kmalloc_nolock() handle the same allocation sizes as kvcalloc()?
>
> Looking at arena_alloc_pages(), page_cnt can be up to the full arena
> size (4GB / 4KB = 1M pages). This means the allocation size can reach
> 1M * 8 = 8MB for the pages array.
>
> In mm/slub.c:5686, kmalloc_nolock() returns NULL for allocations larger
> than KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems
> with 4KB pages.
>
> kvcalloc() could handle larger allocations by falling back to vmalloc.
> With kmalloc_nolock(), arena allocations requiring more than 1024 pages
> will fail where they previously succeeded.

sched-ext as a main user of arena allocates 512 pages at a time,
but AI has a good point. 1024 is too limiting.
Let's round down this kmalloc_nolock() to 1024 * 8 and add
an explicit loop to reuse this array while allocating all of page_cnt
pages. 1024 pages at a time.
Error path will be a bit more complex, since we would
need to call arena_free_pages(), since some pages might be
already populated and active.
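
Something along these lines (untested sketch, unwind path elided, the
chunk macro name is illustrative):

	#define ARENA_ALLOC_CHUNK 1024
	long chunk, done = 0;

	pages = kmalloc_nolock(min_t(long, page_cnt, ARENA_ALLOC_CHUNK) *
			       sizeof(struct page *), __GFP_ZERO, -1);
	if (!pages)
		return 0;

	while (done < page_cnt) {
		chunk = min_t(long, page_cnt - done, ARENA_ALLOC_CHUNK);
		ret = bpf_map_alloc_pages(&arena->map, node_id, chunk, pages);
		if (ret)
			goto undo;	/* arena_free_pages() the 'done' prefix */
		/* map this chunk with apply_to_page_range(), then re-zero
		 * the array before reuse: alloc_pages_bulk() skips
		 * non-NULL entries.
		 */
		memset(pages, 0, chunk * sizeof(struct page *));
		done += chunk;
	}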

pw-bot: cr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-11 17:01   ` bot+bpf-ci
@ 2025-11-11 17:53     ` Alexei Starovoitov
  0 siblings, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2025-11-11 17:53 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: Puranjay Mohan, bpf, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard,
	Kumar Kartikeya Dwivedi, Kernel Team, Yonghong Song, Chris Mason,
	Ihor Solodrai

On Tue, Nov 11, 2025 at 9:01 AM <bot+bpf-ci@kernel.org> wrote:
>
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 09d5dc541..527921269 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
>
> [ ... ]
>
> > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> > index 9d8a8eb44..f330b51de 100644
> > --- a/kernel/bpf/arena.c
> > +++ b/kernel/bpf/arena.c
>
> [ ... ]
>
> > @@ -327,12 +351,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> >       struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
> >       struct page *page;
> >       long kbase, kaddr;
> > +     unsigned long flags;
> >       int ret;
> >
> >       kbase = bpf_arena_get_kern_vm_start(arena);
> >       kaddr = kbase + (u32)(vmf->address);
> >
> > -     guard(mutex)(&arena->lock);
> > +     if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
> > +             /*
> > +              * This is an impossible case and would only trigger if res_spin_lock is buggy or
> > +              * due to another kernel bug.
> > +              */
> > +             return VM_FAULT_RETRY;
> > +
> >       page = vmalloc_to_page((void *)kaddr);
> >       if (page)
> >               /* already have a page vmap-ed */
> > @@ -344,26 +375,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
> >
> >       ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
> >       if (ret)
> > -             return VM_FAULT_SIGSEGV;
> > +             goto out_unlock_sigsegv;
> >
> >       struct apply_range_data data = { .pages = &page, .i = 0 };
> >       /* Account into memcg of the process that created bpf_arena */
> >       ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
> >       if (ret) {
> >               range_tree_set(&arena->rt, vmf->pgoff, 1);
> > -             return VM_FAULT_SIGSEGV;
> > +             goto out_unlock_sigsegv;
> >       }
> >
> >       ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
> >       if (ret) {
> >               range_tree_set(&arena->rt, vmf->pgoff, 1);
> > -             __free_page(page);
> > -             return VM_FAULT_SIGSEGV;
> > +             free_pages_nolock(page, 0);
> > +             goto out_unlock_sigsegv;
> >       }
> >  out:
> > +     raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> >       page_ref_add(page, 1);
> >       vmf->page = page;
> >       return 0;
> > +out_unlock_sigsegv:
> > +     raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> > +     return VM_FAULT_SIGSEGV;
> >  }
>
> Can the page reference race with arena_free_pages()? The sequence:
>
>   Thread A (arena_vm_fault):
>     bpf_map_alloc_pages() -> page refcount=1
>     apply_to_page_range() -> maps PTE
>     raw_res_spin_unlock_irqrestore()
>     page_ref_add(page, 1) -> tries to increment to 2
>
>   Thread B (arena_free_pages):
>     raw_res_spin_lock_irqsave()
>     apply_to_existing_page_range()->apply_range_clear_cb() -> clears PTE
>     raw_res_spin_unlock_irqrestore()
>     __free_page(page) -> refcount 1->0, page freed
>
> Between Thread A's unlock and page_ref_add(), Thread B can acquire the
> lock, clear the PTE, and free the page. Then Thread A's page_ref_add()
> operates on freed memory.
>
> Should page_ref_add() happen before releasing the spinlock?

The AI has a point: page_ref_add() should happen while the lock is
still held.
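i.e. something like this (sketch against the quoted hunk, untested):

	 out:
	+	page_ref_add(page, 1);
	 	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
	-	page_ref_add(page, 1);
	 	vmf->page = page;
	 	return 0;

Taking the reference while the PTE is still protected by the lock
closes the window; assigning vmf->page after the unlock is fine once
the extra reference is held.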

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
  2025-11-11 17:01   ` bot+bpf-ci
@ 2025-11-13  4:49   ` kernel test robot
  2025-11-13  4:51   ` kernel test robot
  2025-11-13  4:52   ` kernel test robot
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2025-11-13  4:49 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: llvm, oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Hi Puranjay,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251112-004253
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20251111163424.16471-2-puranjay%40kernel.org
patch subject: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
config: loongarch-defconfig (https://download.01.org/0day-ci/archive/20251112/202511122229.mivV7opC-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251112/202511122229.mivV7opC-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511122229.mivV7opC-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/bpf/arena.c:139:2: error: call to undeclared function 'flush_tlb_kernel_range'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     139 |         flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
         |         ^
   1 error generated.


vim +/flush_tlb_kernel_range +139 kernel/bpf/arena.c

   119	
   120	static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
   121	{
   122		struct mm_struct *mm = &init_mm;
   123		pte_t old_pte;
   124		struct page *page;
   125	
   126		/* sanity check */
   127		old_pte = ptep_get(pte);
   128		if (pte_none(old_pte) || !pte_present(old_pte))
   129			return 0; /* nothing to do */
   130	
   131		/* get page and free it */
   132		page = pte_page(old_pte);
   133		if (WARN_ON_ONCE(!page))
   134			return -EINVAL;
   135	
   136		pte_clear(mm, addr, pte);
   137	
   138		/* ensure no stale TLB entries */
 > 139		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
   140	
   141		__free_page(page);
   142	
   143		return 0;
   144	}
   145	
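Presumably this just needs the missing include: flush_tlb_kernel_range()
is declared in <asm/tlbflush.h> on the architectures I checked
(apparently pulled in indirectly on x86, which would explain why only
some configs break). A likely fix, untested:

	/* declare flush_tlb_kernel_range() */
	#include <asm/tlbflush.h>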

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
  2025-11-11 17:01   ` bot+bpf-ci
  2025-11-13  4:49   ` kernel test robot
@ 2025-11-13  4:51   ` kernel test robot
  2025-11-13  4:52   ` kernel test robot
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2025-11-13  4:51 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Hi Puranjay,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251112-004253
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20251111163424.16471-2-puranjay%40kernel.org
patch subject: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
config: arm64-randconfig-003-20251112 (https://download.01.org/0day-ci/archive/20251113/202511130329.jSq8tSKf-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 8.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251113/202511130329.jSq8tSKf-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511130329.jSq8tSKf-lkp@intel.com/

All warnings (new ones prefixed by >>):

   kernel/bpf/arena.c: In function 'apply_range_clear_cb':
>> kernel/bpf/arena.c:122:20: warning: unused variable 'mm' [-Wunused-variable]
     struct mm_struct *mm = &init_mm;
                       ^~


vim +/mm +122 kernel/bpf/arena.c

   119	
   120	static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
   121	{
 > 122		struct mm_struct *mm = &init_mm;
   123		pte_t old_pte;
   124		struct page *page;
   125	
   126		/* sanity check */
   127		old_pte = ptep_get(pte);
   128		if (pte_none(old_pte) || !pte_present(old_pte))
   129			return 0; /* nothing to do */
   130	
   131		/* get page and free it */
   132		page = pte_page(old_pte);
   133		if (WARN_ON_ONCE(!page))
   134			return -EINVAL;
   135	
   136		pte_clear(mm, addr, pte);
   137	
   138		/* ensure no stale TLB entries */
   139		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
   140	
   141		__free_page(page);
   142	
   143		return 0;
   144	}
   145	
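The warning shows up because pte_clear() is a macro that does not
evaluate its mm argument on some configurations, leaving the local
unused. One possible fix (untested) is to drop the local entirely:

	/* pass init_mm directly so configs where pte_clear() ignores
	 * its mm argument don't see an unused variable
	 */
	pte_clear(&init_mm, addr, pte);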

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
  2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
                     ` (2 preceding siblings ...)
  2025-11-13  4:51   ` kernel test robot
@ 2025-11-13  4:52   ` kernel test robot
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2025-11-13  4:52 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Hi Puranjay,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251112-004253
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20251111163424.16471-2-puranjay%40kernel.org
patch subject: [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory
config: loongarch-randconfig-002-20251112 (https://download.01.org/0day-ci/archive/20251112/202511122020.eyONeHrW-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 13.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251112/202511122020.eyONeHrW-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511122020.eyONeHrW-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/arena.c: In function 'apply_range_clear_cb':
>> kernel/bpf/arena.c:139:9: error: implicit declaration of function 'flush_tlb_kernel_range' [-Werror=implicit-function-declaration]
     139 |         flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
         |         ^~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/flush_tlb_kernel_range +139 kernel/bpf/arena.c

   119	
   120	static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
   121	{
   122		struct mm_struct *mm = &init_mm;
   123		pte_t old_pte;
   124		struct page *page;
   125	
   126		/* sanity check */
   127		old_pte = ptep_get(pte);
   128		if (pte_none(old_pte) || !pte_present(old_pte))
   129			return 0; /* nothing to do */
   130	
   131		/* get page and free it */
   132		page = pte_page(old_pte);
   133		if (WARN_ON_ONCE(!page))
   134			return -EINVAL;
   135	
   136		pte_clear(mm, addr, pte);
   137	
   138		/* ensure no stale TLB entries */
 > 139		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
   140	
   141		__free_page(page);
   142	
   143		return 0;
   144	}
   145	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-11-13  4:53 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-11 16:34 [PATCH bpf-next 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-11-11 16:34 ` [PATCH bpf-next 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-11-11 17:01   ` bot+bpf-ci
2025-11-13  4:49   ` kernel test robot
2025-11-13  4:51   ` kernel test robot
2025-11-13  4:52   ` kernel test robot
2025-11-11 16:34 ` [PATCH bpf-next 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-11-11 17:01   ` bot+bpf-ci
2025-11-11 17:47     ` Alexei Starovoitov
2025-11-11 16:34 ` [PATCH bpf-next 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-11-11 17:01   ` bot+bpf-ci
2025-11-11 17:53     ` Alexei Starovoitov
2025-11-11 16:34 ` [PATCH bpf-next 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox