* [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
  - Import tlbflush.h to fix a build issue on loongarch. (kernel
    test robot)
  - Fix an unused variable error in apply_range_clear_cb(). (kernel
    test robot)
  - Call bpf_map_area_free() on the error path of
    populate_pgtable_except_pte(). (AI)
  - Use PAGE_SIZE in apply_to_existing_page_range(). (AI)
Patch 2:
  - Cap the allocation made by kmalloc_nolock() for the pages array at
    KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop to
    overcome this limit. (AI)
Patch 3:
  - Do page_ref_add(page, 1) under the spinlock to mitigate a
    race. (AI)
Patch 4:
  - Add a new test case, big_alloc3() in verifier_arena_large.c, that
    tries to allocate a large number of pages at once; this triggers
    the kmalloc_nolock() limit from Patch 2 and checks that the loop
    logic works correctly.
This set allows arena kfuncs to be called from non-sleepable contexts.
This is achieved by the following changes:

The range_tree is now protected by a rqspinlock instead of a mutex;
this change alone is enough to make bpf_arena_reserve_pages() safe to
call from any context.
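
To illustrate, the reserve path ends up with roughly the following
shape (condensed from patch 3 below; uaddr validation elided). Note
that, unlike a mutex, raw_res_spin_lock_irqsave() can fail instead of
blocking, so callers must handle that:

	static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
	{
		long pgoff = compute_pgoff(arena, uaddr);
		unsigned long flags;
		int ret;

		/* rqspinlock can fail (deadlock/timeout detection) instead of sleeping */
		if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
			return -EBUSY;

		/* Cannot guard already allocated pages. */
		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
		if (ret) {
			ret = -EBUSY;
			goto out;
		}

		/* "Allocate" the region to prevent it from being allocated. */
		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
	out:
		raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
		return ret;
	}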
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into the
bpf arena's vm_area. Now, at arena creation time, all page table levels
except the last one are populated; when new pages need to be inserted,
apply_to_page_range() is called again, which only does set_pte_at() for
those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any context safe.
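
In code, the only difference between the two phases is the data pointer
handed to apply_to_page_range() (condensed from patch 1 below;
apply_range_set_cb() treats a NULL data pointer as "build page tables
only, write no PTEs"):

	/* at arena creation: allocate all pgd..pmd levels, write no PTEs */
	apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
			    KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);

	/* at alloc time: page tables already exist, only set_pte_at() runs */
	struct apply_range_data data = { .pages = pages, .i = 0 };

	apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
			    page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);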
bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. Flush the TLB: already done as part of step 2.
4. zap_pages(): to unmap pages from user page tables
5. free pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged, except for the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
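
The resulting callback is deliberately minimal (condensed from patch 3
below): it only clears the PTE and stashes the page on a lock-less
list; the TLB flush and the actual __free_page() happen later, outside
the lock:

	static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
	{
		pte_t old_pte = ptep_get(pte);
		struct page *page;

		if (pte_none(old_pte) || !pte_present(old_pte))
			return 0; /* nothing to do */

		page = pte_page(old_pte);
		pte_clear(&init_mm, addr, pte);

		/* collect the page; it is freed after the deferred TLB flush */
		if (free_pages)
			__llist_add(&page->pcp_llist, free_pages);
		return 0;
	}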
When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, a lock-less list of
struct arena_free_span is used to queue the uaddr and page count.
kmalloc_nolock() is used to allocate the arena_free_span; this allocation
can fail, but that is the trade-off we make for frees done from
non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddrs and page counts.
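
The deferred path itself boils down to a short queue-and-kick sequence
(condensed from patch 3 below; the irq_work handler only calls
schedule_work()):

	struct arena_free_span {
		struct llist_node node;
		unsigned long uaddr;
		u32 page_cnt;
	};

	/* non-sleepable free: record the span and let the worker do the rest */
	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
	if (!s)
		return; /* the trade-off mentioned above: this free is dropped */

	s->page_cnt = page_cnt;
	s->uaddr = uaddr;
	llist_add(&s->node, &arena->free_spans);
	irq_work_queue(&arena->free_irq);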
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist member of struct page is used to link them.
NOTE: The arena list selftest fails to load on s390x. This is due to an
unrelated verifier bug that is exposed by the selftest added in this
set. I have already sent a patch[1] to fix it.
[1] https://lore.kernel.org/all/20251111160949.45623-1-puranjay@kernel.org/
Puranjay Mohan (4):
bpf: arena: populate vm_area without allocating memory
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
bpf: arena: make arena kfuncs any context safe
selftests: bpf: test non-sleepable arena allocations
include/linux/bpf.h | 2 +
kernel/bpf/arena.c | 350 +++++++++++++++---
kernel/bpf/verifier.c | 5 +
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 +
.../selftests/bpf/progs/verifier_arena.c | 185 +++++++++
.../bpf/progs/verifier_arena_large.c | 24 ++
7 files changed, 541 insertions(+), 56 deletions(-)
--
2.47.1

* [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

vm_area_map_pages() may allocate memory while inserting pages into bpf
arena's vm_area. In order to make the bpf_arena_alloc_pages() kfunc
non-sleepable, change bpf arena to populate pages without allocating
memory:

- at arena creation time, populate all page table levels except the
  last level
- when new pages need to be inserted, call apply_to_page_range() again
  with apply_range_set_cb(), which will only set_pte_at() those pages
  and will not allocate memory
- when freeing pages, call apply_to_existing_page_range() with
  apply_range_clear_cb() to clear the pte for the page to be removed.
  This doesn't free intermediate page table levels.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 76 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 70 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 1074ac4459f2..48b8ffba3c88 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -7,6 +7,7 @@
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
+#include <asm/tlbflush.h>
 #include "range_tree.h"

 /*
@@ -92,6 +93,62 @@ static long compute_pgoff(struct bpf_arena *arena, long uaddr)
 	return (u32)(uaddr - (u32)arena->user_vm_start) >> PAGE_SHIFT;
 }

+struct apply_range_data {
+	struct page **pages;
+	int i;
+};
+
+static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	struct apply_range_data *d = data;
+	struct page *page;
+
+	if (!data)
+		return 0;
+	/* sanity check */
+	if (unlikely(!pte_none(ptep_get(pte))))
+		return -EBUSY;
+
+	page = d->pages[d->i++];
+	/* paranoia, similar to vmap_pages_pte_range() */
+	if (WARN_ON_ONCE(!pfn_valid(page_to_pfn(page))))
+		return -EINVAL;
+
+	set_pte_at(&init_mm, addr, pte, mk_pte(page, PAGE_KERNEL));
+	return 0;
+}
+
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+{
+	pte_t old_pte;
+	struct page *page;
+
+	/* sanity check */
+	old_pte = ptep_get(pte);
+	if (pte_none(old_pte) || !pte_present(old_pte))
+		return 0; /* nothing to do */
+
+	/* get page and free it */
+	page = pte_page(old_pte);
+	if (WARN_ON_ONCE(!page))
+		return -EINVAL;
+
+	pte_clear(&init_mm, addr, pte);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+	__free_page(page);
+
+	return 0;
+}
+
+static int populate_pgtable_except_pte(struct bpf_arena *arena)
+{
+	return apply_to_page_range(&init_mm, bpf_arena_get_kern_vm_start(arena),
+				   KERN_VM_SZ - GUARD_SZ, apply_range_set_cb, NULL);
+}
+
 static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 {
 	struct vm_struct *kern_vm;
@@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	err = populate_pgtable_except_pte(arena);
+	if (err) {
+		bpf_map_area_free(arena);
+		goto err;
+	}

 	return &arena->map;
 err:
@@ -286,6 +348,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	if (ret)
 		return VM_FAULT_SIGSEGV;

+	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
@@ -293,7 +356,7 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGSEGV;
 	}

-	ret = vm_area_map_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE, &page);
+	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
 		__free_page(page);
@@ -428,7 +491,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
-	struct page **pages;
+	struct page **pages = NULL;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -465,6 +528,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;

+	struct apply_range_data data = { .pages = pages, .i = 0 };
 	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
 	if (ret)
 		goto out;
@@ -477,8 +541,8 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
 	 * lower 32-bit and it's ok.
 	 */
-	ret = vm_area_map_pages(arena->kern_vm, kern_vm_start + uaddr32,
-				kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE, pages);
+	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
+				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
 	if (ret) {
 		for (i = 0; i < page_cnt; i++)
 			__free_page(pages[i]);
@@ -545,8 +609,8 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 		 * page_cnt is big it's faster to do the batched zap.
 		 */
 		zap_pages(arena, full_uaddr, 1);
-	vm_area_unmap_pages(arena->kern_vm, kaddr, kaddr + PAGE_SIZE);
-	__free_page(page);
+	apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
+				     NULL);
 	}
 }
--
2.47.1

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: bot+bpf-ci @ 2025-11-14 11:47 UTC
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 1074ac445..48b8ffba3 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>  		goto err;
>  	}
>  	mutex_init(&arena->lock);
> +	err = populate_pgtable_except_pte(arena);
> +	if (err) {
> +		bpf_map_area_free(arena);
> +		goto err;
> +	}
        ^^^^

Can this leak the range tree? Looking at the code, range_tree_set() was
called earlier in this function and can allocate range_node structures
via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
fails here, the error path calls bpf_map_area_free(arena) but doesn't call
range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
always calls range_tree_destroy() before freeing the arena.

>
>  	return &arena->map;
>  err:

[ ... ]

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-14 14:57 UTC
To: bot+bpf-ci, bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team,
	yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

>> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
>> index 1074ac445..48b8ffba3 100644
>> --- a/kernel/bpf/arena.c
>> +++ b/kernel/bpf/arena.c
>
> [ ... ]
>
>> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>>  		goto err;
>>  	}
>>  	mutex_init(&arena->lock);
>> +	err = populate_pgtable_except_pte(arena);
>> +	if (err) {
>> +		bpf_map_area_free(arena);
>> +		goto err;
>> +	}
> ^^^^
>
> Can this leak the range tree? Looking at the code, range_tree_set() was
> called earlier in this function and can allocate range_node structures
> via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> always calls range_tree_destroy() before freeing the arena.

As the range tree is empty at this point, we can be sure that
range_tree_clear() in range_tree_set() will not allocate anything.

> [ ... ]

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Alexei Starovoitov @ 2025-11-14 21:21 UTC
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> bot+bpf-ci@kernel.org writes:
>
> [ ... ]
>
> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> >>  		goto err;
> >>  	}
> >>  	mutex_init(&arena->lock);
> >> +	err = populate_pgtable_except_pte(arena);
> >> +	if (err) {
> >> +		bpf_map_area_free(arena);
> >> +		goto err;
> >> +	}
> > ^^^^
> >
> > Can this leak the range tree? Looking at the code, range_tree_set() was
> > called earlier in this function and can allocate range_node structures
> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> > always calls range_tree_destroy() before freeing the arena.
>
> As the range tree is empty at this point, we can be sure that
> range_tree_clear() in range_tree_set() will not allocate anything.

range_tree_clear() won't clear anything, but AI pointed in
the right direction.
Look at what range_tree_set() does. It will allocate for sure.

pw-bot: cr

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Puranjay Mohan @ 2025-11-15  0:52 UTC
To: Alexei Starovoitov
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> bot+bpf-ci@kernel.org writes:
>>
>> [ ... ]
>>
>> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
>> >>  		goto err;
>> >>  	}
>> >>  	mutex_init(&arena->lock);
>> >> +	err = populate_pgtable_except_pte(arena);
>> >> +	if (err) {
>> >> +		bpf_map_area_free(arena);
>> >> +		goto err;
>> >> +	}
>> > ^^^^
>> >
>> > Can this leak the range tree? Looking at the code, range_tree_set() was
>> > called earlier in this function and can allocate range_node structures
>> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
>> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
>> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
>> > always calls range_tree_destroy() before freeing the arena.
>>
>> As the range tree is empty at this point, we can be sure that
>> range_tree_clear() in range_tree_set() will not allocate anything.
>
> range_tree_clear() won't clear anything, but AI pointed in
> the right direction.
> Look at what range_tree_set() does. It will allocate for sure.

If I am understanding it correctly, range_tree_set() allocates memory
using kmalloc_nolock() and it fails when this allocation fails, so in
the error path we don't need to do anything as no allocation was
successful.

* Re: [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory
From: Alexei Starovoitov @ 2025-11-15  1:26 UTC
To: Puranjay Mohan
Cc: bot+bpf-ci, bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard, Kumar Kartikeya Dwivedi, Kernel Team,
	Yonghong Song, Chris Mason, Ihor Solodrai

On Fri, Nov 14, 2025 at 4:52 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Nov 14, 2025 at 6:57 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >> bot+bpf-ci@kernel.org writes:
> >>
> >> [ ... ]
> >>
> >> >> @@ -144,6 +201,11 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
> >> >>  		goto err;
> >> >>  	}
> >> >>  	mutex_init(&arena->lock);
> >> >> +	err = populate_pgtable_except_pte(arena);
> >> >> +	if (err) {
> >> >> +		bpf_map_area_free(arena);
> >> >> +		goto err;
> >> >> +	}
> >> > ^^^^
> >> >
> >> > Can this leak the range tree? Looking at the code, range_tree_set() was
> >> > called earlier in this function and can allocate range_node structures
> >> > via range_tree_clear()->kmalloc_nolock(). If populate_pgtable_except_pte()
> >> > fails here, the error path calls bpf_map_area_free(arena) but doesn't call
> >> > range_tree_destroy(&arena->rt) first. Compare with arena_map_free() which
> >> > always calls range_tree_destroy() before freeing the arena.
> >>
> >> As the range tree is empty at this point, we can be sure that
> >> range_tree_clear() in range_tree_set() will not allocate anything.
> >
> > range_tree_clear() won't clear anything, but AI pointed in
> > the right direction.
> > Look at what range_tree_set() does. It will allocate for sure.
>
> If I am understanding it correctly, range_tree_set() allocates memory
> using kmalloc_nolock() and it fails when this allocation fails, so in
> the error path we don't need to do anything as no allocation was successful.

Not following. Why would kmalloc_nolock() inside range tree fail?

range_tree_set() will allocate memory and above hunk after
failed populate_pgtable_except_pte() will leak it.
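
(A minimal sketch of the error path this exchange converges on, assuming
range_tree_destroy() is the right teardown helper as suggested above;
this is not a posted fix:

	err = populate_pgtable_except_pte(arena);
	if (err) {
		/* free the range_node that range_tree_set() allocated above */
		range_tree_destroy(&arena->rt);
		bpf_map_area_free(arena);
		goto err;
	}
)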

* [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

To make arena_alloc_pages() safe to be called from any context, replace
kvcalloc() with kmalloc_nolock() so that it doesn't sleep or take any
locks.

kmalloc_nolock() returns NULL for allocations larger than
KMALLOC_MAX_CACHE_SIZE, which is (PAGE_SIZE * 2) = 8KB on systems with
4KB pages. So, cap the allocation done by kmalloc_nolock() at
KMALLOC_MAX_CACHE_SIZE (1024 pointers of 8 bytes each) and reuse the
array in a loop.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/arena.c | 76 +++++++++++++++++++++++++++++++---------------
 1 file changed, 52 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 48b8ffba3c88..7fa6e40ab3fc 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -43,6 +43,8 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)

+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+
 struct bpf_arena {
 	struct bpf_map map;
 	u64 user_vm_start;
@@ -491,7 +493,10 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	/* user_vm_end/start are fixed before bpf prog runs */
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
 	u64 kern_vm_start = bpf_arena_get_kern_vm_start(arena);
+	struct apply_range_data data;
 	struct page **pages = NULL;
+	long remaining, mapped = 0;
+	long alloc_pages;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -508,12 +513,16 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	}

-	/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
-	pages = kvcalloc(page_cnt, sizeof(struct page *), GFP_KERNEL);
+	/*
+	 * Cap allocation size to KMALLOC_MAX_CACHE_SIZE so kmalloc_nolock() can succeed.
+	 */
+	alloc_pages = min(page_cnt, KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *));
+	pages = kmalloc_nolock(alloc_pages * sizeof(struct page *), 0, NUMA_NO_NODE);
 	if (!pages)
 		return 0;
+	data.pages = pages;

-	guard(mutex)(&arena->lock);
+	mutex_lock(&arena->lock);

 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
@@ -528,32 +537,51 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	if (ret)
 		goto out_free_pages;

-	struct apply_range_data data = { .pages = pages, .i = 0 };
-	ret = bpf_map_alloc_pages(&arena->map, node_id, page_cnt, pages);
-	if (ret)
-		goto out;
-
+	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
-	/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
-	 * will not overflow 32-bit. Lower 32-bit need to represent
-	 * contiguous user address range.
-	 * Map these pages at kern_vm_start base.
-	 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
-	 * lower 32-bit and it's ok.
-	 */
-	ret = apply_to_page_range(&init_mm, kern_vm_start + uaddr32,
-				  page_cnt << PAGE_SHIFT, apply_range_set_cb, &data);
-	if (ret) {
-		for (i = 0; i < page_cnt; i++)
-			__free_page(pages[i]);
-		goto out;
+
+	while(remaining) {
+		long this_batch = min(remaining, alloc_pages);
+		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
+		memset(pages, 0, this_batch * sizeof(struct page *));
+		data.i = 0;
+
+		ret = bpf_map_alloc_pages(&arena->map, node_id, this_batch, pages);
+		if (ret)
+			goto out;
+
+		/* Earlier checks made sure that uaddr32 + page_cnt * PAGE_SIZE - 1
+		 * will not overflow 32-bit. Lower 32-bit need to represent
+		 * contiguous user address range.
+		 * Map these pages at kern_vm_start base.
+		 * kern_vm_start + uaddr32 + page_cnt * PAGE_SIZE - 1 can overflow
+		 * lower 32-bit and it's ok.
+		 */
+		ret = apply_to_page_range(&init_mm,
+					  kern_vm_start + uaddr32 + (mapped << PAGE_SHIFT),
+					  this_batch << PAGE_SHIFT, apply_range_set_cb, &data);
+		if (ret) {
+			/* data.i pages were mapped, account them and free the remaining */
+			mapped += data.i;
+			for (i = data.i; i < this_batch; i++)
+				__free_page(pages[i]);
+			goto out;
+		}
+
+		mapped += this_batch;
+		remaining -= this_batch;
 	}
-	kvfree(pages);
+	mutex_unlock(&arena->lock);
+	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
-	range_tree_set(&arena->rt, pgoff, page_cnt);
+	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
+	mutex_unlock(&arena->lock);
+	if (mapped)
+		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
 out_free_pages:
-	kvfree(pages);
+	mutex_unlock(&arena->lock);
+	kfree_nolock(pages);
 	return 0;
 }
--
2.47.1

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: bot+bpf-ci @ 2025-11-14 11:39 UTC
To: puranjay, bpf
Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87,
	memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
> index 48b8ffba3..7fa6e40ab 100644
> --- a/kernel/bpf/arena.c
> +++ b/kernel/bpf/arena.c

[ ... ]

> @@ -528,32 +537,51 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt

[ ... ]

> -	kvfree(pages);
> +	mutex_unlock(&arena->lock);
> +	kfree_nolock(pages);
>  	return clear_lo32(arena->user_vm_start) + uaddr32;
>  out:
> -	range_tree_set(&arena->rt, pgoff, page_cnt);
> +	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
> +	mutex_unlock(&arena->lock);
> +	if (mapped)
> +		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
        ^^^^

Can arena_free_pages() be called here while holding arena->lock?
arena_free_pages() at line 618 acquires arena->lock using guard(mutex),
which will deadlock since the lock was already acquired at line 525 in
arena_alloc_pages().

> +out_free_pages:
> +	mutex_unlock(&arena->lock);
        ^^^^

Does this code unlock arena->lock twice in the error path? When the
"goto out" executes from the while loop, the mutex is unlocked at the
"out:" label above, then falls through to "out_free_pages:" and unlocks
again here.

> +	kfree_nolock(pages);
>  	return 0;
>  }

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Puranjay Mohan @ 2025-11-14 15:13 UTC
To: bot+bpf-ci, bpf
Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team,
	yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> [ ... ]
>
>>  out:
>> -	range_tree_set(&arena->rt, pgoff, page_cnt);
>> +	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
>> +	mutex_unlock(&arena->lock);
>> +	if (mapped)
>> +		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
> ^^^^
>
> Can arena_free_pages() be called here while holding arena->lock?
> arena_free_pages() at line 618 acquires arena->lock using guard(mutex),
> which will deadlock since the lock was already acquired at line 525 in
> arena_alloc_pages().

arena_free_pages() will not be called with arena->lock taken; there is
a mutex_unlock() before the call to arena_free_pages().

>> +out_free_pages:
>> +	mutex_unlock(&arena->lock);
> ^^^^
>
> Does this code unlock arena->lock twice in the error path? When the
> "goto out" executes from the while loop, the mutex is unlocked at the
> "out:" label above, then falls through to "out_free_pages:" and unlocks
> again here.

This is fixed by the next patch by adding another label, but I missed it
here. Will fix it in the next version.

> [ ... ]

* Re: [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc()
From: Alexei Starovoitov @ 2025-11-14 21:25 UTC
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Kernel Team

On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> +
> +	while(remaining) {
> +		long this_batch = min(remaining, alloc_pages);
> +		/* zeroing is needed, since alloc_pages_bulk() only fills in non-zero entries */
> +		memset(pages, 0, this_batch * sizeof(struct page *));

run checkpatch pls.
Above needs extra space after while and empty line after 'long this_batch'.

* [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe
From: Puranjay Mohan @ 2025-11-14 11:16 UTC
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Make arena related kfuncs any context safe by the following changes:

bpf_arena_alloc_pages() and bpf_arena_reserve_pages():

Replace the usage of the mutex with a rqspinlock for the range tree and
use kmalloc_nolock() wherever needed. Use free_pages_nolock() to free
pages from any context. apply_range_set/clear_cb() with
apply_to_page_range() has already made populating the vm_area in
bpf_arena_alloc_pages() any context safe.

bpf_arena_free_pages():

Defer the main logic to a workqueue if it is called from a
non-sleepable context. specialize_kfunc() is used to replace the
sleepable arena_free_pages() with bpf_arena_free_pages_non_sleepable()
when the verifier detects the call is from a non-sleepable context.

In the non-sleepable case, arena_free_pages() queues the address and
the page count to be freed to a lock-less list of struct
arena_free_span and raises an irq_work. The irq_work handler calls
schedule_work() as it is safe to be called from irq context.
arena_free_worker() (the workqueue handler) iterates these spans and
clears ptes, flushes the tlb, zaps pages, and calls __free_page().

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 include/linux/bpf.h   |   2 +
 kernel/bpf/arena.c    | 236 +++++++++++++++++++++++++++++++++++-------
 kernel/bpf/verifier.c |   5 +
 3 files changed, 203 insertions(+), 40 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 09d5dc541d1c..5279212694b4 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -673,6 +673,8 @@ void bpf_map_free_internal_structs(struct bpf_map *map, void *obj);
 int bpf_dynptr_from_file_sleepable(struct file *file, u32 flags,
 				   struct bpf_dynptr *ptr__uninit);

+void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt);
+
 extern const struct bpf_map_ops bpf_map_offload_ops;

 /* bpf_type_flag contains a set of flags that are applicable to the values of
diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c
index 7fa6e40ab3fc..ca443c113a1b 100644
--- a/kernel/bpf/arena.c
+++ b/kernel/bpf/arena.c
@@ -3,7 +3,9 @@
 #include <linux/bpf.h>
 #include <linux/btf.h>
 #include <linux/err.h>
+#include <linux/irq_work.h>
 #include "linux/filter.h"
+#include <linux/llist.h>
 #include <linux/btf_ids.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
@@ -43,7 +45,7 @@
 #define GUARD_SZ round_up(1ull << sizeof_field(struct bpf_insn, off) * 8, PAGE_SIZE << 1)
 #define KERN_VM_SZ (SZ_4G + GUARD_SZ)

-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt);
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable);

 struct bpf_arena {
 	struct bpf_map map;
@@ -51,8 +53,23 @@ struct bpf_arena {
 	u64 user_vm_end;
 	struct vm_struct *kern_vm;
 	struct range_tree rt;
+	/* protects rt */
+	rqspinlock_t spinlock;
 	struct list_head vma_list;
+	/* protects vma_list */
 	struct mutex lock;
+	struct irq_work free_irq;
+	struct work_struct free_work;
+	struct llist_head free_spans;
+};
+
+static void arena_free_worker(struct work_struct *work);
+static void arena_free_irq(struct irq_work *iw);
+
+struct arena_free_span {
+	struct llist_node node;
+	unsigned long uaddr;
+	u32 page_cnt;
 };

 u64 bpf_arena_get_kern_vm_start(struct bpf_arena *arena)
@@ -120,7 +137,7 @@ static int apply_range_set_cb(pte_t *pte, unsigned long addr, void *data)
 	return 0;
 }

-static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
+static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *free_pages)
 {
 	pte_t old_pte;
 	struct page *page;
@@ -130,17 +147,16 @@ static int apply_range_clear_cb(pte_t *pte, unsigned long addr, void *data)
 	if (pte_none(old_pte) || !pte_present(old_pte))
 		return 0; /* nothing to do */

-	/* get page and free it */
+	/* get page and clear pte */
 	page = pte_page(old_pte);
 	if (WARN_ON_ONCE(!page))
 		return -EINVAL;

 	pte_clear(&init_mm, addr, pte);

-	/* ensure no stale TLB entries */
-	flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-
-	__free_page(page);
+	/* Add page to the list so it is freed later */
+	if (free_pages)
+		__llist_add(&page->pcp_llist, free_pages);

 	return 0;
 }
@@ -195,6 +211,9 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 	arena->user_vm_end = arena->user_vm_start + vm_range;

 	INIT_LIST_HEAD(&arena->vma_list);
+	init_llist_head(&arena->free_spans);
+	init_irq_work(&arena->free_irq, arena_free_irq);
+	INIT_WORK(&arena->free_work, arena_free_worker);
 	bpf_map_init_from_attr(&arena->map, attr);
 	range_tree_init(&arena->rt);
 	err = range_tree_set(&arena->rt, 0, attr->max_entries);
@@ -203,6 +222,7 @@ static struct bpf_map *arena_map_alloc(union bpf_attr *attr)
 		goto err;
 	}
 	mutex_init(&arena->lock);
+	raw_res_spin_lock_init(&arena->spinlock);
 	err = populate_pgtable_except_pte(arena);
 	if (err) {
 		bpf_map_area_free(arena);
@@ -248,6 +268,10 @@ static void arena_map_free(struct bpf_map *map)
 	if (WARN_ON_ONCE(!list_empty(&arena->vma_list)))
 		return;

+	/* Ensure no pending deferred frees */
+	irq_work_sync(&arena->free_irq);
+	flush_work(&arena->free_work);
+
 	/*
 	 * free_vm_area() calls remove_vm_area() that calls free_unmap_vmap_area().
 	 * It unmaps everything from vmalloc area and clears pgtables.
@@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)
 	struct bpf_arena *arena = container_of(map, struct bpf_arena, map);
 	struct page *page;
 	long kbase, kaddr;
+	unsigned long flags;
 	int ret;

 	kbase = bpf_arena_get_kern_vm_start(arena);
 	kaddr = kbase + (u32)(vmf->address);

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		/*
+		 * This is an impossible case and would only trigger if res_spin_lock is buggy or
+		 * due to another kernel bug.
+		 */
+		return VM_FAULT_RETRY;
+
 	page = vmalloc_to_page((void *)kaddr);
 	if (page)
 		/* already have a page vmap-ed */
@@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf)

 	ret = range_tree_clear(&arena->rt, vmf->pgoff, 1);
 	if (ret)
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;

 	struct apply_range_data data = { .pages = &page, .i = 0 };
 	/* Account into memcg of the process that created bpf_arena */
 	ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		return VM_FAULT_SIGSEGV;
+		goto out_unlock_sigsegv;
 	}

 	ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data);
 	if (ret) {
 		range_tree_set(&arena->rt, vmf->pgoff, 1);
-		__free_page(page);
-		return VM_FAULT_SIGSEGV;
+		free_pages_nolock(page, 0);
+		goto out_unlock_sigsegv;
 	}
 out:
 	page_ref_add(page, 1);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	vmf->page = page;
 	return 0;
+out_unlock_sigsegv:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return VM_FAULT_SIGSEGV;
 }

 static const struct vm_operations_struct arena_vm_ops = {
@@ -497,6 +532,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 	struct page **pages = NULL;
 	long remaining, mapped = 0;
 	long alloc_pages;
+	unsigned long flags;
 	long pgoff = 0;
 	u32 uaddr32;
 	int ret, i;
@@ -522,12 +558,13 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 		return 0;
 	data.pages = pages;

-	mutex_lock(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		goto out_free_pages;

 	if (uaddr) {
 		ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
 		if (ret)
-			goto out_free_pages;
+			goto out_unlock_free_pages;
 		ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	} else {
 		ret = pgoff = range_tree_find(&arena->rt, page_cnt);
@@ -535,7 +572,7 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
 	}
 	if (ret)
-		goto out_free_pages;
+		goto out_unlock_free_pages;

 	remaining = page_cnt;
 	uaddr32 = (u32)(arena->user_vm_start + pgoff * PAGE_SIZE);
@@ -564,23 +601,25 @@ static long arena_alloc_pages(struct bpf_arena *arena, long uaddr, long page_cnt
 			/* data.i pages were mapped, account them and free the remaining */
 			mapped += data.i;
 			for (i = data.i; i < this_batch; i++)
-				__free_page(pages[i]);
+				free_pages_nolock(pages[i], 0);
 			goto out;
 		}

 		mapped += this_batch;
 		remaining -= this_batch;
 	}
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	kfree_nolock(pages);
 	return clear_lo32(arena->user_vm_start) + uaddr32;
 out:
 	range_tree_set(&arena->rt, pgoff + mapped, page_cnt - mapped);
-	mutex_unlock(&arena->lock);
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 	if (mapped)
-		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped);
+		arena_free_pages(arena, clear_lo32(arena->user_vm_start) + uaddr32, mapped, false);
+	goto out_free_pages;
+out_unlock_free_pages:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
 out_free_pages:
-	mutex_unlock(&arena->lock);
 	kfree_nolock(pages);
 	return 0;
 }
@@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 {
 	struct vma_list *vml;

+	guard(mutex)(&arena->lock);
+	/* iterate link list under lock */
 	list_for_each_entry(vml, &arena->vma_list, head)
 		zap_page_range_single(vml->vma, uaddr,
 				      PAGE_SIZE * page_cnt, NULL);
 }

-static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
+static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
 {
 	u64 full_uaddr, uaddr_end;
-	long kaddr, pgoff, i;
+	long kaddr, pgoff;
 	struct page *page;
+	struct llist_head free_pages;
+	struct llist_node *pos, *t;
+	struct arena_free_span *s;
+	unsigned long flags;
+	int ret = 0;

 	/* only aligned lower 32-bit are relevant */
 	uaddr = (u32)uaddr;
 	uaddr &= PAGE_MASK;
+	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
 	full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
 	uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
 	if (full_uaddr >= uaddr_end)
 		return;

 	page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
+	pgoff = compute_pgoff(arena, uaddr);

-	guard(mutex)(&arena->lock);
+	if (!sleepable)
+		goto defer;
+
+	ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
+	/*
+	 * Can't proceed without holding the spinlock so defer the free
+	 */
+	if (ret)
+		goto defer;

-	pgoff = compute_pgoff(arena, uaddr);
-	/* clear range */
 	range_tree_set(&arena->rt, pgoff, page_cnt);

+	init_llist_head(&free_pages);
+	/* clear ptes and collect struct pages */
+	apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+				     apply_range_clear_cb, &free_pages);
+
+	/* drop the lock to do the tlb flush and zap pages */
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* ensure no stale TLB entries */
+	flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
 	if (page_cnt > 1)
 		/* bulk zap if multiple pages being freed */
 		zap_pages(arena, full_uaddr, page_cnt);

-	kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
-	for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
-		page = vmalloc_to_page((void *)kaddr);
-		if (!page)
-			continue;
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
 		if (page_cnt == 1 && page_mapped(page)) /* mapped by some user process */
 			/* Optimization for the common case of page_cnt==1:
 			 * If page wasn't mapped into some user vma there
 			 * is no need to call zap_pages which is slow. When
 			 * page_cnt is big it's faster to do the batched zap.
 			 */
 			zap_pages(arena, full_uaddr, 1);
-		apply_to_existing_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_clear_cb,
-					     NULL);
+		__free_page(page);
 	}
+
+	return;
+
+defer:
+	s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
+	if (!s)
+		return;
+
+	s->page_cnt = page_cnt;
+	s->uaddr = uaddr;
+	llist_add(&s->node, &arena->free_spans);
+	irq_work_queue(&arena->free_irq);
 }

 /*
@@ -649,6 +722,7 @@ static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
 static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt)
 {
 	long page_cnt_max = (arena->user_vm_end - arena->user_vm_start) >> PAGE_SHIFT;
+	unsigned long flags;
 	long pgoff;
 	int ret;
@@ -659,15 +733,87 @@ static int arena_reserve_pages(struct bpf_arena *arena, long uaddr, u32 page_cnt
 	if (pgoff + page_cnt > page_cnt_max)
 		return -EINVAL;

-	guard(mutex)(&arena->lock);
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags))
+		return -EBUSY;

 	/* Cannot guard already allocated pages. */
 	ret = is_range_tree_set(&arena->rt, pgoff, page_cnt);
-	if (ret)
-		return -EBUSY;
+	if (ret) {
+		ret = -EBUSY;
+		goto out;
+	}

 	/* "Allocate" the region to prevent it from being allocated. */
-	return range_tree_clear(&arena->rt, pgoff, page_cnt);
+	ret = range_tree_clear(&arena->rt, pgoff, page_cnt);
+out:
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+	return ret;
+}
+
+static void arena_free_worker(struct work_struct *work)
+{
+	struct bpf_arena *arena = container_of(work, struct bpf_arena, free_work);
+	struct llist_node *list, *pos, *t;
+	struct arena_free_span *s;
+	u64 arena_vm_start, user_vm_start;
+	struct llist_head free_pages;
+	struct page *page;
+	unsigned long full_uaddr;
+	long kaddr, page_cnt, pgoff;
+	unsigned long flags;
+
+	if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) {
+		schedule_work(work);
+		return;
+	}
+
+	init_llist_head(&free_pages);
+	arena_vm_start = bpf_arena_get_kern_vm_start(arena);
+	user_vm_start = bpf_arena_get_user_vm_start(arena);
+
+	list = llist_del_all(&arena->free_spans);
+	llist_for_each(pos, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		kaddr = arena_vm_start + s->uaddr;
+		pgoff = compute_pgoff(arena, s->uaddr);
+
+		/* clear ptes and collect pages in free_pages llist */
+		apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
+					     apply_range_clear_cb, &free_pages);
+
+		range_tree_set(&arena->rt, pgoff, page_cnt);
+	}
+	raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
+
+	/* Iterate the list again without holding spinlock to do the tlb flush and zap_pages */
+	llist_for_each_safe(pos, t, list) {
+		s = llist_entry(pos, struct arena_free_span, node);
+		page_cnt = s->page_cnt;
+		full_uaddr = user_vm_start + s->uaddr;
+		kaddr = arena_vm_start + s->uaddr;
+
+		/* ensure no stale TLB entries */
+		flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
+
+		/* remove pages from user vmas */
+		zap_pages(arena, full_uaddr, page_cnt);
+
+		kfree_nolock(s);
+	}
+
+	/* free all pages collected by apply_to_existing_page_range() in the first loop */
+	llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
+		page = llist_entry(pos, struct page, pcp_llist);
+		__free_page(page);
+	}
+}
+
+static void arena_free_irq(struct irq_work *iw)
+{
+	struct bpf_arena *arena = container_of(iw, struct bpf_arena, free_irq);
+
+	schedule_work(&arena->free_work);
 }

 __bpf_kfunc_start_defs();
@@ -691,7 +837,17 @@ __bpf_kfunc void bpf_arena_free_pages(void *p__map, void *ptr__ign, u32 page_cnt
 	if (map->map_type !=
BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) return; - arena_free_pages(arena, (long)ptr__ign, page_cnt); + arena_free_pages(arena, (long)ptr__ign, page_cnt, true); +} + +void bpf_arena_free_pages_non_sleepable(void *p__map, void *ptr__ign, u32 page_cnt) +{ + struct bpf_map *map = p__map; + struct bpf_arena *arena = container_of(map, struct bpf_arena, map); + + if (map->map_type != BPF_MAP_TYPE_ARENA || !page_cnt || !ptr__ign) + return; + arena_free_pages(arena, (long)ptr__ign, page_cnt, false); } __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_cnt) @@ -710,9 +866,9 @@ __bpf_kfunc int bpf_arena_reserve_pages(void *p__map, void *ptr__ign, u32 page_c __bpf_kfunc_end_defs(); BTF_KFUNCS_START(arena_kfuncs) -BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_RET | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) -BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_SLEEPABLE | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_alloc_pages, KF_TRUSTED_ARGS | KF_ARENA_RET | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_free_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) +BTF_ID_FLAGS(func, bpf_arena_reserve_pages, KF_TRUSTED_ARGS | KF_ARENA_ARG2) BTF_KFUNCS_END(arena_kfuncs) static const struct btf_kfunc_id_set common_kfunc_set = { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 1268fa075d4c..407f75daa1cb 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -12319,6 +12319,7 @@ enum special_kfunc_type { KF___bpf_trap, KF_bpf_task_work_schedule_signal, KF_bpf_task_work_schedule_resume, + KF_bpf_arena_free_pages, }; BTF_ID_LIST(special_kfunc_list) @@ -12393,6 +12394,7 @@ BTF_ID(func, bpf_dynptr_file_discard) BTF_ID(func, __bpf_trap) BTF_ID(func, bpf_task_work_schedule_signal) BTF_ID(func, bpf_task_work_schedule_resume) +BTF_ID(func, bpf_arena_free_pages) static bool is_task_work_add_kfunc(u32 func_id) { @@ -22350,6 +22352,9 @@ static int specialize_kfunc(struct bpf_verifier_env *env, struct bpf_kfunc_desc } else if (func_id == special_kfunc_list[KF_bpf_dynptr_from_file]) { if (!env->insn_aux_data[insn_idx].non_sleepable) addr = (unsigned long)bpf_dynptr_from_file_sleepable; + } else if (func_id == special_kfunc_list[KF_bpf_arena_free_pages]) { + if (env->insn_aux_data[insn_idx].non_sleepable) + addr = (unsigned long)bpf_arena_free_pages_non_sleepable; } set_imm: -- 2.47.1 ^ permalink raw reply related [flat|nested] 22+ messages in thread
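The deferred-free machinery in this patch hangs off a small span
descriptor that is allocated with kmalloc_nolock() and pushed onto
arena->free_spans; its definition is not visible in the hunks above.
Going only by how arena_free_pages() and arena_free_worker() touch it,
the descriptor is presumably shaped like the sketch below (field types
inferred from usage, not taken from the posted patch):

        /* Sketch only: shape inferred from s->uaddr, s->page_cnt and
         * llist_add(&s->node, &arena->free_spans) in this patch.
         */
        struct arena_free_span {
                struct llist_node node; /* entry in arena->free_spans */
                long uaddr;             /* lower 32 bits of the user address */
                long page_cnt;          /* pages covered by this span */
        };

llist_add() is a lock-less cmpxchg-based push, which is what lets the
defer path run from any context; the irq_work then bounces the actual
teardown (TLB flush, zap_pages(), __free_page()) to a workqueue, where
sleeping is allowed.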
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan @ 2025-11-14 11:47 ` bot+bpf-ci 2025-11-14 15:28 ` Puranjay Mohan 2025-11-14 21:27 ` Alexei Starovoitov ` (2 subsequent siblings) 3 siblings, 1 reply; 22+ messages in thread From: bot+bpf-ci @ 2025-11-14 11:47 UTC (permalink / raw) To: puranjay, bpf Cc: puranjay, puranjay12, ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai [-- Attachment #1: Type: text/plain, Size: 5133 bytes --] > diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c > index 7fa6e40ab..ca443c113 100644 > --- a/kernel/bpf/arena.c > +++ b/kernel/bpf/arena.c [ ... ] > @@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > struct bpf_arena *arena = container_of(map, struct bpf_arena, map); > struct page *page; > long kbase, kaddr; > + unsigned long flags; > int ret; > > kbase = bpf_arena_get_kern_vm_start(arena); > kaddr = kbase + (u32)(vmf->address); > > - guard(mutex)(&arena->lock); > + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) > + /* > + * This is an impossible case and would only trigger if res_spin_lock is buggy or > + * due to another kernel bug. > + */ > + return VM_FAULT_RETRY; > + > page = vmalloc_to_page((void *)kaddr); > if (page) > /* already have a page vmap-ed */ > @@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) > > ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); > if (ret) > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > > struct apply_range_data data = { .pages = &page, .i = 0 }; > /* Account into memcg of the process that created bpf_arena */ > ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - return VM_FAULT_SIGSEGV; > + goto out_unlock_sigsegv; > } > > ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); > if (ret) { > range_tree_set(&arena->rt, vmf->pgoff, 1); > - __free_page(page); > - return VM_FAULT_SIGSEGV; > + free_pages_nolock(page, 0); > + goto out_unlock_sigsegv; > } > out: > page_ref_add(page, 1); > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > vmf->page = page; > return 0; In arena_vm_fault(), there's a path between acquiring the spinlock and the new goto labels that doesn't release the lock: > if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) > return VM_FAULT_SIGSEGV; This return statement is reached after raw_res_spin_lock_irqsave() succeeds but before any of the unlock paths. If BPF_F_SEGV_ON_FAULT is set, subsequent page faults on this arena will deadlock. All other return paths in the function properly jump to out_unlock_sigsegv to release the lock. > +out_unlock_sigsegv: > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > + return VM_FAULT_SIGSEGV; > } [ ... 
] > @@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > { > struct vma_list *vml; > > + guard(mutex)(&arena->lock); > + /* iterate link list under lock */ > list_for_each_entry(vml, &arena->vma_list, head) > zap_page_range_single(vml->vma, uaddr, > PAGE_SIZE * page_cnt, NULL); > } > > -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt) > +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable) > { > u64 full_uaddr, uaddr_end; > - long kaddr, pgoff, i; > + long kaddr, pgoff; > struct page *page; > + struct llist_head free_pages; > + struct llist_node *pos, *t; > + struct arena_free_span *s; > + unsigned long flags; > + int ret = 0; > > /* only aligned lower 32-bit are relevant */ > uaddr = (u32)uaddr; > uaddr &= PAGE_MASK; > + kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; > full_uaddr = clear_lo32(arena->user_vm_start) + uaddr; > uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT)); > if (full_uaddr >= uaddr_end) > return; > > page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT; > + pgoff = compute_pgoff(arena, uaddr); > > - guard(mutex)(&arena->lock); > + if (!sleepable) > + goto defer; > + > + ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags); > + /* > + * Can't proceed without holding the spinlock so defer the free > + */ > + if (ret) > + goto defer; [ ... ] > + return; > + > +defer: > + s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1); > + if (!s) > + return; In arena_free_pages(), when the deferred free path is taken but kmalloc_nolock() fails, the function returns silently without freeing the pages. These pages remain mapped in the kernel page tables and marked as allocated in the range tree, but become unusable. This can happen when: - The function is called from a non-sleepable context (sleepable=false) - Lock acquisition fails and falls through to defer - kmalloc_nolock() returns NULL (PREEMPT_RT+IRQ, lock contention, or unsupported arch) The deferred free path is called from arena_alloc_pages() error handling and bpf_arena_free_pages_non_sleepable(). In both cases, the pages that should be freed will leak. Should there be a fallback mechanism or at least a warning when deferral fails? > + > + s->page_cnt = page_cnt; > + s->uaddr = uaddr; > + llist_add(&s->node, &arena->free_spans); > + irq_work_queue(&arena->free_irq); > } --- AI reviewed your patch. Please fix the bug or email reply why it's not a bug. See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319 ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:47 ` bot+bpf-ci @ 2025-11-14 15:28 ` Puranjay Mohan 0 siblings, 0 replies; 22+ messages in thread From: Puranjay Mohan @ 2025-11-14 15:28 UTC (permalink / raw) To: bot+bpf-ci, bpf Cc: ast, andrii, daniel, martin.lau, eddyz87, memxor, kernel-team, ast, andrii, daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai bot+bpf-ci@kernel.org writes: >> diff --git a/kernel/bpf/arena.c b/kernel/bpf/arena.c >> index 7fa6e40ab..ca443c113 100644 >> --- a/kernel/bpf/arena.c >> +++ b/kernel/bpf/arena.c > > [ ... ] > >> @@ -331,12 +355,19 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) >> struct bpf_arena *arena = container_of(map, struct bpf_arena, map); >> struct page *page; >> long kbase, kaddr; >> + unsigned long flags; >> int ret; >> >> kbase = bpf_arena_get_kern_vm_start(arena); >> kaddr = kbase + (u32)(vmf->address); >> >> - guard(mutex)(&arena->lock); >> + if (raw_res_spin_lock_irqsave(&arena->spinlock, flags)) >> + /* >> + * This is an impossible case and would only trigger if res_spin_lock is buggy or >> + * due to another kernel bug. >> + */ >> + return VM_FAULT_RETRY; >> + >> page = vmalloc_to_page((void *)kaddr); >> if (page) >> /* already have a page vmap-ed */ >> @@ -348,26 +379,30 @@ static vm_fault_t arena_vm_fault(struct vm_fault *vmf) >> >> ret = range_tree_clear(&arena->rt, vmf->pgoff, 1); >> if (ret) >> - return VM_FAULT_SIGSEGV; >> + goto out_unlock_sigsegv; >> >> struct apply_range_data data = { .pages = &page, .i = 0 }; >> /* Account into memcg of the process that created bpf_arena */ >> ret = bpf_map_alloc_pages(map, NUMA_NO_NODE, 1, &page); >> if (ret) { >> range_tree_set(&arena->rt, vmf->pgoff, 1); >> - return VM_FAULT_SIGSEGV; >> + goto out_unlock_sigsegv; >> } >> >> ret = apply_to_page_range(&init_mm, kaddr, PAGE_SIZE, apply_range_set_cb, &data); >> if (ret) { >> range_tree_set(&arena->rt, vmf->pgoff, 1); >> - __free_page(page); >> - return VM_FAULT_SIGSEGV; >> + free_pages_nolock(page, 0); >> + goto out_unlock_sigsegv; >> } >> out: >> page_ref_add(page, 1); >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> vmf->page = page; >> return 0; > > In arena_vm_fault(), there's a path between acquiring the spinlock and > the new goto labels that doesn't release the lock: > >> if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT) >> return VM_FAULT_SIGSEGV; > > This return statement is reached after raw_res_spin_lock_irqsave() > succeeds but before any of the unlock paths. If BPF_F_SEGV_ON_FAULT is > set, subsequent page faults on this arena will deadlock. All other > return paths in the function properly jump to out_unlock_sigsegv to > release the lock. Good catch, this needs to be goto out_unlock_sigsegv; >> +out_unlock_sigsegv: >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> + return VM_FAULT_SIGSEGV; >> } > > [ ... 
]
>
>> @@ -594,42 +633,65 @@ static void zap_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
>> {
>>         struct vma_list *vml;
>>
>> +       guard(mutex)(&arena->lock);
>> +       /* iterate link list under lock */
>>         list_for_each_entry(vml, &arena->vma_list, head)
>>                 zap_page_range_single(vml->vma, uaddr,
>>                                       PAGE_SIZE * page_cnt, NULL);
>> }
>>
>> -static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt)
>> +static void arena_free_pages(struct bpf_arena *arena, long uaddr, long page_cnt, bool sleepable)
>> {
>>         u64 full_uaddr, uaddr_end;
>> -       long kaddr, pgoff, i;
>> +       long kaddr, pgoff;
>>         struct page *page;
>> +       struct llist_head free_pages;
>> +       struct llist_node *pos, *t;
>> +       struct arena_free_span *s;
>> +       unsigned long flags;
>> +       int ret = 0;
>>
>>         /* only aligned lower 32-bit are relevant */
>>         uaddr = (u32)uaddr;
>>         uaddr &= PAGE_MASK;
>> +       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
>>         full_uaddr = clear_lo32(arena->user_vm_start) + uaddr;
>>         uaddr_end = min(arena->user_vm_end, full_uaddr + (page_cnt << PAGE_SHIFT));
>>         if (full_uaddr >= uaddr_end)
>>                 return;
>>
>>         page_cnt = (uaddr_end - full_uaddr) >> PAGE_SHIFT;
>> +       pgoff = compute_pgoff(arena, uaddr);
>>
>> -       guard(mutex)(&arena->lock);
>> +       if (!sleepable)
>> +               goto defer;
>> +
>> +       ret = raw_res_spin_lock_irqsave(&arena->spinlock, flags);
>> +       /*
>> +        * Can't proceed without holding the spinlock so defer the free
>> +        */
>> +       if (ret)
>> +               goto defer;
>
> [ ... ]
>
>> +       return;
>> +
>> +defer:
>> +       s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
>> +       if (!s)
>> +               return;
>
> In arena_free_pages(), when the deferred free path is taken but
> kmalloc_nolock() fails, the function returns silently without freeing
> the pages. These pages remain mapped in the kernel page tables and
> marked as allocated in the range tree, but become unusable.
>
> This can happen when:
> - The function is called from a non-sleepable context (sleepable=false)
> - Lock acquisition fails and falls through to defer
> - kmalloc_nolock() returns NULL (PREEMPT_RT+IRQ, lock contention, or
>   unsupported arch)
>
> The deferred free path is called from arena_alloc_pages() error
> handling and bpf_arena_free_pages_non_sleepable(). In both cases, the
> pages that should be freed will leak.
>
> Should there be a fallback mechanism or at least a warning when
> deferral fails?

Yes, this is expected: if kmalloc_nolock() fails in a non-sleepable
context, we don't have another way to get memory for the
arena_free_span, so we accept this trade-off. Failing a 24-byte
allocation here is highly unlikely, but we can add a WARN_ONCE here.

We had an offline discussion about adding debug counters to the arena;
maybe we can add a counter here as well.

And for the failure path of arena_alloc_pages(), we could give
arena_alloc_pages() a sleepable parameter and pass it on to
arena_free_pages(), so we don't defer unnecessarily when
arena_alloc_pages() is called in a sleepable context.

>> +
>> +       s->page_cnt = page_cnt;
>> +       s->uaddr = uaddr;
>> +       llist_add(&s->node, &arena->free_spans);
>> +       irq_work_queue(&arena->free_irq);
>> }
>
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/19363121319

^ permalink raw reply	[flat|nested] 22+ messages in thread
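Putting the two outcomes of this review together, the agreed follow-ups
could look roughly like the sketch below; this is an illustration of the
direction discussed in the subthread, not the posted code, and the
surrounding context is from arena_vm_fault() and arena_free_pages() in
this patch:

        /* 1) arena_vm_fault(): don't return with the rqspinlock held */
        if (arena->map.map_flags & BPF_F_SEGV_ON_FAULT)
                goto out_unlock_sigsegv;   /* was: return VM_FAULT_SIGSEGV; */

        /* 2) arena_free_pages(): make a failed deferral visible */
        defer:
                s = kmalloc_nolock(sizeof(struct arena_free_span), 0, -1);
                if (!s) {
                        /* ~24-byte allocation; failure leaks the span's pages */
                        WARN_ONCE(1, "bpf_arena: dropping deferred free of %ld pages\n",
                                  page_cnt);
                        return;
                }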
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan 2025-11-14 11:47 ` bot+bpf-ci @ 2025-11-14 21:27 ` Alexei Starovoitov 2025-11-15 0:56 ` Puranjay Mohan 2025-11-15 8:18 ` kernel test robot 2025-11-16 1:15 ` kernel test robot 3 siblings, 1 reply; 22+ messages in thread From: Alexei Starovoitov @ 2025-11-14 21:27 UTC (permalink / raw) To: Puranjay Mohan Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote: > > > + init_llist_head(&free_pages); > + /* clear ptes and collect struct pages */ > + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, > + apply_range_clear_cb, &free_pages); > + > + /* drop the lock to do the tlb flush and zap pages */ > + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); > + > + /* ensure no stale TLB entries */ > + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); > + > if (page_cnt > 1) > /* bulk zap if multiple pages being freed */ > zap_pages(arena, full_uaddr, page_cnt); > > - kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; > - for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) { > - page = vmalloc_to_page((void *)kaddr); > - if (!page) > - continue; > + llist_for_each_safe(pos, t, llist_del_all(&free_pages)) { llist_del_all() ?! Why? it's a variable on stack. There is no race. ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 21:27 ` Alexei Starovoitov @ 2025-11-15 0:56 ` Puranjay Mohan 2025-11-15 1:28 ` Alexei Starovoitov 0 siblings, 1 reply; 22+ messages in thread From: Puranjay Mohan @ 2025-11-15 0:56 UTC (permalink / raw) To: Alexei Starovoitov Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, Kernel Team Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote: >> >> >> + init_llist_head(&free_pages); >> + /* clear ptes and collect struct pages */ >> + apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT, >> + apply_range_clear_cb, &free_pages); >> + >> + /* drop the lock to do the tlb flush and zap pages */ >> + raw_res_spin_unlock_irqrestore(&arena->spinlock, flags); >> + >> + /* ensure no stale TLB entries */ >> + flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE)); >> + >> if (page_cnt > 1) >> /* bulk zap if multiple pages being freed */ >> zap_pages(arena, full_uaddr, page_cnt); >> >> - kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr; >> - for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) { >> - page = vmalloc_to_page((void *)kaddr); >> - if (!page) >> - continue; >> + llist_for_each_safe(pos, t, llist_del_all(&free_pages)) { > > llist_del_all() ?! Why? it's a variable on stack. There is no race. Yeah, I should have used __llist_del_all() which doesn't do an xchg() or in this case I can just use free_pages.first ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe
  2025-11-15  0:56           ` Puranjay Mohan
@ 2025-11-15  1:28             ` Alexei Starovoitov
  0 siblings, 0 replies; 22+ messages in thread
From: Alexei Starovoitov @ 2025-11-15  1:28 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

On Fri, Nov 14, 2025 at 4:56 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >>
> >>
> >> +       init_llist_head(&free_pages);
> >> +       /* clear ptes and collect struct pages */
> >> +       apply_to_existing_page_range(&init_mm, kaddr, page_cnt << PAGE_SHIFT,
> >> +                                    apply_range_clear_cb, &free_pages);
> >> +
> >> +       /* drop the lock to do the tlb flush and zap pages */
> >> +       raw_res_spin_unlock_irqrestore(&arena->spinlock, flags);
> >> +
> >> +       /* ensure no stale TLB entries */
> >> +       flush_tlb_kernel_range(kaddr, kaddr + (page_cnt * PAGE_SIZE));
> >> +
> >>         if (page_cnt > 1)
> >>                 /* bulk zap if multiple pages being freed */
> >>                 zap_pages(arena, full_uaddr, page_cnt);
> >>
> >> -       kaddr = bpf_arena_get_kern_vm_start(arena) + uaddr;
> >> -       for (i = 0; i < page_cnt; i++, kaddr += PAGE_SIZE, full_uaddr += PAGE_SIZE) {
> >> -               page = vmalloc_to_page((void *)kaddr);
> >> -               if (!page)
> >> -                       continue;
> >> +       llist_for_each_safe(pos, t, llist_del_all(&free_pages)) {
> >
> > llist_del_all() ?! Why? it's a variable on stack. There is no race.
>
> Yeah, I should have used __llist_del_all() which doesn't do an xchg() or
> in this case I can just use free_pages.first

Either one works. Slight preference for __llist_del_all() to avoid
peeking into llist details.

^ permalink raw reply	[flat|nested] 22+ messages in thread
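Since free_pages is a stack-local llist_head with a single consumer,
detaching its contents needs no atomics. A minimal sketch of the change
agreed above (__llist_del_all() instead of llist_del_all(); reading
free_pages.first directly would also work), reusing the pcp_llist field
the patch already relies on:

        struct llist_head free_pages;
        struct llist_node *pos, *t;
        struct page *page;

        init_llist_head(&free_pages);
        /* ... apply_to_existing_page_range() collects pages here ... */

        /* __llist_del_all() just takes head->first without the xchg()
         * that llist_del_all() does; safe because free_pages is on the
         * stack and nothing else can touch it concurrently.
         */
        llist_for_each_safe(pos, t, __llist_del_all(&free_pages)) {
                page = llist_entry(pos, struct page, pcp_llist);
                __free_page(page);
        }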
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan 2025-11-14 11:47 ` bot+bpf-ci 2025-11-14 21:27 ` Alexei Starovoitov @ 2025-11-15 8:18 ` kernel test robot 2025-11-16 1:15 ` kernel test robot 3 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-11-15 8:18 UTC (permalink / raw) To: Puranjay Mohan, bpf Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team Hi Puranjay, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/master] url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251114-192509 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master patch link: https://lore.kernel.org/r/20251114111700.43292-4-puranjay%40kernel.org patch subject: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe config: xtensa-randconfig-r132-20251115 (https://download.01.org/0day-ci/archive/20251115/202511151534.L0gsQeTi-lkp@intel.com/config) compiler: xtensa-linux-gcc (GCC) 8.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251115/202511151534.L0gsQeTi-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202511151534.L0gsQeTi-lkp@intel.com/ All errors (new ones prefixed by >>): xtensa-linux-ld: kernel/bpf/verifier.o: in function `convert_ctx_accesses': >> kernel/bpf/verifier.c:21986: undefined reference to `bpf_arena_free_pages_non_sleepable' vim +21986 kernel/bpf/verifier.c a4b1d3c1ddf6cb Jiong Wang 2019-05-24 21682 c64b7983288e63 Joe Stringer 2018-10-02 21683 /* convert load instructions that access fields of a context type into a c64b7983288e63 Joe Stringer 2018-10-02 21684 * sequence of instructions that access fields of the underlying structure: c64b7983288e63 Joe Stringer 2018-10-02 21685 * struct __sk_buff -> struct sk_buff c64b7983288e63 Joe Stringer 2018-10-02 21686 * struct bpf_sock_ops -> struct sock 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21687 */ 58e2af8b3a6b58 Jakub Kicinski 2016-09-21 21688 static int convert_ctx_accesses(struct bpf_verifier_env *env) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21689 { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21690 struct bpf_subprog_info *subprogs = env->subprog_info; 00176a34d9e27a Jakub Kicinski 2017-10-16 21691 const struct bpf_verifier_ops *ops = env->ops; d519594ee2445d Amery Hung 2025-02-25 21692 int i, cnt, size, ctx_field_size, ret, delta = 0, epilogue_cnt = 0; 3df126f35f88dc Jakub Kicinski 2016-09-21 21693 const int insn_cnt = env->prog->len; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21694 struct bpf_insn *epilogue_buf = env->epilogue_buf; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21695 struct bpf_insn *insn_buf = env->insn_buf; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21696 struct bpf_insn *insn; 46f53a65d2de3e Andrey Ignatov 2018-11-10 21697 u32 target_size, size_default, off; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21698 struct bpf_prog *new_prog; d691f9e8d4405c Alexei Starovoitov 2015-06-04 21699 enum bpf_access_type type; f96da09473b52c Daniel Borkmann 2017-07-02 21700 bool is_narrower_load; 
169c31761c8d7f Martin KaFai Lau 2024-08-29 21701 int epilogue_idx = 0; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21702 169c31761c8d7f Martin KaFai Lau 2024-08-29 21703 if (ops->gen_epilogue) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21704 epilogue_cnt = ops->gen_epilogue(epilogue_buf, env->prog, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21705 -(subprogs[0].stack_depth + 8)); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21706 if (epilogue_cnt >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21707 verifier_bug(env, "epilogue is too long"); fd508bde5d646f Luis Gerhorst 2025-06-03 21708 return -EFAULT; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21709 } else if (epilogue_cnt) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21710 /* Save the ARG_PTR_TO_CTX for the epilogue to use */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21711 cnt = 0; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21712 subprogs[0].stack_depth += 8; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21713 insn_buf[cnt++] = BPF_STX_MEM(BPF_DW, BPF_REG_FP, BPF_REG_1, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21714 -subprogs[0].stack_depth); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21715 insn_buf[cnt++] = env->prog->insnsi[0]; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21716 new_prog = bpf_patch_insn_data(env, 0, insn_buf, cnt); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21717 if (!new_prog) 169c31761c8d7f Martin KaFai Lau 2024-08-29 21718 return -ENOMEM; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21719 env->prog = new_prog; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21720 delta += cnt - 1; d519594ee2445d Amery Hung 2025-02-25 21721 d519594ee2445d Amery Hung 2025-02-25 21722 ret = add_kfunc_in_insns(env, epilogue_buf, epilogue_cnt - 1); d519594ee2445d Amery Hung 2025-02-25 21723 if (ret < 0) d519594ee2445d Amery Hung 2025-02-25 21724 return ret; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21725 } 169c31761c8d7f Martin KaFai Lau 2024-08-29 21726 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21727 b09928b976280d Daniel Borkmann 2018-10-24 21728 if (ops->gen_prologue || env->seen_direct_write) { b09928b976280d Daniel Borkmann 2018-10-24 21729 if (!ops->gen_prologue) { 0df1a55afa832f Paul Chaignon 2025-07-01 21730 verifier_bug(env, "gen_prologue is null"); fd508bde5d646f Luis Gerhorst 2025-06-03 21731 return -EFAULT; b09928b976280d Daniel Borkmann 2018-10-24 21732 } 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21733 cnt = ops->gen_prologue(insn_buf, env->seen_direct_write, 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21734 env->prog); 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21735 if (cnt >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21736 verifier_bug(env, "prologue is too long"); fd508bde5d646f Luis Gerhorst 2025-06-03 21737 return -EFAULT; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21738 } else if (cnt) { 8041902dae5299 Alexei Starovoitov 2017-03-15 21739 new_prog = bpf_patch_insn_data(env, 0, insn_buf, cnt); 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21740 if (!new_prog) 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21741 return -ENOMEM; 8041902dae5299 Alexei Starovoitov 2017-03-15 21742 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21743 env->prog = new_prog; 3df126f35f88dc Jakub Kicinski 2016-09-21 21744 delta += cnt - 1; d519594ee2445d Amery Hung 2025-02-25 21745 d519594ee2445d Amery Hung 2025-02-25 21746 ret = add_kfunc_in_insns(env, insn_buf, cnt - 1); d519594ee2445d Amery Hung 2025-02-25 21747 if (ret < 0) d519594ee2445d Amery Hung 2025-02-25 21748 return ret; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21749 } 
36bbef52c7eb64 Daniel Borkmann 2016-09-20 21750 } 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21751 d5c47719f24438 Martin KaFai Lau 2024-08-29 21752 if (delta) d5c47719f24438 Martin KaFai Lau 2024-08-29 21753 WARN_ON(adjust_jmp_off(env->prog, 0, delta)); d5c47719f24438 Martin KaFai Lau 2024-08-29 21754 9d03ebc71a027c Stanislav Fomichev 2023-01-19 21755 if (bpf_prog_is_offloaded(env->prog->aux)) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21756 return 0; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21757 3df126f35f88dc Jakub Kicinski 2016-09-21 21758 insn = env->prog->insnsi + delta; 36bbef52c7eb64 Daniel Borkmann 2016-09-20 21759 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21760 for (i = 0; i < insn_cnt; i++, insn++) { c64b7983288e63 Joe Stringer 2018-10-02 21761 bpf_convert_ctx_access_t convert_ctx_access; 1f1e864b65554e Yonghong Song 2023-07-27 21762 u8 mode; c64b7983288e63 Joe Stringer 2018-10-02 21763 d6f1c85f22534d Luis Gerhorst 2025-06-03 21764 if (env->insn_aux_data[i + delta].nospec) { d6f1c85f22534d Luis Gerhorst 2025-06-03 21765 WARN_ON_ONCE(env->insn_aux_data[i + delta].alu_state); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21766 struct bpf_insn *patch = insn_buf; d6f1c85f22534d Luis Gerhorst 2025-06-03 21767 45e9cd38aa8df9 Yonghong Song 2025-07-03 21768 *patch++ = BPF_ST_NOSPEC(); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21769 *patch++ = *insn; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21770 cnt = patch - insn_buf; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21771 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); d6f1c85f22534d Luis Gerhorst 2025-06-03 21772 if (!new_prog) d6f1c85f22534d Luis Gerhorst 2025-06-03 21773 return -ENOMEM; d6f1c85f22534d Luis Gerhorst 2025-06-03 21774 d6f1c85f22534d Luis Gerhorst 2025-06-03 21775 delta += cnt - 1; d6f1c85f22534d Luis Gerhorst 2025-06-03 21776 env->prog = new_prog; d6f1c85f22534d Luis Gerhorst 2025-06-03 21777 insn = new_prog->insnsi + i + delta; d6f1c85f22534d Luis Gerhorst 2025-06-03 21778 /* This can not be easily merged with the d6f1c85f22534d Luis Gerhorst 2025-06-03 21779 * nospec_result-case, because an insn may require a d6f1c85f22534d Luis Gerhorst 2025-06-03 21780 * nospec before and after itself. Therefore also do not d6f1c85f22534d Luis Gerhorst 2025-06-03 21781 * 'continue' here but potentially apply further d6f1c85f22534d Luis Gerhorst 2025-06-03 21782 * patching to insn. *insn should equal patch[1] now. 
d6f1c85f22534d Luis Gerhorst 2025-06-03 21783 */ d6f1c85f22534d Luis Gerhorst 2025-06-03 21784 } d6f1c85f22534d Luis Gerhorst 2025-06-03 21785 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21786 if (insn->code == (BPF_LDX | BPF_MEM | BPF_B) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21787 insn->code == (BPF_LDX | BPF_MEM | BPF_H) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21788 insn->code == (BPF_LDX | BPF_MEM | BPF_W) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21789 insn->code == (BPF_LDX | BPF_MEM | BPF_DW) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21790 insn->code == (BPF_LDX | BPF_MEMSX | BPF_B) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21791 insn->code == (BPF_LDX | BPF_MEMSX | BPF_H) || 1f9a1ea821ff25 Yonghong Song 2023-07-27 21792 insn->code == (BPF_LDX | BPF_MEMSX | BPF_W)) { d691f9e8d4405c Alexei Starovoitov 2015-06-04 21793 type = BPF_READ; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21794 } else if (insn->code == (BPF_STX | BPF_MEM | BPF_B) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21795 insn->code == (BPF_STX | BPF_MEM | BPF_H) || 62c7989b24dbd3 Daniel Borkmann 2017-01-12 21796 insn->code == (BPF_STX | BPF_MEM | BPF_W) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21797 insn->code == (BPF_STX | BPF_MEM | BPF_DW) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21798 insn->code == (BPF_ST | BPF_MEM | BPF_B) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21799 insn->code == (BPF_ST | BPF_MEM | BPF_H) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21800 insn->code == (BPF_ST | BPF_MEM | BPF_W) || 2039f26f3aca5b Daniel Borkmann 2021-07-13 21801 insn->code == (BPF_ST | BPF_MEM | BPF_DW)) { d691f9e8d4405c Alexei Starovoitov 2015-06-04 21802 type = BPF_WRITE; 880442305a3908 Peilin Ye 2025-03-04 21803 } else if ((insn->code == (BPF_STX | BPF_ATOMIC | BPF_B) || 880442305a3908 Peilin Ye 2025-03-04 21804 insn->code == (BPF_STX | BPF_ATOMIC | BPF_H) || 880442305a3908 Peilin Ye 2025-03-04 21805 insn->code == (BPF_STX | BPF_ATOMIC | BPF_W) || d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21806 insn->code == (BPF_STX | BPF_ATOMIC | BPF_DW)) && d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21807 env->insn_aux_data[i + delta].ptr_type == PTR_TO_ARENA) { d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21808 insn->code = BPF_STX | BPF_PROBE_ATOMIC | BPF_SIZE(insn->code); d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21809 env->prog->aux->num_exentries++; d503a04f8bc0c7 Alexei Starovoitov 2024-04-05 21810 continue; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21811 } else if (insn->code == (BPF_JMP | BPF_EXIT) && 169c31761c8d7f Martin KaFai Lau 2024-08-29 21812 epilogue_cnt && 169c31761c8d7f Martin KaFai Lau 2024-08-29 21813 i + delta < subprogs[1].start) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21814 /* Generate epilogue for the main prog */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21815 if (epilogue_idx) { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21816 /* jump back to the earlier generated epilogue */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21817 insn_buf[0] = BPF_JMP32_A(epilogue_idx - i - delta - 1); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21818 cnt = 1; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21819 } else { 169c31761c8d7f Martin KaFai Lau 2024-08-29 21820 memcpy(insn_buf, epilogue_buf, 169c31761c8d7f Martin KaFai Lau 2024-08-29 21821 epilogue_cnt * sizeof(*epilogue_buf)); 169c31761c8d7f Martin KaFai Lau 2024-08-29 21822 cnt = epilogue_cnt; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21823 /* epilogue_idx cannot be 0. 
It must have at 169c31761c8d7f Martin KaFai Lau 2024-08-29 21824 * least one ctx ptr saving insn before the 169c31761c8d7f Martin KaFai Lau 2024-08-29 21825 * epilogue. 169c31761c8d7f Martin KaFai Lau 2024-08-29 21826 */ 169c31761c8d7f Martin KaFai Lau 2024-08-29 21827 epilogue_idx = i + delta; 169c31761c8d7f Martin KaFai Lau 2024-08-29 21828 } 169c31761c8d7f Martin KaFai Lau 2024-08-29 21829 goto patch_insn_buf; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21830 } else { 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21831 continue; 2039f26f3aca5b Daniel Borkmann 2021-07-13 21832 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21833 af86ca4e3088fe Alexei Starovoitov 2018-05-15 21834 if (type == BPF_WRITE && 9124a4508007f1 Luis Gerhorst 2025-06-03 21835 env->insn_aux_data[i + delta].nospec_result) { d6f1c85f22534d Luis Gerhorst 2025-06-03 21836 /* nospec_result is only used to mitigate Spectre v4 and d6f1c85f22534d Luis Gerhorst 2025-06-03 21837 * to limit verification-time for Spectre v1. d6f1c85f22534d Luis Gerhorst 2025-06-03 21838 */ 45e9cd38aa8df9 Yonghong Song 2025-07-03 21839 struct bpf_insn *patch = insn_buf; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21840 45e9cd38aa8df9 Yonghong Song 2025-07-03 21841 *patch++ = *insn; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21842 *patch++ = BPF_ST_NOSPEC(); 45e9cd38aa8df9 Yonghong Song 2025-07-03 21843 cnt = patch - insn_buf; 45e9cd38aa8df9 Yonghong Song 2025-07-03 21844 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); af86ca4e3088fe Alexei Starovoitov 2018-05-15 21845 if (!new_prog) af86ca4e3088fe Alexei Starovoitov 2018-05-15 21846 return -ENOMEM; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21847 af86ca4e3088fe Alexei Starovoitov 2018-05-15 21848 delta += cnt - 1; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21849 env->prog = new_prog; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21850 insn = new_prog->insnsi + i + delta; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21851 continue; af86ca4e3088fe Alexei Starovoitov 2018-05-15 21852 } af86ca4e3088fe Alexei Starovoitov 2018-05-15 21853 6efe152d4061a8 Kumar Kartikeya Dwivedi 2022-04-25 21854 switch ((int)env->insn_aux_data[i + delta].ptr_type) { c64b7983288e63 Joe Stringer 2018-10-02 21855 case PTR_TO_CTX: c64b7983288e63 Joe Stringer 2018-10-02 21856 if (!ops->convert_ctx_access) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21857 continue; c64b7983288e63 Joe Stringer 2018-10-02 21858 convert_ctx_access = ops->convert_ctx_access; c64b7983288e63 Joe Stringer 2018-10-02 21859 break; c64b7983288e63 Joe Stringer 2018-10-02 21860 case PTR_TO_SOCKET: 46f8bc92758c62 Martin KaFai Lau 2019-02-09 21861 case PTR_TO_SOCK_COMMON: c64b7983288e63 Joe Stringer 2018-10-02 21862 convert_ctx_access = bpf_sock_convert_ctx_access; c64b7983288e63 Joe Stringer 2018-10-02 21863 break; 655a51e536c09d Martin KaFai Lau 2019-02-09 21864 case PTR_TO_TCP_SOCK: 655a51e536c09d Martin KaFai Lau 2019-02-09 21865 convert_ctx_access = bpf_tcp_sock_convert_ctx_access; 655a51e536c09d Martin KaFai Lau 2019-02-09 21866 break; fada7fdc83c0bf Jonathan Lemon 2019-06-06 21867 case PTR_TO_XDP_SOCK: fada7fdc83c0bf Jonathan Lemon 2019-06-06 21868 convert_ctx_access = bpf_xdp_sock_convert_ctx_access; fada7fdc83c0bf Jonathan Lemon 2019-06-06 21869 break; 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21870 case PTR_TO_BTF_ID: 6efe152d4061a8 Kumar Kartikeya Dwivedi 2022-04-25 21871 case PTR_TO_BTF_ID | PTR_UNTRUSTED: 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21872 /* PTR_TO_BTF_ID | MEM_ALLOC always has a valid lifetime, unlike 
282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21873 * PTR_TO_BTF_ID, and an active ref_obj_id, but the same cannot 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21874 * be said once it is marked PTR_UNTRUSTED, hence we must handle 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21875 * any faults for loads into such types. BPF_WRITE is disallowed 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21876 * for this case. 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21877 */ 282de143ead96a Kumar Kartikeya Dwivedi 2022-11-18 21878 case PTR_TO_BTF_ID | MEM_ALLOC | PTR_UNTRUSTED: f2362a57aefff5 Eduard Zingerman 2025-06-25 21879 case PTR_TO_MEM | MEM_RDONLY | PTR_UNTRUSTED: 27ae7997a66174 Martin KaFai Lau 2020-01-08 21880 if (type == BPF_READ) { 1f9a1ea821ff25 Yonghong Song 2023-07-27 21881 if (BPF_MODE(insn->code) == BPF_MEM) 27ae7997a66174 Martin KaFai Lau 2020-01-08 21882 insn->code = BPF_LDX | BPF_PROBE_MEM | 27ae7997a66174 Martin KaFai Lau 2020-01-08 21883 BPF_SIZE((insn)->code); 1f9a1ea821ff25 Yonghong Song 2023-07-27 21884 else 1f9a1ea821ff25 Yonghong Song 2023-07-27 21885 insn->code = BPF_LDX | BPF_PROBE_MEMSX | 1f9a1ea821ff25 Yonghong Song 2023-07-27 21886 BPF_SIZE((insn)->code); 27ae7997a66174 Martin KaFai Lau 2020-01-08 21887 env->prog->aux->num_exentries++; 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21888 } 2a02759ef5f8a3 Alexei Starovoitov 2019-10-15 21889 continue; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21890 case PTR_TO_ARENA: 6082b6c328b548 Alexei Starovoitov 2024-03-07 21891 if (BPF_MODE(insn->code) == BPF_MEMSX) { a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21892 if (!bpf_jit_supports_insn(insn, true)) { 6082b6c328b548 Alexei Starovoitov 2024-03-07 21893 verbose(env, "sign extending loads from arena are not supported yet\n"); 6082b6c328b548 Alexei Starovoitov 2024-03-07 21894 return -EOPNOTSUPP; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21895 } a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21896 insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32SX | BPF_SIZE(insn->code); a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21897 } else { 6082b6c328b548 Alexei Starovoitov 2024-03-07 21898 insn->code = BPF_CLASS(insn->code) | BPF_PROBE_MEM32 | BPF_SIZE(insn->code); a91ae3c8931164 Kumar Kartikeya Dwivedi 2025-09-23 21899 } 6082b6c328b548 Alexei Starovoitov 2024-03-07 21900 env->prog->aux->num_exentries++; 6082b6c328b548 Alexei Starovoitov 2024-03-07 21901 continue; c64b7983288e63 Joe Stringer 2018-10-02 21902 default: c64b7983288e63 Joe Stringer 2018-10-02 21903 continue; c64b7983288e63 Joe Stringer 2018-10-02 21904 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21905 31fd85816dbe3a Yonghong Song 2017-06-13 21906 ctx_field_size = env->insn_aux_data[i + delta].ctx_field_size; f96da09473b52c Daniel Borkmann 2017-07-02 21907 size = BPF_LDST_BYTES(insn); 1f1e864b65554e Yonghong Song 2023-07-27 21908 mode = BPF_MODE(insn->code); 31fd85816dbe3a Yonghong Song 2017-06-13 21909 31fd85816dbe3a Yonghong Song 2017-06-13 21910 /* If the read access is a narrower load of the field, 31fd85816dbe3a Yonghong Song 2017-06-13 21911 * convert to a 4/8-byte load, to minimum program type specific 31fd85816dbe3a Yonghong Song 2017-06-13 21912 * convert_ctx_access changes. If conversion is successful, 31fd85816dbe3a Yonghong Song 2017-06-13 21913 * we will apply proper mask to the result. 
31fd85816dbe3a Yonghong Song 2017-06-13 21914 */ f96da09473b52c Daniel Borkmann 2017-07-02 21915 is_narrower_load = size < ctx_field_size; 46f53a65d2de3e Andrey Ignatov 2018-11-10 21916 size_default = bpf_ctx_off_adjust_machine(ctx_field_size); 46f53a65d2de3e Andrey Ignatov 2018-11-10 21917 off = insn->off; 31fd85816dbe3a Yonghong Song 2017-06-13 21918 if (is_narrower_load) { f96da09473b52c Daniel Borkmann 2017-07-02 21919 u8 size_code; 31fd85816dbe3a Yonghong Song 2017-06-13 21920 f96da09473b52c Daniel Borkmann 2017-07-02 21921 if (type == BPF_WRITE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21922 verifier_bug(env, "narrow ctx access misconfigured"); fd508bde5d646f Luis Gerhorst 2025-06-03 21923 return -EFAULT; f96da09473b52c Daniel Borkmann 2017-07-02 21924 } f96da09473b52c Daniel Borkmann 2017-07-02 21925 f96da09473b52c Daniel Borkmann 2017-07-02 21926 size_code = BPF_H; 31fd85816dbe3a Yonghong Song 2017-06-13 21927 if (ctx_field_size == 4) 31fd85816dbe3a Yonghong Song 2017-06-13 21928 size_code = BPF_W; 31fd85816dbe3a Yonghong Song 2017-06-13 21929 else if (ctx_field_size == 8) 31fd85816dbe3a Yonghong Song 2017-06-13 21930 size_code = BPF_DW; f96da09473b52c Daniel Borkmann 2017-07-02 21931 bc23105ca0abde Daniel Borkmann 2018-06-02 21932 insn->off = off & ~(size_default - 1); 31fd85816dbe3a Yonghong Song 2017-06-13 21933 insn->code = BPF_LDX | BPF_MEM | size_code; 31fd85816dbe3a Yonghong Song 2017-06-13 21934 } f96da09473b52c Daniel Borkmann 2017-07-02 21935 f96da09473b52c Daniel Borkmann 2017-07-02 21936 target_size = 0; c64b7983288e63 Joe Stringer 2018-10-02 21937 cnt = convert_ctx_access(type, insn, insn_buf, env->prog, f96da09473b52c Daniel Borkmann 2017-07-02 21938 &target_size); 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21939 if (cnt == 0 || cnt >= INSN_BUF_SIZE || f96da09473b52c Daniel Borkmann 2017-07-02 21940 (ctx_field_size && !target_size)) { f914876eec9e72 Paul Chaignon 2025-08-01 21941 verifier_bug(env, "error during ctx access conversion (%d)", cnt); fd508bde5d646f Luis Gerhorst 2025-06-03 21942 return -EFAULT; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21943 } f96da09473b52c Daniel Borkmann 2017-07-02 21944 f96da09473b52c Daniel Borkmann 2017-07-02 21945 if (is_narrower_load && size < target_size) { d895a0f16fadb2 Ilya Leoshkevich 2019-08-16 21946 u8 shift = bpf_ctx_narrow_access_offset( d895a0f16fadb2 Ilya Leoshkevich 2019-08-16 21947 off, size, size_default) * 8; 6f606ffd6dd758 Martin KaFai Lau 2024-08-29 21948 if (shift && cnt + 1 >= INSN_BUF_SIZE) { 0df1a55afa832f Paul Chaignon 2025-07-01 21949 verifier_bug(env, "narrow ctx load misconfigured"); fd508bde5d646f Luis Gerhorst 2025-06-03 21950 return -EFAULT; d7af7e497f0308 Andrey Ignatov 2021-08-20 21951 } 46f53a65d2de3e Andrey Ignatov 2018-11-10 21952 if (ctx_field_size <= 4) { 46f53a65d2de3e Andrey Ignatov 2018-11-10 21953 if (shift) 46f53a65d2de3e Andrey Ignatov 2018-11-10 21954 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_RSH, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21955 insn->dst_reg, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21956 shift); 31fd85816dbe3a Yonghong Song 2017-06-13 21957 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, 31fd85816dbe3a Yonghong Song 2017-06-13 21958 (1 << size * 8) - 1); 46f53a65d2de3e Andrey Ignatov 2018-11-10 21959 } else { 46f53a65d2de3e Andrey Ignatov 2018-11-10 21960 if (shift) 46f53a65d2de3e Andrey Ignatov 2018-11-10 21961 insn_buf[cnt++] = BPF_ALU64_IMM(BPF_RSH, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21962 insn->dst_reg, 46f53a65d2de3e Andrey Ignatov 2018-11-10 21963 shift); 
0613d8ca9ab382 Will Deacon 2023-05-18 21964 insn_buf[cnt++] = BPF_ALU32_IMM(BPF_AND, insn->dst_reg, e2f7fc0ac6957c Krzesimir Nowak 2019-05-08 21965 (1ULL << size * 8) - 1); 31fd85816dbe3a Yonghong Song 2017-06-13 21966 } 46f53a65d2de3e Andrey Ignatov 2018-11-10 21967 } 1f1e864b65554e Yonghong Song 2023-07-27 21968 if (mode == BPF_MEMSX) 1f1e864b65554e Yonghong Song 2023-07-27 21969 insn_buf[cnt++] = BPF_RAW_INSN(BPF_ALU64 | BPF_MOV | BPF_X, 1f1e864b65554e Yonghong Song 2023-07-27 21970 insn->dst_reg, insn->dst_reg, 1f1e864b65554e Yonghong Song 2023-07-27 21971 size * 8, 0); 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21972 169c31761c8d7f Martin KaFai Lau 2024-08-29 21973 patch_insn_buf: 8041902dae5299 Alexei Starovoitov 2017-03-15 21974 new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21975 if (!new_prog) 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21976 return -ENOMEM; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21977 3df126f35f88dc Jakub Kicinski 2016-09-21 21978 delta += cnt - 1; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21979 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21980 /* keep walking new program and skip insns we just inserted */ 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21981 env->prog = new_prog; 3df126f35f88dc Jakub Kicinski 2016-09-21 21982 insn = new_prog->insnsi + i + delta; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21983 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21984 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21985 return 0; 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 @21986 } 9bac3d6d548e5c Alexei Starovoitov 2015-03-13 21987 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe 2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan ` (2 preceding siblings ...) 2025-11-15 8:18 ` kernel test robot @ 2025-11-16 1:15 ` kernel test robot 3 siblings, 0 replies; 22+ messages in thread From: kernel test robot @ 2025-11-16 1:15 UTC (permalink / raw) To: Puranjay Mohan, bpf Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team Hi Puranjay, kernel test robot noticed the following build errors: [auto build test ERROR on bpf-next/master] url: https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-arena-populate-vm_area-without-allocating-memory/20251114-192509 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master patch link: https://lore.kernel.org/r/20251114111700.43292-4-puranjay%40kernel.org patch subject: [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe config: sh-randconfig-r071-20251115 (https://download.01.org/0day-ci/archive/20251116/202511160836.5Ca6PimB-lkp@intel.com/config) compiler: sh4-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251116/202511160836.5Ca6PimB-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202511160836.5Ca6PimB-lkp@intel.com/ All errors (new ones prefixed by >>): sh4-linux-ld: kernel/bpf/verifier.o: in function `fixup_kfunc_call': >> kernel/bpf/verifier.c:22428:(.text+0x7748): undefined reference to `bpf_arena_free_pages_non_sleepable' sh4-linux-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_resume': drivers/net/phy/air_en8811h.c:1178:(.text+0x544): undefined reference to `clk_restore_context' sh4-linux-ld: drivers/net/phy/air_en8811h.o: in function `en8811h_suspend': drivers/net/phy/air_en8811h.c:1185:(.text+0x56c): undefined reference to `clk_save_context' sh4-linux-ld: drivers/media/i2c/tc358746.o: in function `tc358746_probe': drivers/media/i2c/tc358746.c:1585:(.text+0x1408): undefined reference to `devm_clk_hw_register' Kconfig warnings: (for reference only) WARNING: unmet direct dependencies detected for OF_GPIO Depends on [n]: GPIOLIB [=y] && OF [=n] && HAS_IOMEM [=y] Selected by [y]: - GPIO_TB10X [=y] && GPIOLIB [=y] && HAS_IOMEM [=y] && (ARC_PLAT_TB10X || COMPILE_TEST [=y]) WARNING: unmet direct dependencies detected for GPIO_SYSCON Depends on [n]: GPIOLIB [=y] && HAS_IOMEM [=y] && MFD_SYSCON [=y] && OF [=n] Selected by [y]: - GPIO_SAMA5D2_PIOBU [=y] && GPIOLIB [=y] && HAS_IOMEM [=y] && MFD_SYSCON [=y] && OF_GPIO [=y] && (ARCH_AT91 || COMPILE_TEST [=y]) WARNING: unmet direct dependencies detected for I2C_K1 Depends on [n]: I2C [=y] && HAS_IOMEM [=y] && (ARCH_SPACEMIT || COMPILE_TEST [=y]) && OF [=n] Selected by [y]: - MFD_SPACEMIT_P1 [=y] && HAS_IOMEM [=y] && (ARCH_SPACEMIT || COMPILE_TEST [=y]) && I2C [=y] vim +22428 kernel/bpf/verifier.c d2dcc67df910dd Dave Marchevsky 2023-04-15 22392 958cf2e273f092 Kumar Kartikeya Dwivedi 2022-11-18 22393 static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn, 958cf2e273f092 Kumar Kartikeya Dwivedi 2022-11-18 22394 struct bpf_insn *insn_buf, int insn_idx, int *cnt) e6ac2450d6dee3 Martin KaFai Lau 2021-03-24 22395 { d869d56ca84841 
	struct bpf_kfunc_desc *desc;
	int err;

	if (!insn->imm) {
		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
		return -EINVAL;
	}

	*cnt = 0;

	/* insn->imm has the btf func_id. Replace it with an offset relative to
	 * __bpf_call_base, unless the JIT needs to call functions that are
	 * further than 32 bits away (bpf_jit_supports_far_kfunc_call()).
	 */
	desc = find_kfunc_desc(env->prog, insn->imm, insn->off);
	if (!desc) {
		verifier_bug(env, "kernel function descriptor not found for func_id %u",
			     insn->imm);
		return -EFAULT;
	}

	err = specialize_kfunc(env, desc, insn_idx);
	if (err)
		return err;

	if (!bpf_jit_supports_far_kfunc_call())
		insn->imm = BPF_CALL_IMM(desc->addr);
	if (insn->off)
		return 0;
	if (desc->func_id == special_kfunc_list[KF_bpf_obj_new_impl] ||
	    desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };
		u64 obj_new_size = env->insn_aux_data[insn_idx].obj_new_size;

		if (desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_new_impl] && kptr_struct_meta) {
			verifier_bug(env, "NULL kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		insn_buf[0] = BPF_MOV64_IMM(BPF_REG_1, obj_new_size);
		insn_buf[1] = addr[0];
		insn_buf[2] = addr[1];
		insn_buf[3] = *insn;
		*cnt = 4;
	} else if (desc->func_id == special_kfunc_list[KF_bpf_obj_drop_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_drop_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		struct bpf_insn addr[2] = { BPF_LD_IMM64(BPF_REG_2, (long)kptr_struct_meta) };

		if (desc->func_id == special_kfunc_list[KF_bpf_percpu_obj_drop_impl] && kptr_struct_meta) {
			verifier_bug(env, "NULL kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		if (desc->func_id == special_kfunc_list[KF_bpf_refcount_acquire_impl] &&
		    !kptr_struct_meta) {
			verifier_bug(env, "kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		insn_buf[0] = addr[0];
		insn_buf[1] = addr[1];
		insn_buf[2] = *insn;
		*cnt = 3;
	} else if (desc->func_id == special_kfunc_list[KF_bpf_list_push_back_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_list_push_front_impl] ||
		   desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
		struct btf_struct_meta *kptr_struct_meta = env->insn_aux_data[insn_idx].kptr_struct_meta;
		int struct_meta_reg = BPF_REG_3;
		int node_offset_reg = BPF_REG_4;

		/* rbtree_add has extra 'less' arg, so args-to-fixup are in diff regs */
		if (desc->func_id == special_kfunc_list[KF_bpf_rbtree_add_impl]) {
			struct_meta_reg = BPF_REG_4;
			node_offset_reg = BPF_REG_5;
		}

		if (!kptr_struct_meta) {
			verifier_bug(env, "kptr_struct_meta expected at insn_idx %d",
				     insn_idx);
			return -EFAULT;
		}

		__fixup_collection_insert_kfunc(&env->insn_aux_data[insn_idx], struct_meta_reg,
						node_offset_reg, insn, insn_buf, cnt);
	} else if (desc->func_id == special_kfunc_list[KF_bpf_cast_to_kern_ctx] ||
		   desc->func_id == special_kfunc_list[KF_bpf_rdonly_cast]) {
		insn_buf[0] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_1);
		*cnt = 1;
	}

	if (env->insn_aux_data[insn_idx].arg_prog) {
		u32 regno = env->insn_aux_data[insn_idx].arg_prog;
		struct bpf_insn ld_addrs[2] = { BPF_LD_IMM64(regno, (long)env->prog->aux) };
		int idx = *cnt;

		insn_buf[idx++] = ld_addrs[0];
		insn_buf[idx++] = ld_addrs[1];
		insn_buf[idx++] = *insn;
		*cnt = idx;
	}
	return 0;
}

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
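To illustrate the specialize_kfunc() mechanism this series relies on for the
polymorphic bpf_arena_free_pages(), here is a standalone, simplified model
(not the kernel implementation; the struct layout, function names, and the
sleepable check are all illustrative). The idea is that the verifier's fixup
pass rewrites the call descriptor's target address based on properties of
the program, so one kfunc name can resolve to different implementations:

#include <stdio.h>
#include <stdbool.h>

typedef void (*kfunc_impl_t)(void);

/* toy stand-in for struct bpf_kfunc_desc */
struct kfunc_desc {
	kfunc_impl_t addr;
};

/* hypothetical sleepable and non-sleepable variants of a kfunc */
static void arena_free_sleepable(void) { puts("sleepable free path"); }
static void arena_free_nosleep(void)   { puts("deferred free path"); }

/* model of specialize_kfunc(): pick the implementation the JIT will call */
static void specialize_kfunc(struct kfunc_desc *desc, bool prog_sleepable)
{
	desc->addr = prog_sleepable ? arena_free_sleepable
				    : arena_free_nosleep;
}

int main(void)
{
	struct kfunc_desc desc;

	specialize_kfunc(&desc, false);
	desc.addr();	/* emulates the fixed-up kfunc call */
	return 0;
}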
* [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 11:16 [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
                   ` (2 preceding siblings ...)
  2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
@ 2025-11-14 11:16 ` Puranjay Mohan
  2025-11-14 22:18   ` Alexei Starovoitov
  3 siblings, 1 reply; 22+ messages in thread
From: Puranjay Mohan @ 2025-11-14 11:16 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Arena kfuncs can now be called from non-sleepable contexts. Test this by
adding non-sleepable copies of the tests in verifier_arena; these use a
socket program instead of a syscall program.

Add a new test case in verifier_arena_large to check that
bpf_arena_alloc_pages() works for more than 1024 pages.
1024 * sizeof(struct page *) is the upper limit of kmalloc_nolock(), but
bpf_arena_alloc_pages() should still succeed because it reuses the pages
array in a loop.

Augment the arena_list selftest to also run in a non-sleepable context by
taking an RCU read lock.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/arena_list.c     |  20 +-
 .../testing/selftests/bpf/progs/arena_list.c  |  11 ++
 .../selftests/bpf/progs/verifier_arena.c      | 185 ++++++++++++++++++
 .../bpf/progs/verifier_arena_large.c          |  24 +++
 4 files changed, 235 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/arena_list.c b/tools/testing/selftests/bpf/prog_tests/arena_list.c
index d15867cddde0..4f2866a615ce 100644
--- a/tools/testing/selftests/bpf/prog_tests/arena_list.c
+++ b/tools/testing/selftests/bpf/prog_tests/arena_list.c
@@ -27,17 +27,23 @@ static int list_sum(struct arena_list_head *head)
 	return sum;
 }
 
-static void test_arena_list_add_del(int cnt)
+static void test_arena_list_add_del(int cnt, bool nonsleepable)
 {
 	LIBBPF_OPTS(bpf_test_run_opts, opts);
 	struct arena_list *skel;
 	int expected_sum = (u64)cnt * (cnt - 1) / 2;
 	int ret, sum;
 
-	skel = arena_list__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "arena_list__open_and_load"))
+	skel = arena_list__open();
+	if (!ASSERT_OK_PTR(skel, "arena_list__open"))
 		return;
 
+	skel->rodata->nonsleepable = nonsleepable;
+
+	ret = arena_list__load(skel);
+	if (!ASSERT_OK(ret, "arena_list__load"))
+		goto out;
+
 	skel->bss->cnt = cnt;
 	ret = bpf_prog_test_run_opts(bpf_program__fd(skel->progs.arena_list_add), &opts);
 	ASSERT_OK(ret, "ret_add");
@@ -65,7 +71,11 @@ static void test_arena_list_add_del(int cnt)
 void test_arena_list(void)
 {
 	if (test__start_subtest("arena_list_1"))
-		test_arena_list_add_del(1);
+		test_arena_list_add_del(1, false);
 	if (test__start_subtest("arena_list_1000"))
-		test_arena_list_add_del(1000);
+		test_arena_list_add_del(1000, false);
+	if (test__start_subtest("arena_list_1_nonsleepable"))
+		test_arena_list_add_del(1, true);
+	if (test__start_subtest("arena_list_1000_nonsleepable"))
+		test_arena_list_add_del(1000, true);
 }
diff --git a/tools/testing/selftests/bpf/progs/arena_list.c b/tools/testing/selftests/bpf/progs/arena_list.c
index 3a2ddcacbea6..235d8cc95bdd 100644
--- a/tools/testing/selftests/bpf/progs/arena_list.c
+++ b/tools/testing/selftests/bpf/progs/arena_list.c
@@ -30,6 +30,7 @@ struct arena_list_head __arena *list_head;
 int list_sum;
 int cnt;
 bool skip = false;
+const volatile bool nonsleepable = false;
 
 #ifdef __BPF_FEATURE_ADDR_SPACE_CAST
 long __arena arena_sum;
@@ -42,6 +43,9 @@ int test_val SEC(".addr_space.1");
 
 int zero;
 
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+
 SEC("syscall")
 int arena_list_add(void *ctx)
 {
@@ -71,6 +75,10 @@ int arena_list_del(void *ctx)
 	struct elem __arena *n;
 	int sum = 0;
 
+	/* Take rcu_read_lock to test non-sleepable context */
+	if (nonsleepable)
+		bpf_rcu_read_lock();
+
 	arena_sum = 0;
 	list_for_each_entry(n, list_head, node) {
 		sum += n->value;
@@ -79,6 +87,9 @@ int arena_list_del(void *ctx)
 		bpf_free(n);
 	}
 	list_sum = sum;
+
+	if (nonsleepable)
+		bpf_rcu_read_unlock();
 #else
 	skip = true;
 #endif
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena.c b/tools/testing/selftests/bpf/progs/verifier_arena.c
index 7f4827eede3c..4a9d96344813 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena.c
@@ -21,6 +21,37 @@ struct {
 #endif
 } arena SEC(".maps");
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile int __arena *page1, *page2, *no_page;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	*page1 = 1;
+	page2 = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page2)
+		return 2;
+	*page2 = 2;
+	no_page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (no_page)
+		return 3;
+	if (*page1 != 1)
+		return 4;
+	if (*page2 != 2)
+		return 5;
+	bpf_arena_free_pages(&arena, (void __arena *)page2, 1);
+	if (*page1 != 1)
+		return 6;
+	if (*page2 != 0 && *page2 != 2) /* use-after-free should return 0 or the stored value */
+		return 7;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc1(void *ctx)
@@ -60,6 +91,44 @@ int basic_alloc1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_alloc2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	volatile char __arena *page1, *page2, *page3, *page4;
+
+	page1 = bpf_arena_alloc_pages(&arena, NULL, 2, NUMA_NO_NODE, 0);
+	if (!page1)
+		return 1;
+	page2 = page1 + __PAGE_SIZE;
+	page3 = page1 + __PAGE_SIZE * 2;
+	page4 = page1 - __PAGE_SIZE;
+	*page1 = 1;
+	*page2 = 2;
+	*page3 = 3;
+	*page4 = 4;
+	if (*page1 != 1)
+		return 1;
+	if (*page2 != 2)
+		return 2;
+	if (*page3 != 0)
+		return 3;
+	if (*page4 != 0)
+		return 4;
+	bpf_arena_free_pages(&arena, (void __arena *)page1, 2);
+	if (*page1 != 0 && *page1 != 1)
+		return 5;
+	if (*page2 != 0 && *page2 != 2)
+		return 6;
+	if (*page3 != 0)
+		return 7;
+	if (*page4 != 0)
+		return 8;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_alloc2(void *ctx)
@@ -102,6 +171,19 @@ struct bpf_arena___l {
 	struct bpf_map map;
 } __attribute__((preserve_access_index));
 
+SEC("socket")
+__success __retval(0) __log_level(2)
+int basic_alloc3_nosleep(void *ctx)
+{
+	struct bpf_arena___l *ar = (struct bpf_arena___l *)&arena;
+	volatile char __arena *pages;
+
+	pages = bpf_arena_alloc_pages(&ar->map, NULL, ar->map.max_entries, NUMA_NO_NODE, 0);
+	if (!pages)
+		return 1;
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0) __log_level(2)
 int basic_alloc3(void *ctx)
@@ -115,6 +197,38 @@ int basic_alloc3(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve1_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (!page)
+		return 1;
+
+	page += __PAGE_SIZE;
+
+	/* Reserve the second page */
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 2;
+
+	/* Try to explicitly allocate the reserved page. */
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 3;
+
+	/* Try to implicitly allocate the page (since there's only 2 of them). */
+	page = bpf_arena_alloc_pages(&arena, NULL, 1, NUMA_NO_NODE, 0);
+	if (page)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve1(void *ctx)
@@ -147,6 +261,26 @@ int basic_reserve1(void *ctx)
 	return 0;
 }
 
+SEC("socket")
+__success __retval(0)
+int basic_reserve2_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	page = bpf_arena_alloc_pages(&arena, page, 1, NUMA_NO_NODE, 0);
+	if ((u64)page)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int basic_reserve2(void *ctx)
@@ -168,6 +302,27 @@ int basic_reserve2(void *ctx)
 }
 
 /* Reserve the same page twice, should return -EBUSY. */
+SEC("socket")
+__success __retval(0)
+int reserve_twice_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret)
+		return 1;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 1);
+	if (ret != -EBUSY)
+		return 2;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_twice(void *ctx)
@@ -190,6 +345,36 @@ int reserve_twice(void *ctx)
 }
 
 /* Try to reserve past the end of the arena. */
+SEC("socket")
+__success __retval(0)
+int reserve_invalid_region_nosleep(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *page;
+	int ret;
+
+	/* Try a NULL pointer. */
+	ret = bpf_arena_reserve_pages(&arena, NULL, 3);
+	if (ret != -EINVAL)
+		return 1;
+
+	page = arena_base(&arena);
+
+	ret = bpf_arena_reserve_pages(&arena, page, 3);
+	if (ret != -EINVAL)
+		return 2;
+
+	ret = bpf_arena_reserve_pages(&arena, page, 4096);
+	if (ret != -EINVAL)
+		return 3;
+
+	ret = bpf_arena_reserve_pages(&arena, page, (1ULL << 32) - 1);
+	if (ret != -EINVAL)
+		return 4;
+#endif
+	return 0;
+}
+
 SEC("syscall")
 __success __retval(0)
 int reserve_invalid_region(void *ctx)
diff --git a/tools/testing/selftests/bpf/progs/verifier_arena_large.c b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
index f19e15400b3e..507cd489e3e2 100644
--- a/tools/testing/selftests/bpf/progs/verifier_arena_large.c
+++ b/tools/testing/selftests/bpf/progs/verifier_arena_large.c
@@ -270,5 +270,29 @@ int big_alloc2(void *ctx)
 		return 9;
 	return 0;
 }
+
+SEC("socket")
+__success __retval(0)
+int big_alloc3(void *ctx)
+{
+#if defined(__BPF_FEATURE_ADDR_SPACE_CAST)
+	char __arena *pages;
+	u64 i;
+
+	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
+	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
+	if (!pages)
+		return -1;
+
+	bpf_for(i, 0, 2051)
+		pages[i * PAGE_SIZE] = 123;
+	bpf_for(i, 0, 2051)
+		if (pages[i * PAGE_SIZE] != 123)
+			return i;
+
+	bpf_arena_free_pages(&arena, pages, 1025);
+#endif
+	return 0;
+}
 #endif
 
 char _license[] SEC("license") = "GPL";
-- 
2.47.1
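As context for what big_alloc3() exercises: patch 2 caps the temporary page
pointer array allocated with kmalloc_nolock() and reuses it across loop
iterations, so a request larger than the cap is served in chunks. The
following is a standalone, user-space sketch of that pattern, not the actual
arena.c code; ARRAY_CAP, alloc_in_chunks(), and the use of malloc() in place
of kmalloc_nolock() are all stand-ins:

#include <stdio.h>
#include <stdlib.h>

#define ARRAY_CAP 1024 /* stand-in for KMALLOC_MAX_CACHE_SIZE / sizeof(struct page *) */

static long alloc_in_chunks(long nr_pages)
{
	long cap = nr_pages < ARRAY_CAP ? nr_pages : ARRAY_CAP;
	void **pages = malloc(cap * sizeof(*pages)); /* stand-in for kmalloc_nolock() */
	long done = 0;

	if (!pages)
		return -1;
	while (done < nr_pages) {
		long left = nr_pages - done;
		long chunk = left < cap ? left : cap;

		/* here the kernel would fill pages[0..chunk-1] via
		 * alloc_pages_nolock() and insert them into the arena;
		 * the same array is simply reused on the next iteration
		 */
		printf("chunk of %ld pages\n", chunk);
		done += chunk;
	}
	free(pages);
	return done;
}

int main(void)
{
	/* a 2051-page request splits into 1024 + 1024 + 3 */
	return alloc_in_chunks(2051) == 2051 ? 0 : 1;
}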
* Re: [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 11:16 ` [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
@ 2025-11-14 22:18   ` Alexei Starovoitov
  2025-11-15  0:58     ` Puranjay Mohan
  0 siblings, 1 reply; 22+ messages in thread
From: Alexei Starovoitov @ 2025-11-14 22:18 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Kernel Team

On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> +
> +	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
> +	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);

Please explain the choice of 2051 a bit better.
I think you wanted to do 3 steps, with the last one not aligned to 1024?

> +	if (!pages)
> +		return -1;
> +
> +	bpf_for(i, 0, 2051)
> +		pages[i * PAGE_SIZE] = 123;
> +	bpf_for(i, 0, 2051)
> +		if (pages[i * PAGE_SIZE] != 123)
> +			return i;
> +
> +	bpf_arena_free_pages(&arena, pages, 1025);

Free less on purpose?
* Re: [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations
  2025-11-14 22:18   ` Alexei Starovoitov
@ 2025-11-15  0:58     ` Puranjay Mohan
  0 siblings, 0 replies; 22+ messages in thread
From: Puranjay Mohan @ 2025-11-15  0:58 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	Kernel Team

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Fri, Nov 14, 2025 at 3:17 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> +
>> +	/* Allocate 2051 pages (more than 1024) at once to test the limit of kmalloc_nolock() */
>> +	pages = bpf_arena_alloc_pages(&arena, NULL, 2051, NUMA_NO_NODE, 0);
>
> Please explain the choice of 2051 a bit better.
> I think you wanted to do 3 steps, with the last one not aligned to 1024?

Yes, I wanted to exercise the loop a couple of times and also do a final
iteration that is not aligned, to cover all edge cases. Will add a better
comment.

>> +	if (!pages)
>> +		return -1;
>> +
>> +	bpf_for(i, 0, 2051)
>> +		pages[i * PAGE_SIZE] = 123;
>> +	bpf_for(i, 0, 2051)
>> +		if (pages[i * PAGE_SIZE] != 123)
>> +			return i;
>> +
>> +	bpf_arena_free_pages(&arena, pages, 1025);
>
> Free less on purpose?

This should be 2051 too; I missed updating it here.
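For reference, how 2051 splits under a 1024-entry array cap (the cap value
is the assumption from patch 2) can be checked directly:

#include <stdio.h>

int main(void)
{
	int nr_pages = 2051, cap = 1024;

	/* 2051 = 2 * 1024 + 3: two full reuses of the array plus a short,
	 * unaligned tail, so the loop runs three times and the remainder
	 * handling is exercised as well.
	 */
	printf("%d full chunks + tail of %d pages\n",
	       nr_pages / cap, nr_pages % cap);
	return 0;
}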
end of thread, other threads:[~2025-11-16  1:16 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-14 11:16 [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs Puranjay Mohan
2025-11-14 11:16 ` [PATCH bpf-next v2 1/4] bpf: arena: populate vm_area without allocating memory Puranjay Mohan
2025-11-14 11:47   ` bot+bpf-ci
2025-11-14 14:57     ` Puranjay Mohan
2025-11-14 21:21       ` Alexei Starovoitov
2025-11-15  0:52         ` Puranjay Mohan
2025-11-15  1:26           ` Alexei Starovoitov
2025-11-14 11:16 ` [PATCH bpf-next v2 2/4] bpf: arena: use kmalloc_nolock() in place of kvcalloc() Puranjay Mohan
2025-11-14 11:39   ` bot+bpf-ci
2025-11-14 15:13     ` Puranjay Mohan
2025-11-14 21:25       ` Alexei Starovoitov
2025-11-14 11:16 ` [PATCH bpf-next v2 3/4] bpf: arena: make arena kfuncs any context safe Puranjay Mohan
2025-11-14 11:47   ` bot+bpf-ci
2025-11-14 15:28     ` Puranjay Mohan
2025-11-14 21:27       ` Alexei Starovoitov
2025-11-15  0:56         ` Puranjay Mohan
2025-11-15  1:28           ` Alexei Starovoitov
2025-11-15  8:18   ` kernel test robot
2025-11-16  1:15   ` kernel test robot
2025-11-14 11:16 ` [PATCH bpf-next v2 4/4] selftests: bpf: test non-sleepable arena allocations Puranjay Mohan
2025-11-14 22:18   ` Alexei Starovoitov
2025-11-15  0:58     ` Puranjay Mohan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox