From: Puranjay Mohan <puranjay@kernel.org>
To: bpf@vger.kernel.org
Cc: Puranjay Mohan <puranjay@kernel.org>,
Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
kernel-team@meta.com
Subject: [PATCH bpf-next v2 0/4] Remove KF_SLEEPABLE from arena kfuncs
Date: Fri, 14 Nov 2025 11:16:55 +0000
Message-ID: <20251114111700.43292-1-puranjay@kernel.org>
v1: https://lore.kernel.org/all/20251111163424.16471-1-puranjay@kernel.org/
Changes in v1->v2:
Patch 1:
- Import tlbflush.h to fix build issue in loongarch. (kernel
test robot)
- Fix unused variable error in apply_range_clear_cb() (kernel
test robot)
- Call bpf_map_area_free() on error path of
populate_pgtable_except_pte() (AI)
- Use PAGE_SIZE in apply_to_existing_page_range() (AI)
Patch 2:
- Cap allocation made by kmalloc_nolock() for pages array to
KMALLOC_MAX_CACHE_SIZE and reuse the array in an explicit loop
to overcome this limit. (AI)
Patch 3:
- Do page_ref_add(page, 1); under the spinlock to mitigate a
race (AI)
Patch 4:
- Add a new testcase big_alloc3() to verifier_arena_large.c that
tries to allocate a large number of pages at once. This is meant to
trigger the kmalloc_nolock() limit in Patch 2 and check that the
loop logic works correctly.
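As a rough userspace illustration of the Patch 2 item above (one capped scratch array reused across batches), consider the sketch below. MAX_BATCH, alloc_one() and alloc_in_batches() are hypothetical names, and malloc() merely stands in for kmalloc_nolock(); this is not the code from the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical cap on scratch-array entries. In the series the cap is
 * derived from KMALLOC_MAX_CACHE_SIZE; 1024 is only an assumption here. */
#define MAX_BATCH 1024

/* Hypothetical stand-in for allocating one page. */
static void *alloc_one(void)
{
	return malloc(1);
}

/*
 * Allocate `total` items using a scratch array of at most MAX_BATCH
 * entries, reusing the same array for every batch. Returns the number
 * of items handled; *batches receives the number of loop iterations.
 */
static size_t alloc_in_batches(size_t total, size_t *batches)
{
	size_t cap = total < MAX_BATCH ? total : MAX_BATCH;
	void **scratch;
	size_t done = 0;

	*batches = 0;
	if (cap == 0)
		return 0;

	scratch = malloc(cap * sizeof(*scratch));
	if (!scratch)
		return 0;

	while (done < total) {
		size_t want = total - done < cap ? total - done : cap;
		size_t got = 0;
		size_t i;

		/* Fill the scratch array for this batch only. */
		while (got < want && (scratch[got] = alloc_one()) != NULL)
			got++;

		/* Real code would map the batch here; the sketch frees it. */
		for (i = 0; i < got; i++)
			free(scratch[i]);

		done += got;
		(*batches)++;
		if (got < want)
			break; /* an allocation failed, stop early */
	}

	free(scratch);
	return done;
}
```

With the assumed cap of 1024, a request for 2500 items is handled in three passes over the same array (1024, 1024 and 452 entries).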
This set allows arena kfuncs to be called from non-sleepable contexts.
It is achieved by the following changes:
The range_tree is now protected with an rqspinlock instead of a mutex.
This change alone is enough to make bpf_arena_reserve_pages() safe in
any context.
bpf_arena_alloc_pages() had four points where it could sleep:
1. Mutex to protect range_tree: now replaced with rqspinlock
2. kvcalloc() for allocations: now replaced with kmalloc_nolock()
3. Allocating pages with bpf_map_alloc_pages(): this already calls
alloc_pages_nolock() in non-sleepable contexts and therefore is safe.
4. Setting up kernel page tables with vm_area_map_pages():
vm_area_map_pages() may allocate memory while inserting pages into
bpf arena's vm_area. Now, all page table levels except the last one
are populated at arena creation time. When new pages need to be
inserted, apply_to_page_range() is called again; it only does
set_pte_at() for those pages and does not allocate memory.
The above four changes make bpf_arena_alloc_pages() any context safe.
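To picture point 4, here is a toy two-level table in userspace C. It only models the idea (allocate the intermediate level once at creation, then touch only last-level entries later); all names (toy_pgtable, toy_populate_except_pte(), toy_set_pte(), ...) and sizes are hypothetical and unrelated to the kernel's page table API.

```c
#include <stddef.h>
#include <stdlib.h>

#define DIR_ENTRIES 4 /* hypothetical: entries in the upper level */
#define PTE_ENTRIES 8 /* hypothetical: entries per last-level table */

struct toy_pgtable {
	unsigned long *dir[DIR_ENTRIES]; /* upper level: last-level tables */
	unsigned int nr_allocs;          /* allocations performed so far */
};

/* Creation time: allocate every last-level table up front. */
static int toy_populate_except_pte(struct toy_pgtable *pt)
{
	size_t i;

	pt->nr_allocs = 0;
	for (i = 0; i < DIR_ENTRIES; i++)
		pt->dir[i] = NULL;

	for (i = 0; i < DIR_ENTRIES; i++) {
		pt->dir[i] = calloc(PTE_ENTRIES, sizeof(unsigned long));
		if (!pt->dir[i]) {
			/* Error path: release what was already allocated. */
			while (i--) {
				free(pt->dir[i]);
				pt->dir[i] = NULL;
			}
			pt->nr_allocs = 0;
			return -1;
		}
		pt->nr_allocs++;
	}
	return 0;
}

/* Later: only write the last-level entry, never allocate. */
static int toy_set_pte(struct toy_pgtable *pt, size_t pgoff, unsigned long val)
{
	if (pgoff >= DIR_ENTRIES * PTE_ENTRIES)
		return -1;
	pt->dir[pgoff / PTE_ENTRIES][pgoff % PTE_ENTRIES] = val;
	return 0;
}

static unsigned long toy_get_pte(const struct toy_pgtable *pt, size_t pgoff)
{
	return pt->dir[pgoff / PTE_ENTRIES][pgoff % PTE_ENTRIES];
}

static void toy_destroy(struct toy_pgtable *pt)
{
	size_t i;

	for (i = 0; i < DIR_ENTRIES; i++) {
		free(pt->dir[i]);
		pt->dir[i] = NULL;
	}
}
```

In this model nr_allocs stays unchanged across toy_set_pte() calls, which mirrors why the insertion path no longer needs to sleep.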
bpf_arena_free_pages() has to do the following steps:
1. Update the range_tree
2. vm_area_unmap_pages(): to unmap pages from kernel vm_area
3. Flush the TLB: already done in step 2.
4. zap_pages(): to unmap pages from user page tables
5. free pages.
The third patch in this set makes bpf_arena_free_pages() polymorphic using
the specialize_kfunc() mechanism. When called from a sleepable context,
arena_free_pages() remains mostly unchanged except for the following:
1. rqspinlock is taken now instead of the mutex for the range tree
2. Instead of using vm_area_unmap_pages() that can free intermediate page
table levels, apply_to_existing_page_range() with a callback is used
that only does pte_clear() on the last level and leaves the intermediate
page table levels intact. This is needed to make sure that
bpf_arena_alloc_pages() can safely do set_pte_at() without allocating
intermediate page tables.
When arena_free_pages() is called from a non-sleepable context, or when it
fails to acquire the rqspinlock in the sleepable case, a lock-less list of
struct arena_free_span is used to queue the uaddr and page cnt.
kmalloc_nolock() is used to allocate this arena_free_span. This allocation
can fail, but that is the trade-off we need to make for frees done from
non-sleepable contexts.
arena_free_pages() then raises an irq_work whose handler in turn schedules
work that iterates this list and clears PTEs, flushes TLBs, zaps pages, and
frees pages for the queued uaddrs and page cnts.
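The queue-then-drain pattern described above can be sketched in userspace with C11 atomics. The kernel side would use llist_add() and llist_del_all() together with irq_work and a workqueue; the sketch models only the lock-less push and the "take everything" drain. The names free_span, span_list, span_defer() and span_drain() are hypothetical, and malloc() stands in for kmalloc_nolock().

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical analogue of struct arena_free_span. */
struct free_span {
	struct free_span *next;
	uint64_t uaddr;
	uint32_t page_cnt;
};

struct span_list {
	_Atomic(struct free_span *) first;
};

static void span_list_init(struct span_list *l)
{
	atomic_init(&l->first, NULL);
}

/* Producer side: queue one span without taking a lock. */
static int span_defer(struct span_list *l, uint64_t uaddr, uint32_t page_cnt)
{
	struct free_span *s = malloc(sizeof(*s));
	struct free_span *old;

	if (!s)
		return -1; /* allocation may fail; the free is then lost */

	s->uaddr = uaddr;
	s->page_cnt = page_cnt;

	old = atomic_load(&l->first);
	do {
		s->next = old; /* old is refreshed by a failed CAS */
	} while (!atomic_compare_exchange_weak(&l->first, &old, s));

	return 0;
}

/*
 * Worker side: detach the whole list in one atomic step, then walk it.
 * Returns the total number of pages covered by the drained spans.
 */
static uint64_t span_drain(struct span_list *l)
{
	struct free_span *s = atomic_exchange(&l->first, (struct free_span *)NULL);
	uint64_t pages = 0;

	while (s) {
		struct free_span *next = s->next;

		/* Real code would clear PTEs, flush TLBs, zap and free here. */
		pages += s->page_cnt;
		free(s);
		s = next;
	}
	return pages;
}
```

Detaching the whole list with a single exchange means producers can keep queuing new spans while the worker is still processing the previous batch.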
apply_range_clear_cb() with apply_to_existing_page_range() is used to
clear PTEs and collect the pages to be freed; the struct llist_node
pcp_llist field in struct page is used to chain these pages.
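A simplified userspace model of that callback pattern follows: each existing last-level entry in a range is visited, cleared, and its page is chained through a link embedded in the page itself, so no extra memory is needed to collect the pages. toy_page, toy_clear_cb() and toy_clear_range() are hypothetical names; the next_free member plays the role that pcp_llist plays in struct page.

```c
#include <stddef.h>

struct toy_page {
	struct toy_page *next_free; /* stands in for pcp_llist in struct page */
	int id;
};

/* Callback: clear one last-level entry and chain its page, if any. */
static void toy_clear_cb(struct toy_page **pte, struct toy_page **free_list)
{
	struct toy_page *page = *pte;

	if (!page)
		return; /* nothing mapped at this entry */

	*pte = NULL; /* the pte_clear() analogue */
	page->next_free = *free_list;
	*free_list = page;
}

/*
 * Visit entries [start, start + nr) of an already existing table and
 * return the list of pages that were unmapped. Intermediate levels are
 * not modelled; they are simply never touched.
 */
static struct toy_page *toy_clear_range(struct toy_page **ptes,
					size_t start, size_t nr)
{
	struct toy_page *free_list = NULL;
	size_t i;

	for (i = start; i < start + nr; i++)
		toy_clear_cb(&ptes[i], &free_list);

	return free_list;
}

static size_t toy_list_len(const struct toy_page *p)
{
	size_t n = 0;

	for (; p; p = p->next_free)
		n++;
	return n;
}
```

Entries outside the requested range and entries that were never populated are left as they are.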
NOTE: The arena list selftest fails to load on s390x. This is due to an
unrelated bug in the verifier that is exposed by the selftest that
I add in this set. I have already sent a patch[1] to fix this.
[1] https://lore.kernel.org/all/20251111160949.45623-1-puranjay@kernel.org/
Puranjay Mohan (4):
bpf: arena: populate vm_area without allocating memory
bpf: arena: use kmalloc_nolock() in place of kvcalloc()
bpf: arena: make arena kfuncs any context safe
selftests: bpf: test non-sleepable arena allocations
include/linux/bpf.h | 2 +
kernel/bpf/arena.c | 350 +++++++++++++++---
kernel/bpf/verifier.c | 5 +
.../selftests/bpf/prog_tests/arena_list.c | 20 +-
.../testing/selftests/bpf/progs/arena_list.c | 11 +
.../selftests/bpf/progs/verifier_arena.c | 185 +++++++++
.../bpf/progs/verifier_arena_large.c | 24 ++
7 files changed, 541 insertions(+), 56 deletions(-)
--
2.47.1