* [PATCH bpf v3 0/3] bpf: Implement stack_map_get_build_id_offset_sleepable()
@ 2026-05-12 3:29 Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 1/3] bpf: Factor out stack_map_build_id_set_ip() in stackmap.c Ihor Solodrai
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Ihor Solodrai @ 2026-05-12 3:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Puranjay Mohan, Shakeel Butt, Mykyta Yatsenko, bpf, linux-kernel,
kernel-team
I noticed the following deadlock splat on bpf-next when running a
subset of selftests/bpf:
#526 uprobe_multi_test:OK
#28 bpf_obj_id:OK
[ 27.561465]
[ 27.561603] ======================================================
[ 27.561899] WARNING: possible circular locking dependency detected
[ 27.562193] 7.0.0-rc5-g0eeb0094ba03 #3 Tainted: G OE
[ 27.562523] ------------------------------------------------------
[ 27.562820] uprobe_multi/561 is trying to acquire lock:
[ 27.563071] ff1100010768ba18 (&sb->s_type->i_mutex_key#8){++++}-{4:4}, at: netfs_start_io_read+0x24/0x110
[ 27.563556]
[ 27.563556] but task is already holding lock:
[ 27.563845] ff1100010b12a640 (&mm->mmap_lock){++++}-{4:4}, at: stack_map_get_build_id_offset+0xfa/0x650
[ 27.564296]
[ 27.564296] which lock already depends on the new lock.
[ 27.564296]
[ 27.564696]
[ 27.564696] the existing dependency chain (in reverse order) is:
[ 27.565044]
[ 27.565044] -> #1 (&mm->mmap_lock){++++}-{4:4}:
[ 27.565340] __might_fault+0xa3/0x100
[ 27.565572] _copy_to_iter+0xdf/0x1520
[ 27.565786] copy_page_to_iter+0x1c6/0x2d0
[ 27.566011] filemap_read+0x5f7/0xbf0
[ 27.566218] netfs_file_read_iter+0x1ca/0x240
[ 27.566453] vfs_read+0x482/0x9e0
[ 27.566658] ksys_read+0x115/0x1f0
[ 27.566852] do_syscall_64+0xde/0x390
[ 27.567057] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 27.567322]
[ 27.567322] -> #0 (&sb->s_type->i_mutex_key#8){++++}-{4:4}:
[ 27.567701] __lock_acquire+0x1529/0x2a10
[ 27.567923] lock_acquire+0xd7/0x280
[ 27.568124] down_read_interruptible+0x55/0x340
[ 27.568368] netfs_start_io_read+0x24/0x110
[ 27.568622] netfs_file_read_iter+0x191/0x240
[ 27.568858] __kernel_read+0x401/0x850
[ 27.569068] freader_fetch+0x19b/0x790
[ 27.569279] __build_id_parse+0xde/0x580
[ 27.569528] stack_map_get_build_id_offset+0x248/0x650
[ 27.569802] __bpf_get_stack+0x653/0x890
[ 27.570019] bpf_get_stack_sleepable+0x1d/0x30
[ 27.570257] bpf_prog_51bb4c0f31ae471b_uprobe_sleepable+0x2d/0x42
[ 27.570594] uprobe_prog_run+0x425/0x870
[ 27.570816] uprobe_multi_link_handler+0x58/0xb0
[ 27.571062] handler_chain+0x234/0x1270
[ 27.571275] uprobe_notify_resume+0x4de/0xd60
[ 27.571542] exit_to_user_mode_loop+0xe0/0x480
[ 27.571785] exc_int3+0x1bd/0x260
[ 27.571973] asm_exc_int3+0x39/0x40
[ 27.572169]
[ 27.572169] other info that might help us debug this:
[ 27.572169]
[ 27.572571] Possible unsafe locking scenario:
[ 27.572571]
[ 27.572854] CPU0 CPU1
[ 27.573074] ---- ----
[ 27.573293] rlock(&mm->mmap_lock);
[ 27.573502] lock(&sb->s_type->i_mutex_key#8);
[ 27.573846] lock(&mm->mmap_lock);
[ 27.574135] rlock(&sb->s_type->i_mutex_key#8);
[ 27.574364]
[ 27.574364] *** DEADLOCK ***
[ 27.574364]
[ 27.574671] 3 locks held by uprobe_multi/561:
[ 27.574884] #0: ffffffff852e0038 (rcu_tasks_trace_srcu_struct){....}-{0:0}, at: uprobe_notify_resume+0x229/0xd60
[ 27.575371] #1: ffffffff852e0038 (rcu_tasks_trace_srcu_struct){....}-{0:0}, at: uprobe_prog_run+0x239/0x870
[ 27.575874] #2: ff1100010b12a640 (&mm->mmap_lock){++++}-{4:4}, at: stack_map_get_build_id_offset+0xfa/0x650
[ 27.576342]
[ 27.576342] stack backtrace:
[ 27.576578] CPU: 7 UID: 0 PID: 561 Comm: uprobe_multi Tainted: G OE 7.0.0-rc5-g0eeb0094ba03 #3 PREEMPT(full)
[ 27.576586] Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 27.576588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-5.el9 11/05/2023
[ 27.576591] Call Trace:
[ 27.576595] <TASK>
[ 27.576598] dump_stack_lvl+0x54/0x70
[ 27.576606] print_circular_bug+0x2e8/0x300
[ 27.576616] check_noncircular+0x12e/0x150
[ 27.576628] __lock_acquire+0x1529/0x2a10
[ 27.576642] ? __pfx_walk_pgd_range+0x10/0x10
[ 27.576651] ? srso_return_thunk+0x5/0x5f
[ 27.576656] ? lock_acquire+0xd7/0x280
[ 27.576661] ? avc_has_perm_noaudit+0x38/0x1f0
[ 27.576672] lock_acquire+0xd7/0x280
[ 27.576677] ? netfs_start_io_read+0x24/0x110
[ 27.576685] ? srso_return_thunk+0x5/0x5f
[ 27.576693] ? netfs_start_io_read+0x24/0x110
[ 27.576698] down_read_interruptible+0x55/0x340
[ 27.576702] ? netfs_start_io_read+0x24/0x110
[ 27.576707] ? __pfx_cmp_ex_search+0x10/0x10
[ 27.576716] netfs_start_io_read+0x24/0x110
[ 27.576723] netfs_file_read_iter+0x191/0x240
[ 27.576731] __kernel_read+0x401/0x850
[ 27.576741] ? __pfx___kernel_read+0x10/0x10
[ 27.576752] ? __pfx_fixup_exception+0x10/0x10
[ 27.576761] ? srso_return_thunk+0x5/0x5f
[ 27.576766] ? lock_is_held_type+0x76/0x100
[ 27.576773] ? srso_return_thunk+0x5/0x5f
[ 27.576777] ? mtree_range_walk+0x421/0x6c0
[ 27.576785] freader_fetch+0x19b/0x790
[ 27.576793] ? mt_find+0x14b/0x340
[ 27.576801] ? __pfx_freader_fetch+0x10/0x10
[ 27.576805] ? mt_find+0x29a/0x340
[ 27.576810] ? mt_find+0x14b/0x340
[ 27.576820] __build_id_parse+0xde/0x580
[ 27.576832] ? __pfx___build_id_parse+0x10/0x10
[ 27.576850] stack_map_get_build_id_offset+0x248/0x650
[ 27.576863] __bpf_get_stack+0x653/0x890
[ 27.576867] ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 27.576880] ? __pfx___bpf_get_stack+0x10/0x10
[ 27.576884] ? lock_acquire+0xd7/0x280
[ 27.576889] ? uprobe_prog_run+0x239/0x870
[ 27.576895] ? srso_return_thunk+0x5/0x5f
[ 27.576899] ? stack_trace_save+0xae/0x100
[ 27.576908] bpf_get_stack_sleepable+0x1d/0x30
[ 27.576913] bpf_prog_51bb4c0f31ae471b_uprobe_sleepable+0x2d/0x42
[ 27.576919] uprobe_prog_run+0x425/0x870
[ 27.576927] ? srso_return_thunk+0x5/0x5f
[ 27.576933] ? __pfx_uprobe_prog_run+0x10/0x10
[ 27.576937] ? uprobe_notify_resume+0x413/0xd60
[ 27.576952] uprobe_multi_link_handler+0x58/0xb0
[ 27.576960] handler_chain+0x234/0x1270
[ 27.576976] ? __pfx_handler_chain+0x10/0x10
[ 27.576988] ? srso_return_thunk+0x5/0x5f
[ 27.576992] ? __kasan_kmalloc+0x72/0x90
[ 27.576998] ? __pfx_ri_timer+0x10/0x10
[ 27.577003] ? timer_init_key+0xf5/0x200
[ 27.577013] uprobe_notify_resume+0x4de/0xd60
[ 27.577025] ? uprobe_notify_resume+0x229/0xd60
[ 27.577030] ? __pfx_uprobe_notify_resume+0x10/0x10
[ 27.577035] ? srso_return_thunk+0x5/0x5f
[ 27.577039] ? atomic_notifier_call_chain+0xfc/0x110
[ 27.577047] ? srso_return_thunk+0x5/0x5f
[ 27.577051] ? notify_die+0xe5/0x130
[ 27.577056] ? __pfx_notify_die+0x10/0x10
[ 27.577063] ? srso_return_thunk+0x5/0x5f
[ 27.577071] exit_to_user_mode_loop+0xe0/0x480
[ 27.577076] ? asm_exc_int3+0x39/0x40
[ 27.577083] exc_int3+0x1bd/0x260
[ 27.577090] asm_exc_int3+0x39/0x40
[ 27.577095] RIP: 0033:0x6c4180
[ 27.577100] Code: c6 05 0b 6f 32 00 01 5d c3 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa eb 8a 66 2e 0f 1f 84 00 00 00 00 00 <cc> 48 89 e5 31 c0 5d c3 0f 1f 84 00 00 00 00 00 55 48 89 e5 31 c0
[ 27.577104] RSP: 002b:00007ffcbf71ada8 EFLAGS: 00000246
[ 27.577109] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fec9163db7b
[ 27.577112] RDX: 00007ffcbf71adbf RSI: 0000000000001000 RDI: 00000000007f0000
[ 27.577115] RBP: 00007ffcbf71add0 R08: 00007fec91731330 R09: 00007fec9178d060
[ 27.577118] R10: 00007fec9153ad98 R11: 0000000000000202 R12: 00007ffcbf71af08
[ 27.577120] R13: 0000000000787760 R14: 00000000009eade8 R15: 00007fec917bf000
[ 27.577135] </TASK>
#54/1 build_id/nofault-paged-out:OK
It was reproducible, which allowed me to bisect it to commit 777a8560fd29
("lib/buildid: use __kernel_read() for sleepable context").
I then used Codex to analyze the splat, the patch and the relevant lore
discussions, and asked it to produce a patch draft, which served as a
starting point for this series.
I verified that the patch series fixes the deadlock splat.
---
v2->v3:
* Split patch #2 in two: stack_map_get_build_id_offset_sleepable()
implementation, and then introduce caching
* Drop taking mmap_lock if CONFIG_PER_VMA_LOCK=n, fall back to raw
IPs instead
* Cache vm_{start,end} in addition to prev_file (Mykyta)
v2: https://lore.kernel.org/bpf/20260409010604.1439087-1-ihor.solodrai@linux.dev/
v1->v2:
* Addressed feedback from Puranjay:
* split out a small refactoring patch
* use mmap_read_trylock()
* take into account CONFIG_PER_VMA_LOCK
* replace find_vma() with vma_lookup()
* cache prev_build_id to avoid re-parsing the same file
* Snapshot vm_pgoff and vm_start before unlocking (AI)
* To avoid repetitive unlocking statements, introduce struct
stack_map_vma_lock to hold relevant lock state info and add an
unlock helper
v1: https://lore.kernel.org/bpf/20260407223003.720428-1-ihor.solodrai@linux.dev/
---
Ihor Solodrai (3):
bpf: Factor out stack_map_build_id_set_ip() in stackmap.c
bpf: Avoid faultable build ID reads under mm locks
bpf: Cache build IDs in sleepable stackmap path
kernel/bpf/stackmap.c | 171 +++++++++++++++++++++++++++++++++++++++---
1 file changed, 162 insertions(+), 9 deletions(-)
--
2.54.0
* [PATCH bpf v3 1/3] bpf: Factor out stack_map_build_id_set_ip() in stackmap.c
2026-05-12 3:29 [PATCH bpf v3 0/3] bpf: Implement stack_map_get_build_id_offset_sleepable() Ihor Solodrai
@ 2026-05-12 3:29 ` Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 2/3] bpf: Avoid faultable build ID reads under mm locks Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 3/3] bpf: Cache build IDs in sleepable stackmap path Ihor Solodrai
2 siblings, 0 replies; 5+ messages in thread
From: Ihor Solodrai @ 2026-05-12 3:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Puranjay Mohan, Shakeel Butt, Mykyta Yatsenko, bpf, linux-kernel,
kernel-team
Factor out a small helper from stack_map_get_build_id_offset() in
preparation for adding a sleepable build ID resolution path.
No functional changes.
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
---
kernel/bpf/stackmap.c | 21 ++++++++++++---------
1 file changed, 12 insertions(+), 9 deletions(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index da3d328f5c15..4ef0fd06cea5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -152,6 +152,12 @@ static int fetch_build_id(struct vm_area_struct *vma, unsigned char *build_id, b
: build_id_parse_nofault(vma, build_id, NULL);
}
+static inline void stack_map_build_id_set_ip(struct bpf_stack_build_id *id)
+{
+ id->status = BPF_STACK_BUILD_ID_IP;
+ memset(id->build_id, 0, BUILD_ID_SIZE_MAX);
+}
+
/*
* Expects all id_offs[i].ip values to be set to correct initial IPs.
* They will be subsequently:
@@ -165,23 +171,21 @@ static int fetch_build_id(struct vm_area_struct *vma, unsigned char *build_id, b
static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
u32 trace_nr, bool user, bool may_fault)
{
- int i;
struct mmap_unlock_irq_work *work = NULL;
bool irq_work_busy = bpf_mmap_unlock_get_irq_work(&work);
+ bool has_user_ctx = user && current && current->mm;
struct vm_area_struct *vma, *prev_vma = NULL;
const char *prev_build_id;
+ int i;
/* If the irq_work is in use, fall back to report ips. Same
* fallback is used for kernel stack (!user) on a stackmap with
* build_id.
*/
- if (!user || !current || !current->mm || irq_work_busy ||
- !mmap_read_trylock(current->mm)) {
+ if (!has_user_ctx || irq_work_busy || !mmap_read_trylock(current->mm)) {
/* cannot access current->mm, fall back to ips */
- for (i = 0; i < trace_nr; i++) {
- id_offs[i].status = BPF_STACK_BUILD_ID_IP;
- memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
- }
+ for (i = 0; i < trace_nr; i++)
+ stack_map_build_id_set_ip(&id_offs[i]);
return;
}
@@ -196,8 +200,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
vma = find_vma(current->mm, ip);
if (!vma || fetch_build_id(vma, id_offs[i].build_id, may_fault)) {
/* per entry fall back to ips */
- id_offs[i].status = BPF_STACK_BUILD_ID_IP;
- memset(id_offs[i].build_id, 0, BUILD_ID_SIZE_MAX);
+ stack_map_build_id_set_ip(&id_offs[i]);
continue;
}
build_id_valid:
--
2.54.0
* [PATCH bpf v3 2/3] bpf: Avoid faultable build ID reads under mm locks
2026-05-12 3:29 [PATCH bpf v3 0/3] bpf: Implement stack_map_get_build_id_offset_sleepable() Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 1/3] bpf: Factor out stack_map_build_id_set_ip() in stackmap.c Ihor Solodrai
@ 2026-05-12 3:29 ` Ihor Solodrai
2026-05-12 4:16 ` bot+bpf-ci
2026-05-12 3:29 ` [PATCH bpf v3 3/3] bpf: Cache build IDs in sleepable stackmap path Ihor Solodrai
2 siblings, 1 reply; 5+ messages in thread
From: Ihor Solodrai @ 2026-05-12 3:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Puranjay Mohan, Shakeel Butt, Mykyta Yatsenko, bpf, linux-kernel,
kernel-team
Sleepable build ID parsing can block in __kernel_read() [1], so the
stackmap sleepable path must not call it while holding mmap_lock or a
per-VMA read lock.
The issue and the fix are conceptually similar to a recent procfs
patch [2].
Resolve each covered VMA with a stable read-side reference, preferring
lock_vma_under_rcu() and falling back to mmap_read_trylock() only long
enough to acquire the VMA read lock. Take a reference to the backing
file, drop the VMA lock, and then parse the build ID through
(sleepable) build_id_parse_file().
[1]: https://lore.kernel.org/all/20251218005818.614819-1-shakeel.butt@linux.dev/
[2]: https://lore.kernel.org/all/20260128183232.2854138-1-andrii@kernel.org/
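The resolution order described above (snapshot vm_pgoff/vm_start under the
VMA lock, drop the lock, then do the sleepable parse and compute the file
offset) can be modeled in user space. This is an illustrative sketch, not
kernel API: the Vma class, the PAGE_SHIFT value of 12 (4 KiB pages) and the
resolve() helper are all assumptions made for the example.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumption: 4 KiB pages

@dataclass
class Vma:
    vm_start: int   # mapping start address
    vm_end: int     # mapping end address (exclusive)
    vm_pgoff: int   # file offset of the mapping, in pages
    vm_file: str    # stands in for struct file *

def resolve(vma: Vma, ip: int) -> int:
    """Snapshot vm_pgoff/vm_start while the VMA 'lock' is held, then
    compute the file offset of ip after the lock is dropped, mirroring
    the order used in this patch."""
    assert vma.vm_start <= ip < vma.vm_end
    vm_pgoff, vm_start = vma.vm_pgoff, vma.vm_start  # snapshot before unlock
    # ... VMA lock dropped here; sleepable build_id_parse_file(file) runs ...
    return (vm_pgoff << PAGE_SHIFT) + ip - vm_start

# A text segment mapped at 0x400000 from file offset page 0:
vma = Vma(vm_start=0x400000, vm_end=0x500000, vm_pgoff=0, vm_file="a.out")
print(hex(resolve(vma, 0x401234)))  # prints 0x1234
```

The point of the snapshot is that vm_pgoff and vm_start must not be read
after the lock is dropped, since the VMA may change underneath.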
Fixes: 777a8560fd29 ("lib/buildid: use __kernel_read() for sleepable context")
Assisted-by: Codex:gpt-5.4
Suggested-by: Puranjay Mohan <puranjay@kernel.org>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
---
kernel/bpf/stackmap.c | 105 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 105 insertions(+)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 4ef0fd06cea5..c1e96df360c3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -9,6 +9,7 @@
#include <linux/perf_event.h>
#include <linux/btf_ids.h>
#include <linux/buildid.h>
+#include <linux/mmap_lock.h>
#include "percpu_freelist.h"
#include "mmap_unlock_work.h"
@@ -158,6 +159,105 @@ static inline void stack_map_build_id_set_ip(struct bpf_stack_build_id *id)
memset(id->build_id, 0, BUILD_ID_SIZE_MAX);
}
+struct stack_map_vma_lock {
+ bool vma_locked;
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+};
+
+static struct vm_area_struct *
+stack_map_lock_vma(struct stack_map_vma_lock *lock, unsigned long ip)
+{
+ struct mm_struct *mm = lock->mm;
+ struct vm_area_struct *vma;
+
+ if (WARN_ON_ONCE(!mm))
+ return NULL;
+
+ vma = lock_vma_under_rcu(mm, ip);
+ if (vma)
+ goto vma_locked;
+
+ if (!mmap_read_trylock(mm))
+ return NULL;
+
+ vma = vma_lookup(mm, ip);
+ if (!vma) {
+ mmap_read_unlock(mm);
+ return NULL;
+ }
+
+#ifdef CONFIG_PER_VMA_LOCK
+ if (!vma_start_read_locked(vma)) {
+ mmap_read_unlock(mm);
+ return NULL;
+ }
+ mmap_read_unlock(mm);
+#else
+ mmap_read_unlock(mm);
+ return NULL;
+#endif
+vma_locked:
+ lock->vma_locked = true;
+ lock->vma = vma;
+ return vma;
+}
+
+static void stack_map_unlock_vma(struct stack_map_vma_lock *lock)
+{
+ struct vm_area_struct *vma = lock->vma;
+
+ if (lock->vma_locked) {
+ if (WARN_ON_ONCE(!vma))
+ goto out;
+ vma_end_read(vma);
+ }
+out:
+ lock->vma_locked = false;
+ lock->vma = NULL;
+}
+
+static void stack_map_get_build_id_offset_sleepable(struct bpf_stack_build_id *id_offs,
+ u32 trace_nr)
+{
+ struct mm_struct *mm = current->mm;
+ struct stack_map_vma_lock lock = {
+ .vma_locked = false,
+ .vma = NULL,
+ .mm = mm,
+ };
+ unsigned long vm_pgoff, vm_start;
+ struct vm_area_struct *vma;
+ struct file *file;
+ u64 ip;
+
+ for (u32 i = 0; i < trace_nr; i++) {
+ ip = READ_ONCE(id_offs[i].ip);
+ vma = stack_map_lock_vma(&lock, ip);
+ if (!vma || !vma->vm_file) {
+ stack_map_build_id_set_ip(&id_offs[i]);
+ stack_map_unlock_vma(&lock);
+ continue;
+ }
+
+ file = get_file(vma->vm_file);
+ vm_pgoff = vma->vm_pgoff;
+ vm_start = vma->vm_start;
+ stack_map_unlock_vma(&lock);
+
+ /* build_id_parse_file() may block on filesystem reads */
+ if (build_id_parse_file(file, id_offs[i].build_id, NULL)) {
+ stack_map_build_id_set_ip(&id_offs[i]);
+ fput(file);
+ continue;
+ }
+ fput(file);
+
+ id_offs[i].offset = (vm_pgoff << PAGE_SHIFT) + ip - vm_start;
+ id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
+ }
+}
+
/*
* Expects all id_offs[i].ip values to be set to correct initial IPs.
* They will be subsequently:
@@ -178,6 +278,11 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
const char *prev_build_id;
int i;
+ if (may_fault && has_user_ctx) {
+ stack_map_get_build_id_offset_sleepable(id_offs, trace_nr);
+ return;
+ }
+
/* If the irq_work is in use, fall back to report ips. Same
* fallback is used for kernel stack (!user) on a stackmap with
* build_id.
--
2.54.0
* [PATCH bpf v3 3/3] bpf: Cache build IDs in sleepable stackmap path
2026-05-12 3:29 [PATCH bpf v3 0/3] bpf: Implement stack_map_get_build_id_offset_sleepable() Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 1/3] bpf: Factor out stack_map_build_id_set_ip() in stackmap.c Ihor Solodrai
2026-05-12 3:29 ` [PATCH bpf v3 2/3] bpf: Avoid faultable build ID reads under mm locks Ihor Solodrai
@ 2026-05-12 3:29 ` Ihor Solodrai
2 siblings, 0 replies; 5+ messages in thread
From: Ihor Solodrai @ 2026-05-12 3:29 UTC (permalink / raw)
To: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Eduard Zingerman, Kumar Kartikeya Dwivedi
Cc: Puranjay Mohan, Shakeel Butt, Mykyta Yatsenko, bpf, linux-kernel,
kernel-team
Stack traces often contain adjacent IPs from the same VMA or from
different VMAs backed by the same ELF file. Cache the last successfully
parsed build ID together with the resolved VMA range and backing file
so the sleepable build-ID path can avoid repeated VMA locking and file
parsing in common cases.
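The two cache levels described above can be sketched in user space: a
VMA-range fast path that skips the lookup entirely, and a same-backing-file
path that skips re-parsing. parse() stands in for build_id_parse_file();
all names and the VMA representation here are illustrative, not kernel API.

```python
parse_calls = 0

def parse(file):
    """Stand-in for the sleepable build_id_parse_file()."""
    global parse_calls
    parse_calls += 1
    return f"build-id-of-{file}"

def resolve_all(vmas, ips):
    """vmas: list of (start, end, file); ips: trace IPs.
    Returns a build ID per IP, parsing each backing file at most once
    for consecutive hits, as the patch's cache does."""
    cache = {"file": None, "build_id": None, "start": 0, "end": 0}
    out = []
    for ip in ips:
        # Range fast path: no VMA lookup ("lock") needed at all.
        if cache["build_id"] and cache["start"] <= ip < cache["end"]:
            out.append(cache["build_id"])
            continue
        vma = next(((s, e, f) for s, e, f in vmas if s <= ip < e), None)
        if not vma:
            out.append(None)  # fall back to reporting the raw IP
            continue
        start, end, file = vma
        if file != cache["file"]:  # same-file path skips re-parsing
            cache["file"] = file
            cache["build_id"] = parse(file)
        cache["start"], cache["end"] = start, end
        out.append(cache["build_id"])
    return out

# Two VMAs backed by the same ELF file, plus one distinct file:
vmas = [(0x1000, 0x2000, "libc"), (0x3000, 0x4000, "libc"), (0x5000, 0x6000, "app")]
ids = resolve_all(vmas, [0x1100, 0x1200, 0x3100, 0x5100])
print(ids, parse_calls)  # two distinct files -> only 2 parse() calls
```

Four IPs resolve with only two parses: the second IP hits the range fast
path, and the third reuses the build ID of the same backing file.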
Suggested-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
Signed-off-by: Ihor Solodrai <ihor.solodrai@linux.dev>
---
kernel/bpf/stackmap.c | 51 ++++++++++++++++++++++++++++++++++++++++---
1 file changed, 48 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index c1e96df360c3..318ce9ed0dd5 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -226,13 +226,34 @@ static void stack_map_get_build_id_offset_sleepable(struct bpf_stack_build_id *i
.vma = NULL,
.mm = mm,
};
- unsigned long vm_pgoff, vm_start;
+ struct {
+ struct file *file;
+ const char *build_id;
+ unsigned long vm_start;
+ unsigned long vm_end;
+ unsigned long vm_pgoff;
+ } cache = {};
+ unsigned long vm_pgoff, vm_start, vm_end;
struct vm_area_struct *vma;
struct file *file;
u64 ip;
for (u32 i = 0; i < trace_nr; i++) {
ip = READ_ONCE(id_offs[i].ip);
+
+ /*
+ * Range cache fast path: if ip falls within the previously
+ * resolved VMA range, reuse the cached build_id without
+ * re-acquiring the VMA lock.
+ */
+ if (cache.build_id && ip >= cache.vm_start && ip < cache.vm_end) {
+ memcpy(id_offs[i].build_id, cache.build_id, BUILD_ID_SIZE_MAX);
+ vm_start = cache.vm_start;
+ vm_end = cache.vm_end;
+ vm_pgoff = cache.vm_pgoff;
+ goto build_id_valid;
+ }
+
vma = stack_map_lock_vma(&lock, ip);
if (!vma || !vma->vm_file) {
stack_map_build_id_set_ip(&id_offs[i]);
@@ -240,9 +261,22 @@ static void stack_map_get_build_id_offset_sleepable(struct bpf_stack_build_id *i
continue;
}
- file = get_file(vma->vm_file);
+ file = vma->vm_file;
vm_pgoff = vma->vm_pgoff;
vm_start = vma->vm_start;
+ vm_end = vma->vm_end;
+
+ if (file == cache.file) {
+ /*
+ * Same backing file as previous (e.g. different VMAs
+ * of the same ELF binary). Reuse the cache build_id.
+ */
+ memcpy(id_offs[i].build_id, cache.build_id, BUILD_ID_SIZE_MAX);
+ stack_map_unlock_vma(&lock);
+ goto build_id_valid;
+ }
+
+ file = get_file(file);
stack_map_unlock_vma(&lock);
/* build_id_parse_file() may block on filesystem reads */
@@ -251,11 +285,22 @@ static void stack_map_get_build_id_offset_sleepable(struct bpf_stack_build_id *i
fput(file);
continue;
}
- fput(file);
+ if (cache.file)
+ fput(cache.file);
+ cache.file = file;
+ cache.build_id = id_offs[i].build_id;
+
+build_id_valid:
+ cache.vm_start = vm_start;
+ cache.vm_end = vm_end;
+ cache.vm_pgoff = vm_pgoff;
id_offs[i].offset = (vm_pgoff << PAGE_SHIFT) + ip - vm_start;
id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
}
+
+ if (cache.file)
+ fput(cache.file);
}
/*
--
2.54.0
* Re: [PATCH bpf v3 2/3] bpf: Avoid faultable build ID reads under mm locks
2026-05-12 3:29 ` [PATCH bpf v3 2/3] bpf: Avoid faultable build ID reads under mm locks Ihor Solodrai
@ 2026-05-12 4:16 ` bot+bpf-ci
0 siblings, 0 replies; 5+ messages in thread
From: bot+bpf-ci @ 2026-05-12 4:16 UTC (permalink / raw)
To: ihor.solodrai, ast, andrii, daniel, eddyz87, memxor
Cc: puranjay, shakeel.butt, mykyta.yatsenko5, bpf, linux-kernel,
kernel-team, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 4ef0fd06cea5..c1e96df360c3 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
[ ... ]
> @@ -158,6 +159,105 @@ static inline void stack_map_build_id_set_ip(struct bpf_stack_build_id *id)
> memset(id->build_id, 0, BUILD_ID_SIZE_MAX);
> }
>
> +struct stack_map_vma_lock {
> + bool vma_locked;
> + struct vm_area_struct *vma;
> + struct mm_struct *mm;
> +};
> +
> +static struct vm_area_struct *
> +stack_map_lock_vma(struct stack_map_vma_lock *lock, unsigned long ip)
> +{
> + struct mm_struct *mm = lock->mm;
> + struct vm_area_struct *vma;
> +
> + if (WARN_ON_ONCE(!mm))
> + return NULL;
> +
> + vma = lock_vma_under_rcu(mm, ip);
> + if (vma)
> + goto vma_locked;
> +
> + if (!mmap_read_trylock(mm))
> + return NULL;
> +
> + vma = vma_lookup(mm, ip);
> + if (!vma) {
> + mmap_read_unlock(mm);
> + return NULL;
> + }
> +
> +#ifdef CONFIG_PER_VMA_LOCK
> + if (!vma_start_read_locked(vma)) {
> + mmap_read_unlock(mm);
> + return NULL;
> + }
> + mmap_read_unlock(mm);
> +#else
> + mmap_read_unlock(mm);
> + return NULL;
> +#endif
> +vma_locked:
> + lock->vma_locked = true;
> + lock->vma = vma;
> + return vma;
> +}
> +
> +static void stack_map_unlock_vma(struct stack_map_vma_lock *lock)
> +{
> + struct vm_area_struct *vma = lock->vma;
> +
> + if (lock->vma_locked) {
> + if (WARN_ON_ONCE(!vma))
> + goto out;
> + vma_end_read(vma);
> + }
> +out:
> + lock->vma_locked = false;
> + lock->vma = NULL;
> +}
> +
> +static void stack_map_get_build_id_offset_sleepable(struct bpf_stack_build_id *id_offs,
> + u32 trace_nr)
> +{
> + struct mm_struct *mm = current->mm;
> + struct stack_map_vma_lock lock = {
> + .vma_locked = false,
> + .vma = NULL,
> + .mm = mm,
> + };
> + unsigned long vm_pgoff, vm_start;
> + struct vm_area_struct *vma;
> + struct file *file;
> + u64 ip;
> +
> + for (u32 i = 0; i < trace_nr; i++) {
> + ip = READ_ONCE(id_offs[i].ip);
> + vma = stack_map_lock_vma(&lock, ip);
> + if (!vma || !vma->vm_file) {
^^^^^^^^
A subsystem pattern flags this as potentially concerning: the code uses
!vma->vm_file to classify the VMA as anonymous. Per mm-vma.md's VMA
classification rules, the correct test for anonymous VMAs is
vma_is_anonymous() (i.e. !vma->vm_ops). Some anonymous VMAs carry a
vm_file reference (e.g. private mappings of /dev/zero set via
vma_set_anonymous() while leaving vm_file populated).
On such VMAs, this code would proceed down the file-backed path and hand
/dev/zero (or any other vma_set_anonymous()-style driver file) to
build_id_parse_file() instead of immediately falling back to reporting
the raw IP. build_id_parse_file() would fail the ELF-magic check and
return an error, so the practical effect is a wasted pinned-file
reference and a pointless sleepable read attempt. However, the mm
subsystem guide directs that this pattern be reported.
The existing non-sleepable stack_map_get_build_id_offset()/build_id_parse()
in this same subsystem also uses !vma->vm_file, so the new code is
consistent with pre-existing style.
> + stack_map_build_id_set_ip(&id_offs[i]);
> + stack_map_unlock_vma(&lock);
> + continue;
> + }
Mykyta Yatsenko suggested an optimization in v2 that has not been
implemented: cache vm_start and vm_end, and if the next IP is in range,
reuse the previous build ID without locking the VMA. This would avoid
repeated lock/unlock cycles for consecutive IPs in the same VMA.
The current v3 code only caches the file reference for same-file
optimization, not the VMA range. See the original discussion at
https://lore.kernel.org/bpf/8b2cf5ca-182e-4eb2-ab90-8fab9d81f6e1@gmail.com/
> +
> + file = get_file(vma->vm_file);
> + vm_pgoff = vma->vm_pgoff;
> + vm_start = vma->vm_start;
> + stack_map_unlock_vma(&lock);
> +
> + /* build_id_parse_file() may block on filesystem reads */
> + if (build_id_parse_file(file, id_offs[i].build_id, NULL)) {
> + stack_map_build_id_set_ip(&id_offs[i]);
> + fput(file);
> + continue;
> + }
> + fput(file);
> +
> + id_offs[i].offset = (vm_pgoff << PAGE_SHIFT) + ip - vm_start;
> + id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
> + }
> +}
> +
> /*
> * Expects all id_offs[i].ip values to be set to correct initial IPs.
> * They will be subsequently:
> @@ -178,6 +278,11 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
> const char *prev_build_id;
> int i;
>
> + if (may_fault && has_user_ctx) {
> + stack_map_get_build_id_offset_sleepable(id_offs, trace_nr);
> + return;
> + }
> +
> /* If the irq_work is in use, fall back to report ips. Same
> * fallback is used for kernel stack (!user) on a stackmap with
> * build_id.
Commit message:
> bpf: Avoid faultable build ID reads under mm locks
>
> Sleepable build ID parsing can block in __kernel_read() [1], so the
> stackmap sleepable path must not call it while holding mmap_lock or a
> per-VMA read lock.
> The issue and the fix are conceptually similar to a recent procfs
> patch [2].
> Resolve each covered VMA with a stable read-side reference, preferring
> lock_vma_under_rcu() and falling back to mmap_read_trylock() only long
> enough to acquire the VMA read lock. Take a reference to the backing
> file, drop the VMA lock, and then parse the build ID through
> (sleepable) build_id_parse_file().
> [1]: https://lore.kernel.org/all/20251218005818.614819-1-shakeel.butt@linux.dev/
> [2]: https://lore.kernel.org/all/20260128183232.2854138-1-andrii@kernel.org/
> Fixes: 777a8560fd29 ("lib/buildid: use __kernel_read() for sleepable context")
The existing Fixes tag points to commit 777a8560fd29 ("lib/buildid: use
__kernel_read() for sleepable context"), but the bug was actually
introduced by commit d4dd9775ec24 ("bpf: wire up sleepable
bpf_get_stack() and bpf_get_task_stack() helpers").
The d4dd9775ec24 commit introduced the pattern of calling potentially
sleepable build_id_parse() while holding mm locks via find_vma(). Commit
777a8560fd29 only made the latent blocking behavior more severe by using
__kernel_read(), but the architectural issue of calling sleepable
functions under locks was introduced earlier.
Should this be:
Fixes: d4dd9775ec24 ("bpf: wire up sleepable bpf_get_stack() and bpf_get_task_stack() helpers")
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/25711733137