public inbox for bpf@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator
@ 2026-03-04 14:20 Puranjay Mohan
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-04 14:20 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

This series fixes the mm lifecycle handling in the open-coded task_vma
BPF iterator and switches it from mmap_lock to per-VMA locking to reduce
contention. It then fixes a deadlock caused by holding locks across the
body of the iterator, where faulting is allowed.

Patch 1 fixes a missing mmget() that allows the mm_struct to be freed
before the iterator takes mmap_lock. It adds mmget_not_zero() and
introduces an NMI-safe mmput path using per-CPU irq_work, following the
existing mmap_unlock irq_work pattern.

Patch 2 switches from holding mmap_lock for the entire iteration to
per-VMA locking via lock_vma_under_rcu(). This alone does not fix the
deadlock, because holding the per-VMA lock for the whole iteration can
still cause lock-ordering issues when a faultable helper is called in
the body of the iterator.

Patch 3 resolves the lock ordering problems caused by holding the
per-VMA lock or the mmap_lock (not applicable after patch 2) across BPF
program execution.  It snapshots VMA fields under the lock, then drops
the lock before returning to the BPF program. File references are
managed via get_file()/fput() across iterations.

Puranjay Mohan (3):
  bpf: fix mm lifecycle in open-coded task_vma iterator
  bpf: switch task_vma iterator from mmap_lock to per-VMA locks
  bpf: return VMA snapshot from task_vma iterator

 kernel/bpf/task_iter.c | 136 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 125 insertions(+), 11 deletions(-)


base-commit: 3ebc98c1ae7efda949a015990280a097f4a5453a
-- 
2.47.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
@ 2026-03-04 14:20 ` Puranjay Mohan
  2026-03-05  8:55   ` kernel test robot
                     ` (3 more replies)
  2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
  2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
  2 siblings, 4 replies; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-04 14:20 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

The open-coded task_vma BPF iterator reads task->mm and acquires
mmap_read_trylock() but never calls mmget(). This violates refcount
discipline: the mm can reach mm_users == 0 if the task exits while the
iterator holds the lock.

Add mmget_not_zero() before mmap_read_trylock(). On the error path
after mmget succeeds, the mm reference must be dropped. mmput() can
sleep (exit_mmap, etc.) so it is unsuitable from BPF context.
mmput_async() is safe from hardirq but not from NMI, because
schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
which can deadlock if the NMI interrupted code holding that lock.

Add a dedicated per-CPU irq_work (bpf_iter_mmput_work) and a helper
bpf_iter_mmput() that calls mmput_async() directly when not in NMI,
or defers to the irq_work callback when in NMI context. Use it in
both the _new() error path and _destroy(). Add bpf_iter_mmput_busy()
to check irq_work slot availability, and use it alongside
bpf_mmap_unlock_get_irq_work() in _new() to verify both slots are
free before acquiring references.

Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/task_iter.c | 77 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 98d9b4c0daff..d3fa8ba0a896 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -813,6 +813,55 @@ struct bpf_iter_task_vma_kern {
 	struct bpf_iter_task_vma_kern_data *data;
 } __attribute__((aligned(8)));
 
+/*
+ * Per-CPU irq_work for NMI-safe mmput.
+ *
+ * mmput_async() is safe from hardirq context but not from NMI, because
+ * schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
+ * which can deadlock if the NMI interrupted code holding that lock.
+ *
+ * This dedicated irq_work defers mmput to hardirq context where
+ * mmput_async() is safe. BPF programs are non-preemptible, so one
+ * slot per CPU is sufficient.
+ */
+struct bpf_iter_mmput_irq_work {
+	struct irq_work irq_work;
+	struct mm_struct *mm;
+};
+
+static DEFINE_PER_CPU(struct bpf_iter_mmput_irq_work, bpf_iter_mmput_work);
+
+static void do_bpf_iter_mmput(struct irq_work *entry)
+{
+	struct bpf_iter_mmput_irq_work *work;
+
+	work = container_of(entry, struct bpf_iter_mmput_irq_work, irq_work);
+	if (work->mm) {
+		mmput_async(work->mm);
+		work->mm = NULL;
+	}
+}
+
+static void bpf_iter_mmput(struct mm_struct *mm)
+{
+	if (!in_nmi()) {
+		mmput_async(mm);
+	} else {
+		struct bpf_iter_mmput_irq_work *work;
+
+		work = this_cpu_ptr(&bpf_iter_mmput_work);
+		work->mm = mm;
+		irq_work_queue(&work->irq_work);
+	}
+}
+
+static bool bpf_iter_mmput_busy(void)
+{
+	if (!in_nmi())
+		return false;
+	return irq_work_is_busy(&this_cpu_ptr(&bpf_iter_mmput_work)->irq_work);
+}
+
 __bpf_kfunc_start_defs();
 
 __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
@@ -840,19 +889,35 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		goto err_cleanup_iter;
 	}
 
-	/* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
+	/*
+	 * Check irq_work availability for both mmap_lock release and mmput.
+	 * Both use separate per-CPU irq_work slots, and both must be free
+	 * to guarantee _destroy() can complete from NMI context.
+	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
+	 */
 	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
-	if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
+	if (irq_work_busy || bpf_iter_mmput_busy()) {
 		err = -EBUSY;
 		goto err_cleanup_iter;
 	}
 
+	if (!mmget_not_zero(kit->data->mm)) {
+		err = -ENOENT;
+		goto err_cleanup_iter;
+	}
+
+	if (!mmap_read_trylock(kit->data->mm)) {
+		err = -EBUSY;
+		goto err_cleanup_mmget;
+	}
+
 	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
 	return 0;
 
+err_cleanup_mmget:
+	bpf_iter_mmput(kit->data->mm);
 err_cleanup_iter:
-	if (kit->data->task)
-		put_task_struct(kit->data->task);
+	put_task_struct(kit->data->task);
 	bpf_mem_free(&bpf_global_ma, kit->data);
 	/* NULL kit->data signals failed bpf_iter_task_vma initialization */
 	kit->data = NULL;
@@ -874,6 +939,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 
 	if (kit->data) {
 		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
+		bpf_iter_mmput(kit->data->mm);
 		put_task_struct(kit->data->task);
 		bpf_mem_free(&bpf_global_ma, kit->data);
 	}
@@ -1044,12 +1110,15 @@ static void do_mmap_read_unlock(struct irq_work *entry)
 
 static int __init task_iter_init(void)
 {
+	struct bpf_iter_mmput_irq_work *mmput_work;
 	struct mmap_unlock_irq_work *work;
 	int ret, cpu;
 
 	for_each_possible_cpu(cpu) {
 		work = per_cpu_ptr(&mmap_unlock_work, cpu);
 		init_irq_work(&work->irq_work, do_mmap_read_unlock);
+		mmput_work = per_cpu_ptr(&bpf_iter_mmput_work, cpu);
+		init_irq_work(&mmput_work->irq_work, do_bpf_iter_mmput);
 	}
 
 	task_reg_info.ctx_arg_info[0].btf_id = btf_tracing_ids[BTF_TRACING_TYPE_TASK];
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
  2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
@ 2026-03-04 14:20 ` Puranjay Mohan
  2026-03-05 18:47   ` Mykyta Yatsenko
  2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
  2 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-04 14:20 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

The open-coded task_vma BPF iterator holds mmap_lock for the entire
duration of the iteration, adding to contention on this already heavily
contended lock.

Switch to per-VMA locking. In _next(), the next VMA is found via an
RCU-protected maple tree walk, then locked with lock_vma_under_rcu()
at its vm_start address. lock_next_vma() is not used because its
fallback path takes mmap_read_lock(), and the iterator must work in
non-sleepable contexts.

Between the RCU walk and the lock attempt, the VMA may be removed,
shrunk, or write-locked. When lock_vma_under_rcu() fails or the locked
VMA was modified, the iterator advances past it and retries using vm_end
saved from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU,
individual objects can be freed and immediately reused within an RCU
critical section. A VMA found by the maple tree walk may be recycled for
a different mm before its fields are read, making the captured vm_end
stale. When vm_end is stale and no longer ahead of the iteration
position, the iterator falls back to PAGE_SIZE advancement to guarantee
forward progress. VMAs inserted in gaps between iterations cannot be
detected without mmap_lock speculation.

The mm_struct is kept alive with mmget()/bpf_iter_mmput(). The
bpf_mmap_unlock_get_irq_work() check is no longer needed since
mmap_lock is no longer held; bpf_iter_mmput_busy() remains to guard the
mmput irq_work slot. CONFIG_PER_VMA_LOCK is required; -EOPNOTSUPP is
returned without it.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/task_iter.c | 72 +++++++++++++++++++++++++++++++-----------
 1 file changed, 53 insertions(+), 19 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index d3fa8ba0a896..ff29d4da0267 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -9,6 +9,7 @@
 #include <linux/bpf_mem_alloc.h>
 #include <linux/btf_ids.h>
 #include <linux/mm_types.h>
+#include <linux/mmap_lock.h>
 #include "mmap_unlock_work.h"
 
 static const char * const iter_task_type_names[] = {
@@ -797,8 +798,8 @@ const struct bpf_func_proto bpf_find_vma_proto = {
 struct bpf_iter_task_vma_kern_data {
 	struct task_struct *task;
 	struct mm_struct *mm;
-	struct mmap_unlock_irq_work *work;
-	struct vma_iterator vmi;
+	struct vm_area_struct *locked_vma;
+	u64 last_addr;
 };
 
 struct bpf_iter_task_vma {
@@ -868,12 +869,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 				      struct task_struct *task, u64 addr)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
-	bool irq_work_busy = false;
 	int err;
 
 	BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
 	BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
 
+	if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
+		kit->data = NULL;
+		return -EOPNOTSUPP;
+	}
+
 	/* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
 	 * before, so non-NULL kit->data doesn't point to previously
 	 * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
@@ -890,13 +895,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 	}
 
 	/*
-	 * Check irq_work availability for both mmap_lock release and mmput.
-	 * Both use separate per-CPU irq_work slots, and both must be free
-	 * to guarantee _destroy() can complete from NMI context.
-	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
+	 * Ensure the mmput irq_work slot is available so _destroy() can
+	 * safely drop the mm reference from NMI context.
 	 */
-	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
-	if (irq_work_busy || bpf_iter_mmput_busy()) {
+	if (bpf_iter_mmput_busy()) {
 		err = -EBUSY;
 		goto err_cleanup_iter;
 	}
@@ -906,16 +908,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		goto err_cleanup_iter;
 	}
 
-	if (!mmap_read_trylock(kit->data->mm)) {
-		err = -EBUSY;
-		goto err_cleanup_mmget;
-	}
-
-	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
+	kit->data->locked_vma = NULL;
+	kit->data->last_addr = addr;
 	return 0;
 
-err_cleanup_mmget:
-	bpf_iter_mmput(kit->data->mm);
 err_cleanup_iter:
 	put_task_struct(kit->data->task);
 	bpf_mem_free(&bpf_global_ma, kit->data);
@@ -927,10 +923,47 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
+	struct vm_area_struct *vma;
+	struct vma_iterator vmi;
+	unsigned long next_addr, next_end;
 
 	if (!kit->data) /* bpf_iter_task_vma_new failed */
 		return NULL;
-	return vma_next(&kit->data->vmi);
+
+	if (kit->data->locked_vma)
+		vma_end_read(kit->data->locked_vma);
+
+retry:
+	rcu_read_lock();
+	vma_iter_init(&vmi, kit->data->mm, kit->data->last_addr);
+	vma = vma_next(&vmi);
+	if (!vma) {
+		rcu_read_unlock();
+		kit->data->locked_vma = NULL;
+		return NULL;
+	}
+	next_addr = vma->vm_start;
+	next_end = vma->vm_end;
+	rcu_read_unlock();
+
+	vma = lock_vma_under_rcu(kit->data->mm, next_addr);
+	if (!vma) {
+		if (next_end > kit->data->last_addr)
+			kit->data->last_addr = next_end;
+		else
+			kit->data->last_addr += PAGE_SIZE;
+		goto retry;
+	}
+
+	if (unlikely(kit->data->last_addr >= vma->vm_end)) {
+		kit->data->last_addr = vma->vm_end;
+		vma_end_read(vma);
+		goto retry;
+	}
+
+	kit->data->locked_vma = vma;
+	kit->data->last_addr = vma->vm_end;
+	return vma;
 }
 
 __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -938,7 +971,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
 
 	if (kit->data) {
-		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
+		if (kit->data->locked_vma)
+			vma_end_read(kit->data->locked_vma);
 		bpf_iter_mmput(kit->data->mm);
 		put_task_struct(kit->data->task);
 		bpf_mem_free(&bpf_global_ma, kit->data);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator
  2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
  2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
@ 2026-03-04 14:20 ` Puranjay Mohan
  2026-03-05 18:53   ` Mykyta Yatsenko
  2 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-04 14:20 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

Holding the per-VMA lock across the BPF program's loop body creates a
lock ordering problem when helpers acquire locks with a dependency on
mmap_lock (e.g., bpf_dynptr_read -> __kernel_read -> i_rwsem):

  vm_lock -> i_rwsem -> mmap_lock -> vm_lock

Snapshot VMA fields into an embedded struct vm_area_struct under the
per-VMA lock in _next(), then drop the lock before returning. The BPF
program accesses only the snapshot, so no lock is held during execution.
For vm_file, get_file() takes a reference under the lock, released via
fput() on the next iteration or in _destroy(). The snapshot's vm_file is
set to NULL after fput() so _destroy() does not double-release the
reference when _next() has already dropped it. For vm_mm, the snapshot
uses the mm pointer held via mmget().

Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/task_iter.c | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index ff29d4da0267..4bf93cff69c7 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -798,7 +798,7 @@ const struct bpf_func_proto bpf_find_vma_proto = {
 struct bpf_iter_task_vma_kern_data {
 	struct task_struct *task;
 	struct mm_struct *mm;
-	struct vm_area_struct *locked_vma;
+	struct vm_area_struct snapshot;
 	u64 last_addr;
 };
 
@@ -908,8 +908,8 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		goto err_cleanup_iter;
 	}
 
-	kit->data->locked_vma = NULL;
 	kit->data->last_addr = addr;
+	memset(&kit->data->snapshot, 0, sizeof(kit->data->snapshot));
 	return 0;
 
 err_cleanup_iter:
@@ -923,15 +923,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
-	struct vm_area_struct *vma;
+	struct vm_area_struct *snap, *vma;
 	struct vma_iterator vmi;
 	unsigned long next_addr, next_end;
 
 	if (!kit->data) /* bpf_iter_task_vma_new failed */
 		return NULL;
 
-	if (kit->data->locked_vma)
-		vma_end_read(kit->data->locked_vma);
+	snap = &kit->data->snapshot;
+
+	if (snap->vm_file) {
+		fput(snap->vm_file);
+		snap->vm_file = NULL;
+	}
 
 retry:
 	rcu_read_lock();
@@ -939,7 +943,6 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
 	vma = vma_next(&vmi);
 	if (!vma) {
 		rcu_read_unlock();
-		kit->data->locked_vma = NULL;
 		return NULL;
 	}
 	next_addr = vma->vm_start;
@@ -961,9 +964,17 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
 		goto retry;
 	}
 
-	kit->data->locked_vma = vma;
+	snap->vm_start = vma->vm_start;
+	snap->vm_end = vma->vm_end;
+	snap->vm_mm = kit->data->mm;
+	snap->vm_page_prot = vma->vm_page_prot;
+	snap->__vm_flags = vma->vm_flags;
+	snap->vm_pgoff = vma->vm_pgoff;
+	snap->vm_file = vma->vm_file ? get_file(vma->vm_file) : NULL;
+
 	kit->data->last_addr = vma->vm_end;
-	return vma;
+	vma_end_read(vma);
+	return snap;
 }
 
 __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -971,8 +982,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
 
 	if (kit->data) {
-		if (kit->data->locked_vma)
-			vma_end_read(kit->data->locked_vma);
+		if (kit->data->snapshot.vm_file)
+			fput(kit->data->snapshot.vm_file);
 		bpf_iter_mmput(kit->data->mm);
 		put_task_struct(kit->data->task);
 		bpf_mem_free(&bpf_global_ma, kit->data);
-- 
2.47.3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
@ 2026-03-05  8:55   ` kernel test robot
  2026-03-05 11:58   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-05  8:55 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: llvm, oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

Hi Puranjay,

kernel test robot noticed the following build errors:

[auto build test ERROR on 3ebc98c1ae7efda949a015990280a097f4a5453a]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-fix-mm-lifecycle-in-open-coded-task_vma-iterator/20260304-224301
base:   3ebc98c1ae7efda949a015990280a097f4a5453a
patch link:    https://lore.kernel.org/r/20260304142026.1443666-2-puranjay%40kernel.org
patch subject: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
config: arm-randconfig-003-20260305 (https://download.01.org/0day-ci/archive/20260305/202603051628.H3HNsDUG-lkp@intel.com/config)
compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 9a109fbb6e184ec9bcce10615949f598f4c974a9)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260305/202603051628.H3HNsDUG-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603051628.H3HNsDUG-lkp@intel.com/

All errors (new ones prefixed by >>):

>> kernel/bpf/task_iter.c:840:3: error: call to undeclared function 'mmput_async'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     840 |                 mmput_async(work->mm);
         |                 ^
   kernel/bpf/task_iter.c:848:3: error: call to undeclared function 'mmput_async'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
     848 |                 mmput_async(mm);
         |                 ^
   2 errors generated.


vim +/mmput_async +840 kernel/bpf/task_iter.c

   833	
   834	static void do_bpf_iter_mmput(struct irq_work *entry)
   835	{
   836		struct bpf_iter_mmput_irq_work *work;
   837	
   838		work = container_of(entry, struct bpf_iter_mmput_irq_work, irq_work);
   839		if (work->mm) {
 > 840			mmput_async(work->mm);
   841			work->mm = NULL;
   842		}
   843	}
   844	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
  2026-03-05  8:55   ` kernel test robot
@ 2026-03-05 11:58   ` kernel test robot
  2026-03-05 16:34   ` Mykyta Yatsenko
  2026-03-06  1:11   ` Alexei Starovoitov
  3 siblings, 0 replies; 13+ messages in thread
From: kernel test robot @ 2026-03-05 11:58 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: oe-kbuild-all, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

Hi Puranjay,

kernel test robot noticed the following build errors:

[auto build test ERROR on 3ebc98c1ae7efda949a015990280a097f4a5453a]

url:    https://github.com/intel-lab-lkp/linux/commits/Puranjay-Mohan/bpf-fix-mm-lifecycle-in-open-coded-task_vma-iterator/20260304-224301
base:   3ebc98c1ae7efda949a015990280a097f4a5453a
patch link:    https://lore.kernel.org/r/20260304142026.1443666-2-puranjay%40kernel.org
patch subject: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
config: sh-allmodconfig (https://download.01.org/0day-ci/archive/20260305/202603051901.tplxkNgS-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 15.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260305/202603051901.tplxkNgS-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603051901.tplxkNgS-lkp@intel.com/

All errors (new ones prefixed by >>):

   kernel/bpf/task_iter.c: In function 'do_bpf_iter_mmput':
>> kernel/bpf/task_iter.c:840:17: error: implicit declaration of function 'mmput_async' [-Wimplicit-function-declaration]
     840 |                 mmput_async(work->mm);
         |                 ^~~~~~~~~~~


vim +/mmput_async +840 kernel/bpf/task_iter.c

   833	
   834	static void do_bpf_iter_mmput(struct irq_work *entry)
   835	{
   836		struct bpf_iter_mmput_irq_work *work;
   837	
   838		work = container_of(entry, struct bpf_iter_mmput_irq_work, irq_work);
   839		if (work->mm) {
 > 840			mmput_async(work->mm);
   841			work->mm = NULL;
   842		}
   843	}
   844	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
  2026-03-05  8:55   ` kernel test robot
  2026-03-05 11:58   ` kernel test robot
@ 2026-03-05 16:34   ` Mykyta Yatsenko
  2026-03-05 16:48     ` Puranjay Mohan
  2026-03-06  1:11   ` Alexei Starovoitov
  3 siblings, 1 reply; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-03-05 16:34 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Puranjay Mohan <puranjay@kernel.org> writes:

> The open-coded task_vma BPF iterator reads task->mm and acquires
> mmap_read_trylock() but never calls mmget(). This violates refcount
> discipline: the mm can reach mm_users == 0 if the task exits while the
> iterator holds the lock.
>
> Add mmget_not_zero() before mmap_read_trylock(). On the error path
> after mmget succeeds, the mm reference must be dropped. mmput() can
> sleep (exit_mmap, etc.) so it is unsuitable from BPF context.
> mmput_async() is safe from hardirq but not from NMI, because
> schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> which can deadlock if the NMI interrupted code holding that lock.
>
> Add a dedicated per-CPU irq_work (bpf_iter_mmput_work) and a helper
> bpf_iter_mmput() that calls mmput_async() directly when not in NMI,
> or defers to the irq_work callback when in NMI context. Use it in
> both the _new() error path and _destroy(). Add bpf_iter_mmput_busy()
> to check irq_work slot availability, and use it alongside
> bpf_mmap_unlock_get_irq_work() in _new() to verify both slots are
> free before acquiring references.
>
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  kernel/bpf/task_iter.c | 77 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 98d9b4c0daff..d3fa8ba0a896 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -813,6 +813,55 @@ struct bpf_iter_task_vma_kern {
>  	struct bpf_iter_task_vma_kern_data *data;
>  } __attribute__((aligned(8)));
>  
> +/*
> + * Per-CPU irq_work for NMI-safe mmput.
> + *
> + * mmput_async() is safe from hardirq context but not from NMI, because
> + * schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> + * which can deadlock if the NMI interrupted code holding that lock.
> + *
> + * This dedicated irq_work defers mmput to hardirq context where
> + * mmput_async() is safe. BPF programs are non-preemptible, so one
> + * slot per CPU is sufficient.
> + */
> +struct bpf_iter_mmput_irq_work {
> +	struct irq_work irq_work;
> +	struct mm_struct *mm;
> +};
struct mmap_unlock_irq_work is exactly the same struct, perhaps an
additional patch renaming it to something like struct bpf_iter_mm_irq_work
is needed. Then we can reuse it for mmput.
> +
> +static DEFINE_PER_CPU(struct bpf_iter_mmput_irq_work, bpf_iter_mmput_work);
> +
> +static void do_bpf_iter_mmput(struct irq_work *entry)
> +{
> +	struct bpf_iter_mmput_irq_work *work;
> +
> +	work = container_of(entry, struct bpf_iter_mmput_irq_work, irq_work);
> +	if (work->mm) {
> +		mmput_async(work->mm);
> +		work->mm = NULL;
> +	}
> +}
> +
> +static void bpf_iter_mmput(struct mm_struct *mm)
> +{
> +	if (!in_nmi()) {
> +		mmput_async(mm);
> +	} else {
> +		struct bpf_iter_mmput_irq_work *work;
> +
> +		work = this_cpu_ptr(&bpf_iter_mmput_work);
> +		work->mm = mm;
> +		irq_work_queue(&work->irq_work);
> +	}
> +}
> +
> +static bool bpf_iter_mmput_busy(void)
> +{
> +	if (!in_nmi())
> +		return false;
> +	return irq_work_is_busy(&this_cpu_ptr(&bpf_iter_mmput_work)->irq_work);
> +}
> +
>  __bpf_kfunc_start_defs();
>  
>  __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> @@ -840,19 +889,35 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  		goto err_cleanup_iter;
>  	}
>  
> -	/* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> +	/*
> +	 * Check irq_work availability for both mmap_lock release and mmput.
> +	 * Both use separate per-CPU irq_work slots, and both must be free
> +	 * to guarantee _destroy() can complete from NMI context.
> +	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> +	 */
>  	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> -	if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
> +	if (irq_work_busy || bpf_iter_mmput_busy()) {
>  		err = -EBUSY;
>  		goto err_cleanup_iter;
>  	}
>  
> +	if (!mmget_not_zero(kit->data->mm)) {
> +		err = -ENOENT;
> +		goto err_cleanup_iter;
> +	}
> +
> +	if (!mmap_read_trylock(kit->data->mm)) {
> +		err = -EBUSY;
> +		goto err_cleanup_mmget;
> +	}
> +
>  	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
>  	return 0;
>  
> +err_cleanup_mmget:
> +	bpf_iter_mmput(kit->data->mm);
>  err_cleanup_iter:
> -	if (kit->data->task)
> -		put_task_struct(kit->data->task);
> +	put_task_struct(kit->data->task);
>  	bpf_mem_free(&bpf_global_ma, kit->data);
>  	/* NULL kit->data signals failed bpf_iter_task_vma initialization */
>  	kit->data = NULL;
> @@ -874,6 +939,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>  
>  	if (kit->data) {
>  		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> +		bpf_iter_mmput(kit->data->mm);
>  		put_task_struct(kit->data->task);
>  		bpf_mem_free(&bpf_global_ma, kit->data);
>  	}
> @@ -1044,12 +1110,15 @@ static void do_mmap_read_unlock(struct irq_work *entry)
>  
>  static int __init task_iter_init(void)
>  {
> +	struct bpf_iter_mmput_irq_work *mmput_work;
>  	struct mmap_unlock_irq_work *work;
>  	int ret, cpu;
>  
>  	for_each_possible_cpu(cpu) {
>  		work = per_cpu_ptr(&mmap_unlock_work, cpu);
>  		init_irq_work(&work->irq_work, do_mmap_read_unlock);
> +		mmput_work = per_cpu_ptr(&bpf_iter_mmput_work, cpu);
> +		init_irq_work(&mmput_work->irq_work, do_bpf_iter_mmput);
>  	}
>  
>  	task_reg_info.ctx_arg_info[0].btf_id = btf_tracing_ids[BTF_TRACING_TYPE_TASK];
> -- 
> 2.47.3

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-05 16:34   ` Mykyta Yatsenko
@ 2026-03-05 16:48     ` Puranjay Mohan
  2026-03-05 17:36       ` Mykyta Yatsenko
  0 siblings, 1 reply; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-05 16:48 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team

On Thu, Mar 5, 2026 at 4:34 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> Puranjay Mohan <puranjay@kernel.org> writes:
>
> > The open-coded task_vma BPF iterator reads task->mm and acquires
> > mmap_read_trylock() but never calls mmget(). This violates refcount
> > discipline: the mm can reach mm_users == 0 if the task exits while the
> > iterator holds the lock.
> >
> > Add mmget_not_zero() before mmap_read_trylock(). On the error path
> > after mmget succeeds, the mm reference must be dropped. mmput() can
> > sleep (exit_mmap, etc.) so it is unsuitable from BPF context.
> > mmput_async() is safe from hardirq but not from NMI, because
> > schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> > which can deadlock if the NMI interrupted code holding that lock.
> >
> > Add a dedicated per-CPU irq_work (bpf_iter_mmput_work) and a helper
> > bpf_iter_mmput() that calls mmput_async() directly when not in NMI,
> > or defers to the irq_work callback when in NMI context. Use it in
> > both the _new() error path and _destroy(). Add bpf_iter_mmput_busy()
> > to check irq_work slot availability, and use it alongside
> > bpf_mmap_unlock_get_irq_work() in _new() to verify both slots are
> > free before acquiring references.
> >
> > Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> >  kernel/bpf/task_iter.c | 77 +++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 73 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > index 98d9b4c0daff..d3fa8ba0a896 100644
> > --- a/kernel/bpf/task_iter.c
> > +++ b/kernel/bpf/task_iter.c
> > @@ -813,6 +813,55 @@ struct bpf_iter_task_vma_kern {
> >       struct bpf_iter_task_vma_kern_data *data;
> >  } __attribute__((aligned(8)));
> >
> > +/*
> > + * Per-CPU irq_work for NMI-safe mmput.
> > + *
> > + * mmput_async() is safe from hardirq context but not from NMI, because
> > + * schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> > + * which can deadlock if the NMI interrupted code holding that lock.
> > + *
> > + * This dedicated irq_work defers mmput to hardirq context where
> > + * mmput_async() is safe. BPF programs are non-preemptible, so one
> > + * slot per CPU is sufficient.
> > + */
> > +struct bpf_iter_mmput_irq_work {
> > +     struct irq_work irq_work;
> > +     struct mm_struct *mm;
> > +};
> struct mmap_unlock_irq_work is exactly the same struct, perhaps an
> additional patch renaming it to something like struct bpf_iter_mm_irq_work
> is needed. Then we can reuse it for mmput.

They are similar but do different things: mmap_unlock_irq_work is used
to defer from both NMI and hard-irq, but mmput_async is safe to run from
hard-irq and only needs to be deferred from NMI. I had thought of
combining these but later felt that keeping them separate would be a
better approach.


* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-05 16:48     ` Puranjay Mohan
@ 2026-03-05 17:36       ` Mykyta Yatsenko
  0 siblings, 0 replies; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-03-05 17:36 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team

Puranjay Mohan <puranjay12@gmail.com> writes:

> On Thu, Mar 5, 2026 at 4:34 PM Mykyta Yatsenko
> <mykyta.yatsenko5@gmail.com> wrote:
>>
>> Puranjay Mohan <puranjay@kernel.org> writes:
>>
>> > The open-coded task_vma BPF iterator reads task->mm and acquires
>> > mmap_read_trylock() but never calls mmget(). This violates refcount
>> > discipline: the mm can reach mm_users == 0 if the task exits while the
>> > iterator holds the lock.
>> >
>> > Add mmget_not_zero() before mmap_read_trylock(). On the error path
>> > after mmget succeeds, the mm reference must be dropped. mmput() can
>> > sleep (exit_mmap, etc.) so it is unsuitable from BPF context.
>> > mmput_async() is safe from hardirq but not from NMI, because
>> > schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
>> > which can deadlock if the NMI interrupted code holding that lock.
>> >
>> > Add a dedicated per-CPU irq_work (bpf_iter_mmput_work) and a helper
>> > bpf_iter_mmput() that calls mmput_async() directly when not in NMI,
>> > or defers to the irq_work callback when in NMI context. Use it in
>> > both the _new() error path and _destroy(). Add bpf_iter_mmput_busy()
>> > to check irq_work slot availability, and use it alongside
>> > bpf_mmap_unlock_get_irq_work() in _new() to verify both slots are
>> > free before acquiring references.
>> >
>> > Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
>> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
>> > ---
>> >  kernel/bpf/task_iter.c | 77 +++++++++++++++++++++++++++++++++++++++---
>> >  1 file changed, 73 insertions(+), 4 deletions(-)
>> >
>> > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
>> > index 98d9b4c0daff..d3fa8ba0a896 100644
>> > --- a/kernel/bpf/task_iter.c
>> > +++ b/kernel/bpf/task_iter.c
>> > @@ -813,6 +813,55 @@ struct bpf_iter_task_vma_kern {
>> >       struct bpf_iter_task_vma_kern_data *data;
>> >  } __attribute__((aligned(8)));
>> >
>> > +/*
>> > + * Per-CPU irq_work for NMI-safe mmput.
>> > + *
>> > + * mmput_async() is safe from hardirq context but not from NMI, because
>> > + * schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
>> > + * which can deadlock if the NMI interrupted code holding that lock.
>> > + *
>> > + * This dedicated irq_work defers mmput to hardirq context where
>> > + * mmput_async() is safe. BPF programs are non-preemptible, so one
>> > + * slot per CPU is sufficient.
>> > + */
>> > +struct bpf_iter_mmput_irq_work {
>> > +     struct irq_work irq_work;
>> > +     struct mm_struct *mm;
>> > +};
>> struct mmap_unlock_irq_work is exactly the same struct, perhaps an
>> additional patch renaming it to something like struct bpf_iter_mm_irq_work
>> is needed. Then we can reuse it for mmput.
>
> They are similar but do different things: mmap_unlock_irq_work is used
> to defer from both NMI and hard-irq, but mmput_async is safe to run from
> hard-irq and only needs to be deferred from NMI. I had thought of
> combining these but later felt that keeping them separate would be a
> better approach.
Sure, I just suggest reusing the structure, not combining the code.
It's very generic: mm_struct + irq_work, basically some async op on the
mm_struct. I agree with you that it's better to keep the code separate
in this case.


* Re: [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
  2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
@ 2026-03-05 18:47   ` Mykyta Yatsenko
  0 siblings, 0 replies; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-03-05 18:47 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Puranjay Mohan <puranjay@kernel.org> writes:

> The open-coded task_vma BPF iterator holds mmap_lock for the entire
> duration of the iteration, increasing contention on this highly
> contended lock.
>
> Switch to per-VMA locking. In _next(), the next VMA is found via an
> RCU-protected maple tree walk, then locked with lock_vma_under_rcu()
> at its vm_start address. lock_next_vma() is not used because its
> fallback path takes mmap_read_lock(), and the iterator must work in
> non-sleepable contexts.
>
> Between the RCU walk and the lock attempt, the VMA may be removed,
> shrunk, or write-locked. When lock_vma_under_rcu() fails or the locked
> VMA was modified, the iterator advances past it and retries using vm_end
> saved from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU,
> individual objects can be freed and immediately reused within an RCU
> critical section. A VMA found by the maple tree walk may be recycled for
> a different mm before its fields are read, making the captured vm_end
> stale. When vm_end is stale and no longer ahead of the iteration
> position, the iterator falls back to PAGE_SIZE advancement to guarantee
> forward progress. VMAs inserted in gaps between iterations cannot be
> detected without mmap_lock speculation.
>
> The mm_struct is kept alive with mmget()/bpf_iter_mmput(). The
> bpf_mmap_unlock_get_irq_work() check is no longer needed since
> mmap_lock is no longer held; bpf_iter_mmput_busy() remains to guard the
> mmput irq_work slot. CONFIG_PER_VMA_LOCK is required; -EOPNOTSUPP is
> returned without it.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  kernel/bpf/task_iter.c | 72 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 53 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index d3fa8ba0a896..ff29d4da0267 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
>  #include <linux/bpf_mem_alloc.h>
>  #include <linux/btf_ids.h>
>  #include <linux/mm_types.h>
> +#include <linux/mmap_lock.h>
>  #include "mmap_unlock_work.h"
>  
>  static const char * const iter_task_type_names[] = {
> @@ -797,8 +798,8 @@ const struct bpf_func_proto bpf_find_vma_proto = {
>  struct bpf_iter_task_vma_kern_data {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
> -	struct mmap_unlock_irq_work *work;
> -	struct vma_iterator vmi;
> +	struct vm_area_struct *locked_vma;
> +	u64 last_addr;
>  };
>  
>  struct bpf_iter_task_vma {
> @@ -868,12 +869,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  				      struct task_struct *task, u64 addr)
>  {
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
> -	bool irq_work_busy = false;
>  	int err;
>  
>  	BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
>  	BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>  
> +	if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> +		kit->data = NULL;
> +		return -EOPNOTSUPP;
> +	}
> +
>  	/* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
>  	 * before, so non-NULL kit->data doesn't point to previously
>  	 * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> @@ -890,13 +895,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  	}
>  
>  	/*
> -	 * Check irq_work availability for both mmap_lock release and mmput.
> -	 * Both use separate per-CPU irq_work slots, and both must be free
> -	 * to guarantee _destroy() can complete from NMI context.
> -	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> +	 * Ensure the mmput irq_work slot is available so _destroy() can
> +	 * safely drop the mm reference from NMI context.
>  	 */
> -	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> -	if (irq_work_busy || bpf_iter_mmput_busy()) {
> +	if (bpf_iter_mmput_busy()) {
>  		err = -EBUSY;
>  		goto err_cleanup_iter;
>  	}
> @@ -906,16 +908,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  		goto err_cleanup_iter;
>  	}
>  
> -	if (!mmap_read_trylock(kit->data->mm)) {
> -		err = -EBUSY;
> -		goto err_cleanup_mmget;
> -	}
> -
> -	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> +	kit->data->locked_vma = NULL;
> +	kit->data->last_addr = addr;
>  	return 0;
>  
> -err_cleanup_mmget:
> -	bpf_iter_mmput(kit->data->mm);
>  err_cleanup_iter:
>  	put_task_struct(kit->data->task);
>  	bpf_mem_free(&bpf_global_ma, kit->data);
> @@ -927,10 +923,47 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
>  {
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
> +	struct vm_area_struct *vma;
> +	struct vma_iterator vmi;
> +	unsigned long next_addr, next_end;
>  
>  	if (!kit->data) /* bpf_iter_task_vma_new failed */
>  		return NULL;
> -	return vma_next(&kit->data->vmi);
> +
> +	if (kit->data->locked_vma)
> +		vma_end_read(kit->data->locked_vma);
> +
> +retry:
> +	rcu_read_lock();
> +	vma_iter_init(&vmi, kit->data->mm, kit->data->last_addr);
> +	vma = vma_next(&vmi);
> +	if (!vma) {
> +		rcu_read_unlock();
> +		kit->data->locked_vma = NULL;
> +		return NULL;
> +	}
> +	next_addr = vma->vm_start;
> +	next_end = vma->vm_end;
> +	rcu_read_unlock();
> +
> +	vma = lock_vma_under_rcu(kit->data->mm, next_addr);
> +	if (!vma) {
> +		if (next_end > kit->data->last_addr)
> +			kit->data->last_addr = next_end;
> +		else
> +			kit->data->last_addr += PAGE_SIZE;
> +		goto retry;
> +	}
> +
> +	if (unlikely(kit->data->last_addr >= vma->vm_end)) {
> +		kit->data->last_addr = vma->vm_end;
> +		vma_end_read(vma);
> +		goto retry;
> +	}
nit: maybe we can move this next vma lookup (retry block) to a
separate function; this code looks a bit intimidating in _vma_next().
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
> +
> +	kit->data->locked_vma = vma;
> +	kit->data->last_addr = vma->vm_end;
> +	return vma;
>  }
>  
>  __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -938,7 +971,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
>  
>  	if (kit->data) {
> -		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> +		if (kit->data->locked_vma)
> +			vma_end_read(kit->data->locked_vma);
>  		bpf_iter_mmput(kit->data->mm);
>  		put_task_struct(kit->data->task);
>  		bpf_mem_free(&bpf_global_ma, kit->data);
> -- 
> 2.47.3


* Re: [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator
  2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
@ 2026-03-05 18:53   ` Mykyta Yatsenko
  2026-03-05 19:03     ` Puranjay Mohan
  0 siblings, 1 reply; 13+ messages in thread
From: Mykyta Yatsenko @ 2026-03-05 18:53 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, kernel-team

Puranjay Mohan <puranjay@kernel.org> writes:

> Holding the per-VMA lock across the BPF program's loop body creates a
> lock ordering problem when helpers acquire locks with a dependency on
> mmap_lock (e.g., bpf_dynptr_read -> __kernel_read -> i_rwsem):
>
>   vm_lock -> i_rwsem -> mmap_lock -> vm_lock
>
> Snapshot VMA fields into an embedded struct vm_area_struct under the
> per-VMA lock in _next(), then drop the lock before returning. The BPF
> program accesses only the snapshot, so no lock is held during execution.
> For vm_file, get_file() takes a reference under the lock, released via
> fput() on the next iteration or in _destroy(). The snapshot's vm_file is
> set to NULL after fput() so _destroy() does not double-release the
> reference when _next() has already dropped it. For vm_mm, the snapshot
> uses the mm pointer held via mmget().
>
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  kernel/bpf/task_iter.c | 31 +++++++++++++++++++++----------
>  1 file changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index ff29d4da0267..4bf93cff69c7 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -798,7 +798,7 @@ const struct bpf_func_proto bpf_find_vma_proto = {
>  struct bpf_iter_task_vma_kern_data {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
> -	struct vm_area_struct *locked_vma;
> +	struct vm_area_struct snapshot;
>  	u64 last_addr;
>  };
>  
> @@ -908,8 +908,8 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  		goto err_cleanup_iter;
>  	}
>  
> -	kit->data->locked_vma = NULL;
>  	kit->data->last_addr = addr;
> +	memset(&kit->data->snapshot, 0, sizeof(kit->data->snapshot));
>  	return 0;
>  
>  err_cleanup_iter:
> @@ -923,15 +923,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
>  {
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
> -	struct vm_area_struct *vma;
> +	struct vm_area_struct *snap, *vma;
>  	struct vma_iterator vmi;
>  	unsigned long next_addr, next_end;
>  
>  	if (!kit->data) /* bpf_iter_task_vma_new failed */
>  		return NULL;
>  
> -	if (kit->data->locked_vma)
> -		vma_end_read(kit->data->locked_vma);
> +	snap = &kit->data->snapshot;
> +
> +	if (snap->vm_file) {
> +		fput(snap->vm_file);
> +		snap->vm_file = NULL;
> +	}
>  
>  retry:
>  	rcu_read_lock();
> @@ -939,7 +943,6 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
>  	vma = vma_next(&vmi);
>  	if (!vma) {
>  		rcu_read_unlock();
> -		kit->data->locked_vma = NULL;
>  		return NULL;
>  	}
>  	next_addr = vma->vm_start;
> @@ -961,9 +964,17 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
>  		goto retry;
>  	}
>  
> -	kit->data->locked_vma = vma;
> +	snap->vm_start = vma->vm_start;
> +	snap->vm_end = vma->vm_end;
> +	snap->vm_mm = kit->data->mm;
> +	snap->vm_page_prot = vma->vm_page_prot;
> +	snap->flags = vma->flags;
> +	snap->vm_pgoff = vma->vm_pgoff;
> +	snap->vm_file = vma->vm_file ? get_file(vma->vm_file) : NULL;
Are you omitting some fields when copying to snapshot? How do
you decide what fields are needed and what not? If your intention is
to copy everything and bump refcnt for file, why not memcpy() +
get_file(vma->vm_file)?
> +
>  	kit->data->last_addr = vma->vm_end;
> -	return vma;
> +	vma_end_read(vma);
> +	return snap;
>  }
>  
>  __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -971,8 +982,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
>  
>  	if (kit->data) {
> -		if (kit->data->locked_vma)
> -			vma_end_read(kit->data->locked_vma);
> +		if (kit->data->snapshot.vm_file)
> +			fput(kit->data->snapshot.vm_file);
>  		bpf_iter_mmput(kit->data->mm);
>  		put_task_struct(kit->data->task);
>  		bpf_mem_free(&bpf_global_ma, kit->data);
> -- 
> 2.47.3


* Re: [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator
  2026-03-05 18:53   ` Mykyta Yatsenko
@ 2026-03-05 19:03     ` Puranjay Mohan
  0 siblings, 0 replies; 13+ messages in thread
From: Puranjay Mohan @ 2026-03-05 19:03 UTC (permalink / raw)
  To: Mykyta Yatsenko
  Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
	kernel-team

On Thu, Mar 5, 2026 at 6:53 PM Mykyta Yatsenko
<mykyta.yatsenko5@gmail.com> wrote:
>
> Puranjay Mohan <puranjay@kernel.org> writes:
>
> > Holding the per-VMA lock across the BPF program's loop body creates a
> > lock ordering problem when helpers acquire locks with a dependency on
> > mmap_lock (e.g., bpf_dynptr_read -> __kernel_read -> i_rwsem):
> >
> >   vm_lock -> i_rwsem -> mmap_lock -> vm_lock
> >
> > Snapshot VMA fields into an embedded struct vm_area_struct under the
> > per-VMA lock in _next(), then drop the lock before returning. The BPF
> > program accesses only the snapshot, so no lock is held during execution.
> > For vm_file, get_file() takes a reference under the lock, released via
> > fput() on the next iteration or in _destroy(). The snapshot's vm_file is
> > set to NULL after fput() so _destroy() does not double-release the
> > reference when _next() has already dropped it. For vm_mm, the snapshot
> > uses the mm pointer held via mmget().
> >
> > Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> >  kernel/bpf/task_iter.c | 31 +++++++++++++++++++++----------
> >  1 file changed, 21 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > index ff29d4da0267..4bf93cff69c7 100644
> > --- a/kernel/bpf/task_iter.c
> > +++ b/kernel/bpf/task_iter.c
> > @@ -798,7 +798,7 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> >  struct bpf_iter_task_vma_kern_data {
> >       struct task_struct *task;
> >       struct mm_struct *mm;
> > -     struct vm_area_struct *locked_vma;
> > +     struct vm_area_struct snapshot;
> >       u64 last_addr;
> >  };
> >
> > @@ -908,8 +908,8 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> >               goto err_cleanup_iter;
> >       }
> >
> > -     kit->data->locked_vma = NULL;
> >       kit->data->last_addr = addr;
> > +     memset(&kit->data->snapshot, 0, sizeof(kit->data->snapshot));
> >       return 0;
> >
> >  err_cleanup_iter:
> > @@ -923,15 +923,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> >  __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
> >  {
> >       struct bpf_iter_task_vma_kern *kit = (void *)it;
> > -     struct vm_area_struct *vma;
> > +     struct vm_area_struct *snap, *vma;
> >       struct vma_iterator vmi;
> >       unsigned long next_addr, next_end;
> >
> >       if (!kit->data) /* bpf_iter_task_vma_new failed */
> >               return NULL;
> >
> > -     if (kit->data->locked_vma)
> > -             vma_end_read(kit->data->locked_vma);
> > +     snap = &kit->data->snapshot;
> > +
> > +     if (snap->vm_file) {
> > +             fput(snap->vm_file);
> > +             snap->vm_file = NULL;
> > +     }
> >
> >  retry:
> >       rcu_read_lock();
> > @@ -939,7 +943,6 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
> >       vma = vma_next(&vmi);
> >       if (!vma) {
> >               rcu_read_unlock();
> > -             kit->data->locked_vma = NULL;
> >               return NULL;
> >       }
> >       next_addr = vma->vm_start;
> > @@ -961,9 +964,17 @@ __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_v
> >               goto retry;
> >       }
> >
> > -     kit->data->locked_vma = vma;
> > +     snap->vm_start = vma->vm_start;
> > +     snap->vm_end = vma->vm_end;
> > +     snap->vm_mm = kit->data->mm;
> > +     snap->vm_page_prot = vma->vm_page_prot;
> > +     snap->flags = vma->flags;
> > +     snap->vm_pgoff = vma->vm_pgoff;
> > +     snap->vm_file = vma->vm_file ? get_file(vma->vm_file) : NULL;
> Are you omitting some fields when copying to snapshot? How do
> you decide what fields are needed and what not? If your intention is
> to copy everything and bump refcnt for file, why not memcpy() +
> get_file(vma->vm_file)?

I looked at the usage of the vma across bpf programs and these were
the most important fields. I couldn't find a bpf program that uses a
field other than these, so I only copy these and the others stay 0.


* Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
                     ` (2 preceding siblings ...)
  2026-03-05 16:34   ` Mykyta Yatsenko
@ 2026-03-06  1:11   ` Alexei Starovoitov
  3 siblings, 0 replies; 13+ messages in thread
From: Alexei Starovoitov @ 2026-03-06  1:11 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Mykyta Yatsenko, Kernel Team

On Wed, Mar 4, 2026 at 6:20 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> +
> +static void bpf_iter_mmput(struct mm_struct *mm)
> +{
> +       if (!in_nmi()) {

We discussed the same thing earlier during timer series.
Are you sure there is no way to reenter into __queue_work()?
hint.. hint..

> +               mmput_async(mm);


end of thread, other threads:[~2026-03-06  1:12 UTC | newest]

Thread overview: 13+ messages
2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
2026-03-05  8:55   ` kernel test robot
2026-03-05 11:58   ` kernel test robot
2026-03-05 16:34   ` Mykyta Yatsenko
2026-03-05 16:48     ` Puranjay Mohan
2026-03-05 17:36       ` Mykyta Yatsenko
2026-03-06  1:11   ` Alexei Starovoitov
2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
2026-03-05 18:47   ` Mykyta Yatsenko
2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2026-03-05 18:53   ` Mykyta Yatsenko
2026-03-05 19:03     ` Puranjay Mohan
