public inbox for bpf@vger.kernel.org
* [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator
@ 2026-04-08 15:45 Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Puranjay Mohan @ 2026-04-08 15:45 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

Changelog:
v5: https://lore.kernel.org/all/20260326151111.4002475-1-puranjay@kernel.org/
Changes in v6:
- Replace local_irq_disable() + get_task_mm() with spin_trylock() on
  alloc_lock to avoid a softirq deadlock: if the target task holds its
  alloc_lock and gets interrupted, a softirq BPF program iterating
  that task would deadlock on task_lock() (Gemini)
- Gate on CONFIG_MMU in patch 1 so that the mmput() fallback in
  bpf_iter_mmput_async() cannot sleep in non-sleepable BPF context
  on NOMMU; patch 2 tightens this to CONFIG_PER_VMA_LOCK (Gemini)
- Merge the split if (irq_work_busy) / if (!mmap_read_trylock())
  back into a single if statement in patch 1 (Andrii)
- Flip comparison direction in bpf_iter_task_vma_find_next() so both
  the locked and unlocked VMA failure cases read consistently:
  end <= next_addr → advance by PAGE_SIZE, else use end (Andrii)
- Add Acked-by from Andrii on patch 3

v4: https://lore.kernel.org/all/20260316185736.649940-1-puranjay@kernel.org/
Changes in v5:
- Use get_task_mm() instead of a lockless task->mm read followed by
  mmget_not_zero() to fix a use-after-free: mm_struct is not
  SLAB_TYPESAFE_BY_RCU, so the lockless pointer can go stale (AI)
- Add a local bpf_iter_mmput_async() wrapper with #ifdef CONFIG_MMU
  to avoid modifying fork.c and sched/mm.h outside the BPF tree
- Drop the fork.c and sched/mm.h changes that widened the
  mmput_async() #if guard
- Disable IRQs around get_task_mm() to prevent raw tracepoint
  re-entrancy from deadlocking on task_lock()

v3: https://lore.kernel.org/all/20260311225726.808332-1-puranjay@kernel.org/
Changes in v4:
- Disable task_vma iterator in irq_disabled() contexts to mitigate deadlocks (Alexei)
- Use a helper function to reset the snapshot (Andrii)
- Remove the redundant snap->vm_mm = kit->data->mm; (Andrii)
- Remove all irq_work deferral as the iterator will not work in
  irq_disabled() sections anymore and _new() will return -EBUSY early.

v2: https://lore.kernel.org/all/20260309155506.23490-1-puranjay@kernel.org/
Changes in v3:
- Remove the rename patch 1 (Andrii)
- Put the irq_work in the iter data, per-cpu slot is not needed (Andrii)
- Remove the unnecessary !in_hardirq() in the deferral path (Alexei)
- Use PAGE_SIZE advancement in case vma shrinks back to maintain the
  forward progress guarantee (AI)

v1: https://lore.kernel.org/all/20260304142026.1443666-1-puranjay@kernel.org/
Changes in v2:
- Add a preparatory patch to rename mmap_unlock_irq_work to
  bpf_iter_mm_irq_work (Mykyta)
- Fix bpf_iter_mmput() to also defer for IRQ disabled regions (Alexei)
- Fix a build issue where mmput_async() is not available without
  CONFIG_MMU (kernel test robot)
- Reuse mmap_unlock_irq_work (after rename) for mmput (Mykyta)
- Move vma lookup (retry block) to a separate function (Mykyta)

This series fixes the mm lifecycle handling in the open-coded task_vma
BPF iterator and switches it from mmap_lock to per-VMA locking to reduce
contention. It then fixes a deadlock caused by holding locks across
the body of the iterator, where faulting is allowed.

Patch 1 fixes a use-after-free where task->mm was read locklessly and
could be freed before the iterator used it. It uses a trylock on
alloc_lock to safely read task->mm and acquire an mm reference, and
disables the iterator in irq_disabled() contexts by returning -EBUSY
from _new().

Patch 2 switches from holding mmap_lock for the entire iteration to
per-VMA locking via lock_vma_under_rcu(). This alone does not fix the
deadlock problem: holding the per-VMA lock for the whole iteration can
still cause lock ordering issues when a faultable helper is called in
the body of the iterator.

Patch 3 resolves the lock ordering problems caused by holding the
per-VMA lock or the mmap_lock (not applicable after patch 2) across BPF
program execution. It snapshots VMA fields under the lock, then drops
the lock before returning to the BPF program. File references are
managed via get_file()/fput() across iterations.

Puranjay Mohan (3):
  bpf: fix mm lifecycle in open-coded task_vma iterator
  bpf: switch task_vma iterator from mmap_lock to per-VMA locks
  bpf: return VMA snapshot from task_vma iterator

 kernel/bpf/task_iter.c | 151 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 136 insertions(+), 15 deletions(-)


base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
-- 
2.52.0


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH bpf v6 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
  2026-04-08 15:45 [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
@ 2026-04-08 15:45 ` Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Puranjay Mohan @ 2026-04-08 15:45 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

The open-coded task_vma iterator reads task->mm locklessly and acquires
mmap_read_trylock() but never calls mmget(). If the task exits
concurrently, the mm_struct can be freed as it is not
SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.

Safely read task->mm with a trylock on alloc_lock and acquire an mm
reference. Drop the reference via bpf_iter_mmput_async() in _destroy()
and error paths. bpf_iter_mmput_async() is a local wrapper around
mmput_async() with a fallback to mmput() on !CONFIG_MMU.

Reject irqs-disabled contexts (including NMI) up front. Operations used
by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
NMI or from a tracepoint that fires with those locks held could
deadlock.

A trylock on alloc_lock is used instead of the blocking task_lock()
(get_task_mm) to avoid a deadlock when a softirq BPF program iterates
a task that already holds its alloc_lock on the same CPU.
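
The trylock-to-avoid-self-deadlock pattern can be illustrated with a
minimal userspace sketch, using a pthread mutex in place of alloc_lock
(the names mm_out and task_mm are illustrative, not the kernel's API):

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Read a pointer that is only stable while alloc_lock is held. A
 * blocking lock here could deadlock if this code ran in a softirq that
 * interrupted the lock holder on the same CPU; trylock lets the caller
 * bail out with -EBUSY instead of spinning forever. */
static int read_mm_trylock(pthread_mutex_t *alloc_lock, void **mm_out,
			   void *task_mm)
{
	if (pthread_mutex_trylock(alloc_lock) != 0)
		return -EBUSY;		/* lock held: back off, no deadlock */
	*mm_out = task_mm;		/* stable while the lock is held */
	pthread_mutex_unlock(alloc_lock);
	return 0;
}
```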

Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/task_iter.c | 54 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 98d9b4c0daff..c1f5fbe9dc2f 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -9,6 +9,7 @@
 #include <linux/bpf_mem_alloc.h>
 #include <linux/btf_ids.h>
 #include <linux/mm_types.h>
+#include <linux/sched/mm.h>
 #include "mmap_unlock_work.h"
 
 static const char * const iter_task_type_names[] = {
@@ -794,6 +795,15 @@ const struct bpf_func_proto bpf_find_vma_proto = {
 	.arg5_type	= ARG_ANYTHING,
 };
 
+static inline void bpf_iter_mmput_async(struct mm_struct *mm)
+{
+#ifdef CONFIG_MMU
+	mmput_async(mm);
+#else
+	mmput(mm);
+#endif
+}
+
 struct bpf_iter_task_vma_kern_data {
 	struct task_struct *task;
 	struct mm_struct *mm;
@@ -825,6 +835,24 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 	BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
 	BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
 
+	/* bpf_iter_mmput_async() needs mmput_async() which requires CONFIG_MMU */
+	if (!IS_ENABLED(CONFIG_MMU)) {
+		kit->data = NULL;
+		return -EOPNOTSUPP;
+	}
+
+	/*
+	 * Reject irqs-disabled contexts including NMI. Operations used
+	 * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
+	 * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
+	 * Running from NMI or from a tracepoint that fires with those
+	 * locks held could deadlock.
+	 */
+	if (irqs_disabled()) {
+		kit->data = NULL;
+		return -EBUSY;
+	}
+
 	/* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
 	 * before, so non-NULL kit->data doesn't point to previously
 	 * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
@@ -834,7 +862,25 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		return -ENOMEM;
 
 	kit->data->task = get_task_struct(task);
+	/*
+	 * Safely read task->mm and acquire an mm reference.
+	 *
+	 * Cannot use get_task_mm() because its task_lock() is a
+	 * blocking spin_lock that would deadlock if the target task
+	 * already holds alloc_lock on this CPU (e.g. a softirq BPF
+	 * program iterating a task interrupted while holding its
+	 * alloc_lock).
+	 */
+	if (!spin_trylock(&task->alloc_lock)) {
+		err = -EBUSY;
+		goto err_cleanup_iter;
+	}
 	kit->data->mm = task->mm;
+	if (kit->data->mm && !(task->flags & PF_KTHREAD))
+		mmget(kit->data->mm);
+	else
+		kit->data->mm = NULL;
+	spin_unlock(&task->alloc_lock);
 	if (!kit->data->mm) {
 		err = -ENOENT;
 		goto err_cleanup_iter;
@@ -844,15 +890,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
 	if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
 		err = -EBUSY;
-		goto err_cleanup_iter;
+		goto err_cleanup_mmget;
 	}
 
 	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
 	return 0;
 
+err_cleanup_mmget:
+	bpf_iter_mmput_async(kit->data->mm);
 err_cleanup_iter:
-	if (kit->data->task)
-		put_task_struct(kit->data->task);
+	put_task_struct(kit->data->task);
 	bpf_mem_free(&bpf_global_ma, kit->data);
 	/* NULL kit->data signals failed bpf_iter_task_vma initialization */
 	kit->data = NULL;
@@ -875,6 +922,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 	if (kit->data) {
 		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
 		put_task_struct(kit->data->task);
+		bpf_iter_mmput_async(kit->data->mm);
 		bpf_mem_free(&bpf_global_ma, kit->data);
 	}
 }
-- 
2.52.0



* [PATCH bpf v6 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
  2026-04-08 15:45 [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
@ 2026-04-08 15:45 ` Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
  2026-04-10 19:10 ` [PATCH bpf v6 0/3] bpf: fix and improve open-coded " patchwork-bot+netdevbpf
  3 siblings, 0 replies; 6+ messages in thread
From: Puranjay Mohan @ 2026-04-08 15:45 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

The open-coded task_vma iterator holds mmap_lock for the entire duration
of iteration, increasing contention on this highly contended lock.

Switch to per-VMA locking. Find the next VMA via an RCU-protected maple
tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not
used because its fallback takes mmap_read_lock(), and the iterator must
work in non-sleepable contexts.

lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA
containing a given address but cannot iterate across gaps. An
RCU-protected vma_next() walk (mas_find) first locates the next VMA's
vm_start to pass to lock_vma_under_rcu().

Between the RCU walk and the lock, the VMA may be removed, shrunk, or
write-locked. On failure, advance past it using vm_end from the RCU
walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be
stale; fall back to PAGE_SIZE advancement when it does not make forward
progress. Concurrent VMA insertions at addresses already passed by the
iterator are not detected.

CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it.
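
The forward-progress rule above reduces to a small pure function; this
is a standalone sketch of the arithmetic, not the kernel code itself:

```c
#include <assert.h>

/* Compute the next lookup address after a failed per-VMA lock attempt,
 * given the possibly stale vm_end observed during the RCU walk. */
static unsigned long advance_addr(unsigned long next_addr,
				  unsigned long end,
				  unsigned long page_size)
{
	/* A stale vm_end at or below next_addr would make the retry
	 * loop forever; fall back to advancing one page so forward
	 * progress is always made. */
	if (end <= next_addr)
		return next_addr + page_size;
	return end;
}
```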

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/bpf/task_iter.c | 91 +++++++++++++++++++++++++++++++++---------
 1 file changed, 73 insertions(+), 18 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index c1f5fbe9dc2f..87e87f18913d 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -9,6 +9,7 @@
 #include <linux/bpf_mem_alloc.h>
 #include <linux/btf_ids.h>
 #include <linux/mm_types.h>
+#include <linux/mmap_lock.h>
 #include <linux/sched/mm.h>
 #include "mmap_unlock_work.h"
 
@@ -807,8 +808,8 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
 struct bpf_iter_task_vma_kern_data {
 	struct task_struct *task;
 	struct mm_struct *mm;
-	struct mmap_unlock_irq_work *work;
-	struct vma_iterator vmi;
+	struct vm_area_struct *locked_vma;
+	u64 next_addr;
 };
 
 struct bpf_iter_task_vma {
@@ -829,21 +830,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 				      struct task_struct *task, u64 addr)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
-	bool irq_work_busy = false;
 	int err;
 
 	BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
 	BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
 
-	/* bpf_iter_mmput_async() needs mmput_async() which requires CONFIG_MMU */
-	if (!IS_ENABLED(CONFIG_MMU)) {
+	if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
 		kit->data = NULL;
 		return -EOPNOTSUPP;
 	}
 
 	/*
 	 * Reject irqs-disabled contexts including NMI. Operations used
-	 * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
+	 * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
 	 * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
 	 * Running from NMI or from a tracepoint that fires with those
 	 * locks held could deadlock.
@@ -886,18 +885,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		goto err_cleanup_iter;
 	}
 
-	/* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
-	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
-	if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
-		err = -EBUSY;
-		goto err_cleanup_mmget;
-	}
-
-	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
+	kit->data->locked_vma = NULL;
+	kit->data->next_addr = addr;
 	return 0;
 
-err_cleanup_mmget:
-	bpf_iter_mmput_async(kit->data->mm);
 err_cleanup_iter:
 	put_task_struct(kit->data->task);
 	bpf_mem_free(&bpf_global_ma, kit->data);
@@ -906,13 +897,76 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 	return err;
 }
 
+/*
+ * Find and lock the next VMA at or after data->next_addr.
+ *
+ * lock_vma_under_rcu() is a point lookup (mas_walk): it finds the VMA
+ * containing a given address but cannot iterate. An RCU-protected
+ * maple tree walk with vma_next() (mas_find) is needed first to locate
+ * the next VMA's vm_start across any gap.
+ *
+ * Between the RCU walk and the lock, the VMA may be removed, shrunk,
+ * or write-locked. On failure, advance past it using vm_end from the
+ * RCU walk. SLAB_TYPESAFE_BY_RCU can make vm_end stale, so fall back
+ * to PAGE_SIZE advancement to guarantee forward progress.
+ */
+static struct vm_area_struct *
+bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
+{
+	struct vm_area_struct *vma;
+	struct vma_iterator vmi;
+	unsigned long start, end;
+
+retry:
+	rcu_read_lock();
+	vma_iter_init(&vmi, data->mm, data->next_addr);
+	vma = vma_next(&vmi);
+	if (!vma) {
+		rcu_read_unlock();
+		return NULL;
+	}
+	start = vma->vm_start;
+	end = vma->vm_end;
+	rcu_read_unlock();
+
+	vma = lock_vma_under_rcu(data->mm, start);
+	if (!vma) {
+		if (end <= data->next_addr)
+			data->next_addr += PAGE_SIZE;
+		else
+			data->next_addr = end;
+		goto retry;
+	}
+
+	if (unlikely(vma->vm_end <= data->next_addr)) {
+		data->next_addr += PAGE_SIZE;
+		vma_end_read(vma);
+		goto retry;
+	}
+
+	return vma;
+}
+
 __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
+	struct vm_area_struct *vma;
 
 	if (!kit->data) /* bpf_iter_task_vma_new failed */
 		return NULL;
-	return vma_next(&kit->data->vmi);
+
+	if (kit->data->locked_vma) {
+		vma_end_read(kit->data->locked_vma);
+		kit->data->locked_vma = NULL;
+	}
+
+	vma = bpf_iter_task_vma_find_next(kit->data);
+	if (!vma)
+		return NULL;
+
+	kit->data->locked_vma = vma;
+	kit->data->next_addr = vma->vm_end;
+	return vma;
 }
 
 __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -920,7 +974,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
 
 	if (kit->data) {
-		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
+		if (kit->data->locked_vma)
+			vma_end_read(kit->data->locked_vma);
 		put_task_struct(kit->data->task);
 		bpf_iter_mmput_async(kit->data->mm);
 		bpf_mem_free(&bpf_global_ma, kit->data);
-- 
2.52.0



* [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator
  2026-04-08 15:45 [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
  2026-04-08 15:45 ` [PATCH bpf v6 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
@ 2026-04-08 15:45 ` Puranjay Mohan
  2026-04-09 13:15   ` Mykyta Yatsenko
  2026-04-10 19:10 ` [PATCH bpf v6 0/3] bpf: fix and improve open-coded " patchwork-bot+netdevbpf
  3 siblings, 1 reply; 6+ messages in thread
From: Puranjay Mohan @ 2026-04-08 15:45 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
	kernel-team

Holding the per-VMA lock across the BPF program body creates a lock
ordering problem when helpers acquire locks that depend on mmap_lock:

  vm_lock -> i_rwsem -> mmap_lock -> vm_lock

Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
drop the lock before returning. The BPF program accesses only the
snapshot.

The verifier only trusts vm_mm and vm_file pointers (see
BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
counted with get_file() under the lock and released via fput() on the
next iteration or in _destroy(). vm_mm is already correct because
lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
are left as-is by memcpy() since the verifier treats them as untrusted.

Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/task_iter.c | 42 ++++++++++++++++++++++++++++++------------
 1 file changed, 30 insertions(+), 12 deletions(-)

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 87e87f18913d..e791ae065c39 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -808,7 +808,7 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
 struct bpf_iter_task_vma_kern_data {
 	struct task_struct *task;
 	struct mm_struct *mm;
-	struct vm_area_struct *locked_vma;
+	struct vm_area_struct snapshot;
 	u64 next_addr;
 };
 
@@ -842,7 +842,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 
 	/*
 	 * Reject irqs-disabled contexts including NMI. Operations used
-	 * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
+	 * by _next() and _destroy() (vma_end_read, fput, bpf_iter_mmput_async)
 	 * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
 	 * Running from NMI or from a tracepoint that fires with those
 	 * locks held could deadlock.
@@ -885,7 +885,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
 		goto err_cleanup_iter;
 	}
 
-	kit->data->locked_vma = NULL;
+	kit->data->snapshot.vm_file = NULL;
 	kit->data->next_addr = addr;
 	return 0;
 
@@ -947,26 +947,45 @@ bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
 	return vma;
 }
 
+static void bpf_iter_task_vma_snapshot_reset(struct vm_area_struct *snap)
+{
+	if (snap->vm_file) {
+		fput(snap->vm_file);
+		snap->vm_file = NULL;
+	}
+}
+
 __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
 {
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
-	struct vm_area_struct *vma;
+	struct vm_area_struct *snap, *vma;
 
 	if (!kit->data) /* bpf_iter_task_vma_new failed */
 		return NULL;
 
-	if (kit->data->locked_vma) {
-		vma_end_read(kit->data->locked_vma);
-		kit->data->locked_vma = NULL;
-	}
+	snap = &kit->data->snapshot;
+
+	bpf_iter_task_vma_snapshot_reset(snap);
 
 	vma = bpf_iter_task_vma_find_next(kit->data);
 	if (!vma)
 		return NULL;
 
-	kit->data->locked_vma = vma;
+	memcpy(snap, vma, sizeof(*snap));
+
+	/*
+	 * The verifier only trusts vm_mm and vm_file (see
+	 * BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). Take a reference
+	 * on vm_file; vm_mm is already correct because lock_vma_under_rcu()
+	 * verifies vma->vm_mm == mm. All other pointers are untrusted by
+	 * the verifier and left as-is.
+	 */
+	if (snap->vm_file)
+		get_file(snap->vm_file);
+
 	kit->data->next_addr = vma->vm_end;
-	return vma;
+	vma_end_read(vma);
+	return snap;
 }
 
 __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -974,8 +993,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
 	struct bpf_iter_task_vma_kern *kit = (void *)it;
 
 	if (kit->data) {
-		if (kit->data->locked_vma)
-			vma_end_read(kit->data->locked_vma);
+		bpf_iter_task_vma_snapshot_reset(&kit->data->snapshot);
 		put_task_struct(kit->data->task);
 		bpf_iter_mmput_async(kit->data->mm);
 		bpf_mem_free(&bpf_global_ma, kit->data);
-- 
2.52.0



* Re: [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator
  2026-04-08 15:45 ` [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
@ 2026-04-09 13:15   ` Mykyta Yatsenko
  0 siblings, 0 replies; 6+ messages in thread
From: Mykyta Yatsenko @ 2026-04-09 13:15 UTC (permalink / raw)
  To: Puranjay Mohan, bpf
  Cc: Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
	Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, kernel-team

On 4/8/26 4:45 PM, Puranjay Mohan wrote:
> Holding the per-VMA lock across the BPF program body creates a lock
> ordering problem when helpers acquire locks that depend on mmap_lock:
> 
>    vm_lock -> i_rwsem -> mmap_lock -> vm_lock
> 
> Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
> drop the lock before returning. The BPF program accesses only the
> snapshot.
> 
> The verifier only trusts vm_mm and vm_file pointers (see
> BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
> counted with get_file() under the lock and released via fput() on the
> next iteration or in _destroy(). vm_mm is already correct because
> lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
> are left as-is by memcpy() since the verifier treats them as untrusted.
> 
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> ---
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
>   kernel/bpf/task_iter.c | 42 ++++++++++++++++++++++++++++++------------
>   1 file changed, 30 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 87e87f18913d..e791ae065c39 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -808,7 +808,7 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
>   struct bpf_iter_task_vma_kern_data {
>   	struct task_struct *task;
>   	struct mm_struct *mm;
> -	struct vm_area_struct *locked_vma;
> +	struct vm_area_struct snapshot;
>   	u64 next_addr;
>   };
>   
> @@ -842,7 +842,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>   
>   	/*
>   	 * Reject irqs-disabled contexts including NMI. Operations used
> -	 * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
> +	 * by _next() and _destroy() (vma_end_read, fput, bpf_iter_mmput_async)
>   	 * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
>   	 * Running from NMI or from a tracepoint that fires with those
>   	 * locks held could deadlock.
> @@ -885,7 +885,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>   		goto err_cleanup_iter;
>   	}
>   
> -	kit->data->locked_vma = NULL;
> +	kit->data->snapshot.vm_file = NULL;
>   	kit->data->next_addr = addr;
>   	return 0;
>   
> @@ -947,26 +947,45 @@ bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
>   	return vma;
>   }
>   
> +static void bpf_iter_task_vma_snapshot_reset(struct vm_area_struct *snap)
> +{
> +	if (snap->vm_file) {
> +		fput(snap->vm_file);
> +		snap->vm_file = NULL;
> +	}
> +}
> +
>   __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
>   {
>   	struct bpf_iter_task_vma_kern *kit = (void *)it;
> -	struct vm_area_struct *vma;
> +	struct vm_area_struct *snap, *vma;
>   
>   	if (!kit->data) /* bpf_iter_task_vma_new failed */
>   		return NULL;
>   
> -	if (kit->data->locked_vma) {
> -		vma_end_read(kit->data->locked_vma);
> -		kit->data->locked_vma = NULL;
> -	}
> +	snap = &kit->data->snapshot;
> +
> +	bpf_iter_task_vma_snapshot_reset(snap);
>   
>   	vma = bpf_iter_task_vma_find_next(kit->data);
>   	if (!vma)
>   		return NULL;
>   
> -	kit->data->locked_vma = vma;
> +	memcpy(snap, vma, sizeof(*snap));
> +
> +	/*
> +	 * The verifier only trusts vm_mm and vm_file (see
> +	 * BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). Take a reference
> +	 * on vm_file; vm_mm is already correct because lock_vma_under_rcu()
> +	 * verifies vma->vm_mm == mm. All other pointers are untrusted by
> +	 * the verifier and left as-is.
> +	 */
> +	if (snap->vm_file)
> +		get_file(snap->vm_file);
> +
>   	kit->data->next_addr = vma->vm_end;
> -	return vma;
> +	vma_end_read(vma);
> +	return snap;
>   }
>   
>   __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -974,8 +993,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>   	struct bpf_iter_task_vma_kern *kit = (void *)it;
>   
>   	if (kit->data) {
> -		if (kit->data->locked_vma)
> -			vma_end_read(kit->data->locked_vma);
> +		bpf_iter_task_vma_snapshot_reset(&kit->data->snapshot);
>   		put_task_struct(kit->data->task);
>   		bpf_iter_mmput_async(kit->data->mm);
>   		bpf_mem_free(&bpf_global_ma, kit->data);



* Re: [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator
  2026-04-08 15:45 [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
                   ` (2 preceding siblings ...)
  2026-04-08 15:45 ` [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
@ 2026-04-10 19:10 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 6+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-04-10 19:10 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: bpf, puranjay12, ast, andrii, daniel, martin.lau, eddyz87, memxor,
	mykyta.yatsenko5, kernel-team

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed,  8 Apr 2026 08:45:34 -0700 you wrote:
> Changelog:
> v5: https://lore.kernel.org/all/20260326151111.4002475-1-puranjay@kernel.org/
> Changes in v6:
> - Replace local_irq_disable() + get_task_mm() with spin_trylock() on
>   alloc_lock to avoid a softirq deadlock: if the target task holds its
>   alloc_lock and gets interrupted, a softirq BPF program iterating
>   that task would deadlock on task_lock() (Gemini)
> - Gate on CONFIG_MMU in patch 1 so that the mmput() fallback in
>   bpf_iter_mmput_async() cannot sleep in non-sleepable BPF context
>   on NOMMU; patch 2 tightens this to CONFIG_PER_VMA_LOCK (Gemini)
> - Merge the split if (irq_work_busy) / if (!mmap_read_trylock())
>   back into a single if statement in patch 1 (Andrii)
> - Flip comparison direction in bpf_iter_task_vma_find_next() so both
>   the locked and unlocked VMA failure cases read consistently:
>   end <= next_addr → PAGE_SIZE, else - use end (Andrii)
> - Add Acked-by from Andrii on patch 3
> 
> [...]

Here is the summary with links:
  - [bpf,v6,1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
    https://git.kernel.org/bpf/bpf-next/c/d8e27d2d22b6
  - [bpf,v6,2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
    https://git.kernel.org/bpf/bpf-next/c/bee9ef4a40a2
  - [bpf,v6,3/3] bpf: return VMA snapshot from task_vma iterator
    https://git.kernel.org/bpf/bpf-next/c/4cbee026db54

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




end of thread, other threads:[~2026-04-10 19:10 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-08 15:45 [PATCH bpf v6 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
2026-04-08 15:45 ` [PATCH bpf v6 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
2026-04-08 15:45 ` [PATCH bpf v6 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
2026-04-08 15:45 ` [PATCH bpf v6 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2026-04-09 13:15   ` Mykyta Yatsenko
2026-04-10 19:10 ` [PATCH bpf v6 0/3] bpf: fix and improve open-coded " patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox