* [RESEND PATCH bpf v5 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
2026-03-26 15:11 [RESEND PATCH bpf v5 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
@ 2026-03-26 15:11 ` Puranjay Mohan
2026-03-30 23:24 ` Andrii Nakryiko
2026-03-26 15:11 ` [RESEND PATCH bpf v5 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
2026-03-26 15:11 ` [RESEND PATCH bpf v5 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2026-03-26 15:11 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
kernel-team
The open-coded task_vma iterator reads task->mm locklessly and acquires
mmap_read_trylock() but never calls mmget(). If the task exits
concurrently, the mm_struct can be freed as it is not
SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.
Use get_task_mm() to safely read task->mm under task_lock() and acquire
an mm reference. Drop the reference via bpf_iter_mmput_async() in
_destroy() and error paths. bpf_iter_mmput_async() is a local wrapper
around mmput_async() with a fallback to mmput() on !CONFIG_MMU.
Reject irqs-disabled contexts (including NMI) up front. Operations used
by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
NMI or from a tracepoint that fires with those locks held could
deadlock. Disable IRQs around get_task_mm() to prevent raw tracepoint
re-entrancy from deadlocking on task_lock().
Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
kernel/bpf/task_iter.c | 45 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 40 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 98d9b4c0daff..faf4d6197608 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -9,6 +9,7 @@
#include <linux/bpf_mem_alloc.h>
#include <linux/btf_ids.h>
#include <linux/mm_types.h>
+#include <linux/sched/mm.h>
#include "mmap_unlock_work.h"
static const char * const iter_task_type_names[] = {
@@ -794,6 +795,15 @@ const struct bpf_func_proto bpf_find_vma_proto = {
.arg5_type = ARG_ANYTHING,
};
+static inline void bpf_iter_mmput_async(struct mm_struct *mm)
+{
+#ifdef CONFIG_MMU
+ mmput_async(mm);
+#else
+ mmput(mm);
+#endif
+}
+
struct bpf_iter_task_vma_kern_data {
struct task_struct *task;
struct mm_struct *mm;
@@ -825,6 +835,18 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
+ /*
+ * Reject irqs-disabled contexts including NMI. Operations used
+ * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
+ * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
+ * Running from NMI or from a tracepoint that fires with those
+ * locks held could deadlock.
+ */
+ if (irqs_disabled()) {
+ kit->data = NULL;
+ return -EBUSY;
+ }
+
/* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
* before, so non-NULL kit->data doesn't point to previously
* bpf_mem_alloc'd bpf_iter_task_vma_kern_data
@@ -834,7 +856,13 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
return -ENOMEM;
kit->data->task = get_task_struct(task);
- kit->data->mm = task->mm;
+ /*
+ * Disable IRQs so that a raw tracepoint re-entering the
+ * iterator on this CPU cannot deadlock on task_lock().
+ */
+ local_irq_disable();
+ kit->data->mm = get_task_mm(task);
+ local_irq_enable();
if (!kit->data->mm) {
err = -ENOENT;
goto err_cleanup_iter;
@@ -842,17 +870,23 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
/* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
- if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
+ if (irq_work_busy) {
err = -EBUSY;
- goto err_cleanup_iter;
+ goto err_cleanup_mmget;
+ }
+
+ if (!mmap_read_trylock(kit->data->mm)) {
+ err = -EBUSY;
+ goto err_cleanup_mmget;
}
vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
return 0;
+err_cleanup_mmget:
+ bpf_iter_mmput_async(kit->data->mm);
err_cleanup_iter:
- if (kit->data->task)
- put_task_struct(kit->data->task);
+ put_task_struct(kit->data->task);
bpf_mem_free(&bpf_global_ma, kit->data);
/* NULL kit->data signals failed bpf_iter_task_vma initialization */
kit->data = NULL;
@@ -875,6 +909,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
if (kit->data) {
bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
put_task_struct(kit->data->task);
+ bpf_iter_mmput_async(kit->data->mm);
bpf_mem_free(&bpf_global_ma, kit->data);
}
}
--
2.52.0
^ permalink raw reply related [flat|nested] 9+ messages in thread

* Re: [RESEND PATCH bpf v5 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
2026-03-26 15:11 ` [RESEND PATCH bpf v5 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
@ 2026-03-30 23:24 ` Andrii Nakryiko
2026-03-30 23:37 ` Puranjay Mohan
0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2026-03-30 23:24 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Mykyta Yatsenko, kernel-team
On Thu, Mar 26, 2026 at 8:11 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> The open-coded task_vma iterator reads task->mm locklessly and acquires
> mmap_read_trylock() but never calls mmget(). If the task exits
> concurrently, the mm_struct can be freed as it is not
> SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.
>
> Use get_task_mm() to safely read task->mm under task_lock() and acquire
> an mm reference. Drop the reference via bpf_iter_mmput_async() in
> _destroy() and error paths. bpf_iter_mmput_async() is a local wrapper
> around mmput_async() with a fallback to mmput() on !CONFIG_MMU.
>
> Reject irqs-disabled contexts (including NMI) up front. Operations used
> by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
> NMI or from a tracepoint that fires with those locks held could
> deadlock. Disable IRQs around get_task_mm() to prevent raw tracepoint
> re-entrancy from deadlocking on task_lock().
>
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 45 +++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 40 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 98d9b4c0daff..faf4d6197608 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
> #include <linux/bpf_mem_alloc.h>
> #include <linux/btf_ids.h>
> #include <linux/mm_types.h>
> +#include <linux/sched/mm.h>
> #include "mmap_unlock_work.h"
>
> static const char * const iter_task_type_names[] = {
> @@ -794,6 +795,15 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> .arg5_type = ARG_ANYTHING,
> };
>
> +static inline void bpf_iter_mmput_async(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_MMU
> + mmput_async(mm);
> +#else
> + mmput(mm);
> +#endif
> +}
> +
> struct bpf_iter_task_vma_kern_data {
> struct task_struct *task;
> struct mm_struct *mm;
> @@ -825,6 +835,18 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>
> + /*
> + * Reject irqs-disabled contexts including NMI. Operations used
> + * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> + * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
> + * Running from NMI or from a tracepoint that fires with those
> + * locks held could deadlock.
> + */
> + if (irqs_disabled()) {
> + kit->data = NULL;
> + return -EBUSY;
> + }
> +
> /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
> * before, so non-NULL kit->data doesn't point to previously
> * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> @@ -834,7 +856,13 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> return -ENOMEM;
>
> kit->data->task = get_task_struct(task);
> - kit->data->mm = task->mm;
> + /*
> + * Disable IRQs so that a raw tracepoint re-entering the
> + * iterator on this CPU cannot deadlock on task_lock().
> + */
> + local_irq_disable();
> + kit->data->mm = get_task_mm(task);
> + local_irq_enable();
> if (!kit->data->mm) {
> err = -ENOENT;
> goto err_cleanup_iter;
> @@ -842,17 +870,23 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>
> /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> - if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
> + if (irq_work_busy) {
> err = -EBUSY;
> - goto err_cleanup_iter;
> + goto err_cleanup_mmget;
> + }
> +
> + if (!mmap_read_trylock(kit->data->mm)) {
> + err = -EBUSY;
> + goto err_cleanup_mmget;
> }
what's the purpose of splitting the single if into two equivalent ifs
(with the same cleanup handling and error code, as far as I can see)?
Plus you are removing it in the next patch anyway...
>
> vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> return 0;
>
> +err_cleanup_mmget:
> + bpf_iter_mmput_async(kit->data->mm);
> err_cleanup_iter:
> - if (kit->data->task)
> - put_task_struct(kit->data->task);
> + put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> /* NULL kit->data signals failed bpf_iter_task_vma initialization */
> kit->data = NULL;
> @@ -875,6 +909,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> if (kit->data) {
> bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> put_task_struct(kit->data->task);
> + bpf_iter_mmput_async(kit->data->mm);
> bpf_mem_free(&bpf_global_ma, kit->data);
> }
> }
> --
> 2.52.0
>
* Re: [RESEND PATCH bpf v5 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
2026-03-30 23:24 ` Andrii Nakryiko
@ 2026-03-30 23:37 ` Puranjay Mohan
0 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-03-30 23:37 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Mykyta Yatsenko, kernel-team
On Tue, Mar 31, 2026 at 12:24 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Mar 26, 2026 at 8:11 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> > The open-coded task_vma iterator reads task->mm locklessly and acquires
> > mmap_read_trylock() but never calls mmget(). If the task exits
> > concurrently, the mm_struct can be freed as it is not
> > SLAB_TYPESAFE_BY_RCU, resulting in a use-after-free.
> >
> > Use get_task_mm() to safely read task->mm under task_lock() and acquire
> > an mm reference. Drop the reference via bpf_iter_mmput_async() in
> > _destroy() and error paths. bpf_iter_mmput_async() is a local wrapper
> > around mmput_async() with a fallback to mmput() on !CONFIG_MMU.
> >
> > Reject irqs-disabled contexts (including NMI) up front. Operations used
> > by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> > take spinlocks with IRQs disabled (pool->lock, pi_lock). Running from
> > NMI or from a tracepoint that fires with those locks held could
> > deadlock. Disable IRQs around get_task_mm() to prevent raw tracepoint
> > re-entrancy from deadlocking on task_lock().
> >
> > Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> > kernel/bpf/task_iter.c | 45 +++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 40 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > index 98d9b4c0daff..faf4d6197608 100644
> > --- a/kernel/bpf/task_iter.c
> > +++ b/kernel/bpf/task_iter.c
> > @@ -9,6 +9,7 @@
> > #include <linux/bpf_mem_alloc.h>
> > #include <linux/btf_ids.h>
> > #include <linux/mm_types.h>
> > +#include <linux/sched/mm.h>
> > #include "mmap_unlock_work.h"
> >
> > static const char * const iter_task_type_names[] = {
> > @@ -794,6 +795,15 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> > .arg5_type = ARG_ANYTHING,
> > };
> >
> > +static inline void bpf_iter_mmput_async(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_MMU
> > + mmput_async(mm);
> > +#else
> > + mmput(mm);
> > +#endif
> > +}
> > +
> > struct bpf_iter_task_vma_kern_data {
> > struct task_struct *task;
> > struct mm_struct *mm;
> > @@ -825,6 +835,18 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> > BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> > BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
> >
> > + /*
> > + * Reject irqs-disabled contexts including NMI. Operations used
> > + * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> > + * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
> > + * Running from NMI or from a tracepoint that fires with those
> > + * locks held could deadlock.
> > + */
> > + if (irqs_disabled()) {
> > + kit->data = NULL;
> > + return -EBUSY;
> > + }
> > +
> > /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
> > * before, so non-NULL kit->data doesn't point to previously
> > * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> > @@ -834,7 +856,13 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> > return -ENOMEM;
> >
> > kit->data->task = get_task_struct(task);
> > - kit->data->mm = task->mm;
> > + /*
> > + * Disable IRQs so that a raw tracepoint re-entering the
> > + * iterator on this CPU cannot deadlock on task_lock().
> > + */
> > + local_irq_disable();
> > + kit->data->mm = get_task_mm(task);
> > + local_irq_enable();
> > if (!kit->data->mm) {
> > err = -ENOENT;
> > goto err_cleanup_iter;
> > @@ -842,17 +870,23 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> >
> > /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> > irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> > - if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
> > + if (irq_work_busy) {
> > err = -EBUSY;
> > - goto err_cleanup_iter;
> > + goto err_cleanup_mmget;
> > + }
> > +
> > + if (!mmap_read_trylock(kit->data->mm)) {
> > + err = -EBUSY;
> > + goto err_cleanup_mmget;
> > }
>
> what's the purpose of this split of single if into two equivalent if
> (with the same cleanup handling and error code, as far as I can see)?
> Plus you are removing it in the next patch anyways...
I was trying something else earlier and missed undoing this. Will
change it in the next version.
* [RESEND PATCH bpf v5 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
2026-03-26 15:11 [RESEND PATCH bpf v5 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
2026-03-26 15:11 ` [RESEND PATCH bpf v5 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
@ 2026-03-26 15:11 ` Puranjay Mohan
2026-03-30 23:31 ` Andrii Nakryiko
2026-03-26 15:11 ` [RESEND PATCH bpf v5 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2026-03-26 15:11 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
kernel-team
The open-coded task_vma iterator holds mmap_lock for the entire duration
of iteration, increasing contention on this highly contended lock.
Switch to per-VMA locking. Find the next VMA via an RCU-protected maple
tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not
used because its fallback takes mmap_read_lock(), and the iterator must
work in non-sleepable contexts.
lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA
containing a given address but cannot iterate across gaps. An
RCU-protected vma_next() walk (mas_find) first locates the next VMA's
vm_start to pass to lock_vma_under_rcu().
Between the RCU walk and the lock, the VMA may be removed, shrunk, or
write-locked. On failure, advance past it using vm_end from the RCU
walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be
stale; fall back to PAGE_SIZE advancement when it does not make forward
progress. Concurrent VMA insertions at addresses already passed by the
iterator are not detected.
CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it.
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
kernel/bpf/task_iter.c | 98 +++++++++++++++++++++++++++++++++---------
1 file changed, 77 insertions(+), 21 deletions(-)
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index faf4d6197608..46120745ee84 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -9,6 +9,7 @@
#include <linux/bpf_mem_alloc.h>
#include <linux/btf_ids.h>
#include <linux/mm_types.h>
+#include <linux/mmap_lock.h>
#include <linux/sched/mm.h>
#include "mmap_unlock_work.h"
@@ -807,8 +808,8 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
struct bpf_iter_task_vma_kern_data {
struct task_struct *task;
struct mm_struct *mm;
- struct mmap_unlock_irq_work *work;
- struct vma_iterator vmi;
+ struct vm_area_struct *locked_vma;
+ u64 next_addr;
};
struct bpf_iter_task_vma {
@@ -829,15 +830,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
struct task_struct *task, u64 addr)
{
struct bpf_iter_task_vma_kern *kit = (void *)it;
- bool irq_work_busy = false;
int err;
BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
+ if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
+ kit->data = NULL;
+ return -EOPNOTSUPP;
+ }
+
/*
* Reject irqs-disabled contexts including NMI. Operations used
- * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
+ * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
* can take spinlocks with IRQs disabled (pi_lock, pool->lock).
* Running from NMI or from a tracepoint that fires with those
* locks held could deadlock.
@@ -868,23 +873,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
goto err_cleanup_iter;
}
- /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
- irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
- if (irq_work_busy) {
- err = -EBUSY;
- goto err_cleanup_mmget;
- }
-
- if (!mmap_read_trylock(kit->data->mm)) {
- err = -EBUSY;
- goto err_cleanup_mmget;
- }
-
- vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
+ kit->data->locked_vma = NULL;
+ kit->data->next_addr = addr;
return 0;
-err_cleanup_mmget:
- bpf_iter_mmput_async(kit->data->mm);
err_cleanup_iter:
put_task_struct(kit->data->task);
bpf_mem_free(&bpf_global_ma, kit->data);
@@ -893,13 +885,76 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
return err;
}
+/*
+ * Find and lock the next VMA at or after data->next_addr.
+ *
+ * lock_vma_under_rcu() is a point lookup (mas_walk): it finds the VMA
+ * containing a given address but cannot iterate. An RCU-protected
+ * maple tree walk with vma_next() (mas_find) is needed first to locate
+ * the next VMA's vm_start across any gap.
+ *
+ * Between the RCU walk and the lock, the VMA may be removed, shrunk,
+ * or write-locked. On failure, advance past it using vm_end from the
+ * RCU walk. SLAB_TYPESAFE_BY_RCU can make vm_end stale, so fall back
+ * to PAGE_SIZE advancement to guarantee forward progress.
+ */
+static struct vm_area_struct *
+bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
+{
+ struct vm_area_struct *vma;
+ struct vma_iterator vmi;
+ unsigned long start, end;
+
+retry:
+ rcu_read_lock();
+ vma_iter_init(&vmi, data->mm, data->next_addr);
+ vma = vma_next(&vmi);
+ if (!vma) {
+ rcu_read_unlock();
+ return NULL;
+ }
+ start = vma->vm_start;
+ end = vma->vm_end;
+ rcu_read_unlock();
+
+ vma = lock_vma_under_rcu(data->mm, start);
+ if (!vma) {
+ if (end > data->next_addr)
+ data->next_addr = end;
+ else
+ data->next_addr += PAGE_SIZE;
+ goto retry;
+ }
+
+ if (unlikely(data->next_addr >= vma->vm_end)) {
+ data->next_addr += PAGE_SIZE;
+ vma_end_read(vma);
+ goto retry;
+ }
+
+ return vma;
+}
+
__bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
{
struct bpf_iter_task_vma_kern *kit = (void *)it;
+ struct vm_area_struct *vma;
if (!kit->data) /* bpf_iter_task_vma_new failed */
return NULL;
- return vma_next(&kit->data->vmi);
+
+ if (kit->data->locked_vma) {
+ vma_end_read(kit->data->locked_vma);
+ kit->data->locked_vma = NULL;
+ }
+
+ vma = bpf_iter_task_vma_find_next(kit->data);
+ if (!vma)
+ return NULL;
+
+ kit->data->locked_vma = vma;
+ kit->data->next_addr = vma->vm_end;
+ return vma;
}
__bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -907,7 +962,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
struct bpf_iter_task_vma_kern *kit = (void *)it;
if (kit->data) {
- bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
+ if (kit->data->locked_vma)
+ vma_end_read(kit->data->locked_vma);
put_task_struct(kit->data->task);
bpf_iter_mmput_async(kit->data->mm);
bpf_mem_free(&bpf_global_ma, kit->data);
--
2.52.0
* Re: [RESEND PATCH bpf v5 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
2026-03-26 15:11 ` [RESEND PATCH bpf v5 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
@ 2026-03-30 23:31 ` Andrii Nakryiko
2026-03-30 23:36 ` Puranjay Mohan
0 siblings, 1 reply; 9+ messages in thread
From: Andrii Nakryiko @ 2026-03-30 23:31 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Mykyta Yatsenko, kernel-team
On Thu, Mar 26, 2026 at 8:11 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> The open-coded task_vma iterator holds mmap_lock for the entire duration
> of iteration, increasing contention on this highly contended lock.
>
> Switch to per-VMA locking. Find the next VMA via an RCU-protected maple
> tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not
> used because its fallback takes mmap_read_lock(), and the iterator must
> work in non-sleepable contexts.
>
> lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA
> containing a given address but cannot iterate across gaps. An
> RCU-protected vma_next() walk (mas_find) first locates the next VMA's
> vm_start to pass to lock_vma_under_rcu().
>
> Between the RCU walk and the lock, the VMA may be removed, shrunk, or
> write-locked. On failure, advance past it using vm_end from the RCU
> walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be
> stale; fall back to PAGE_SIZE advancement when it does not make forward
> progress. Concurrent VMA insertions at addresses already passed by the
> iterator are not detected.
>
> CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 98 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 77 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index faf4d6197608..46120745ee84 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
> #include <linux/bpf_mem_alloc.h>
> #include <linux/btf_ids.h>
> #include <linux/mm_types.h>
> +#include <linux/mmap_lock.h>
> #include <linux/sched/mm.h>
> #include "mmap_unlock_work.h"
>
> @@ -807,8 +808,8 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
> struct bpf_iter_task_vma_kern_data {
> struct task_struct *task;
> struct mm_struct *mm;
> - struct mmap_unlock_irq_work *work;
> - struct vma_iterator vmi;
> + struct vm_area_struct *locked_vma;
> + u64 next_addr;
> };
>
> struct bpf_iter_task_vma {
> @@ -829,15 +830,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> struct task_struct *task, u64 addr)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> - bool irq_work_busy = false;
> int err;
>
> BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>
> + if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> + kit->data = NULL;
> + return -EOPNOTSUPP;
> + }
> +
> /*
> * Reject irqs-disabled contexts including NMI. Operations used
> - * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> + * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
> * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
> * Running from NMI or from a tracepoint that fires with those
> * locks held could deadlock.
> @@ -868,23 +873,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> goto err_cleanup_iter;
> }
>
> - /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> - irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> - if (irq_work_busy) {
> - err = -EBUSY;
> - goto err_cleanup_mmget;
> - }
> -
> - if (!mmap_read_trylock(kit->data->mm)) {
> - err = -EBUSY;
> - goto err_cleanup_mmget;
> - }
> -
> - vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> + kit->data->locked_vma = NULL;
> + kit->data->next_addr = addr;
> return 0;
>
> -err_cleanup_mmget:
> - bpf_iter_mmput_async(kit->data->mm);
> err_cleanup_iter:
> put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> @@ -893,13 +885,76 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> return err;
> }
>
> +/*
> + * Find and lock the next VMA at or after data->next_addr.
> + *
> + * lock_vma_under_rcu() is a point lookup (mas_walk): it finds the VMA
> + * containing a given address but cannot iterate. An RCU-protected
> + * maple tree walk with vma_next() (mas_find) is needed first to locate
> + * the next VMA's vm_start across any gap.
> + *
> + * Between the RCU walk and the lock, the VMA may be removed, shrunk,
> + * or write-locked. On failure, advance past it using vm_end from the
> + * RCU walk. SLAB_TYPESAFE_BY_RCU can make vm_end stale, so fall back
> + * to PAGE_SIZE advancement to guarantee forward progress.
> + */
> +static struct vm_area_struct *
> +bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
> +{
> + struct vm_area_struct *vma;
> + struct vma_iterator vmi;
> + unsigned long start, end;
> +
> +retry:
> + rcu_read_lock();
> + vma_iter_init(&vmi, data->mm, data->next_addr);
> + vma = vma_next(&vmi);
> + if (!vma) {
> + rcu_read_unlock();
> + return NULL;
> + }
> + start = vma->vm_start;
> + end = vma->vm_end;
> + rcu_read_unlock();
> +
> + vma = lock_vma_under_rcu(data->mm, start);
> + if (!vma) {
> + if (end > data->next_addr)
here you have vma->vm_end on the left side
> + data->next_addr = end;
> + else
> + data->next_addr += PAGE_SIZE;
> + goto retry;
> + }
> +
> + if (unlikely(data->next_addr >= vma->vm_end)) {
here you have it on the right side... which just makes it harder to
compare two cases for no good reason...
here you actually do the "else branch" but for an already-locked VMA,
so make it easy to see the analogy:
vma = lock_vma_under_rcu(data->mm, start);
if (!vma) {
if (end <= data->next_addr)
data->next_addr += PAGE_SIZE;
else
data->next_addr = end;
goto retry;
}
if (unlikely(vma->vm_end <= data->next_addr)) {
data->next_addr += PAGE_SIZE;
vma_end_read(vma);
goto retry;
}
isn't it easier to follow?
> + data->next_addr += PAGE_SIZE;
> + vma_end_read(vma);
> + goto retry;
> + }
> +
> + return vma;
> +}
> +
> __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> + struct vm_area_struct *vma;
>
> if (!kit->data) /* bpf_iter_task_vma_new failed */
> return NULL;
> - return vma_next(&kit->data->vmi);
> +
> + if (kit->data->locked_vma) {
> + vma_end_read(kit->data->locked_vma);
> + kit->data->locked_vma = NULL;
> + }
> +
> + vma = bpf_iter_task_vma_find_next(kit->data);
> + if (!vma)
> + return NULL;
> +
> + kit->data->locked_vma = vma;
> + kit->data->next_addr = vma->vm_end;
> + return vma;
> }
>
> __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -907,7 +962,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> struct bpf_iter_task_vma_kern *kit = (void *)it;
>
> if (kit->data) {
> - bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> + if (kit->data->locked_vma)
> + vma_end_read(kit->data->locked_vma);
> put_task_struct(kit->data->task);
> bpf_iter_mmput_async(kit->data->mm);
> bpf_mem_free(&bpf_global_ma, kit->data);
> --
> 2.52.0
>
* Re: [RESEND PATCH bpf v5 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
2026-03-30 23:31 ` Andrii Nakryiko
@ 2026-03-30 23:36 ` Puranjay Mohan
0 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-03-30 23:36 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Mykyta Yatsenko, kernel-team
On Tue, Mar 31, 2026 at 12:31 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, Mar 26, 2026 at 8:11 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> > The open-coded task_vma iterator holds mmap_lock for the entire duration
> > of iteration, increasing contention on this highly contended lock.
> >
> > Switch to per-VMA locking. Find the next VMA via an RCU-protected maple
> > tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not
> > used because its fallback takes mmap_read_lock(), and the iterator must
> > work in non-sleepable contexts.
> >
> > lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA
> > containing a given address but cannot iterate across gaps. An
> > RCU-protected vma_next() walk (mas_find) first locates the next VMA's
> > vm_start to pass to lock_vma_under_rcu().
> >
> > Between the RCU walk and the lock, the VMA may be removed, shrunk, or
> > write-locked. On failure, advance past it using vm_end from the RCU
> > walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be
> > stale; fall back to PAGE_SIZE advancement when it does not make forward
> > progress. Concurrent VMA insertions at addresses already passed by the
> > iterator are not detected.
> >
> > CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it.
> >
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> > kernel/bpf/task_iter.c | 98 +++++++++++++++++++++++++++++++++---------
> > 1 file changed, 77 insertions(+), 21 deletions(-)
> >
> > diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> > index faf4d6197608..46120745ee84 100644
> > --- a/kernel/bpf/task_iter.c
> > +++ b/kernel/bpf/task_iter.c
> > @@ -9,6 +9,7 @@
> > #include <linux/bpf_mem_alloc.h>
> > #include <linux/btf_ids.h>
> > #include <linux/mm_types.h>
> > +#include <linux/mmap_lock.h>
> > #include <linux/sched/mm.h>
> > #include "mmap_unlock_work.h"
> >
> > @@ -807,8 +808,8 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
> > struct bpf_iter_task_vma_kern_data {
> > struct task_struct *task;
> > struct mm_struct *mm;
> > - struct mmap_unlock_irq_work *work;
> > - struct vma_iterator vmi;
> > + struct vm_area_struct *locked_vma;
> > + u64 next_addr;
> > };
> >
> > struct bpf_iter_task_vma {
> > @@ -829,15 +830,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> > struct task_struct *task, u64 addr)
> > {
> > struct bpf_iter_task_vma_kern *kit = (void *)it;
> > - bool irq_work_busy = false;
> > int err;
> >
> > BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> > BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
> >
> > + if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> > + kit->data = NULL;
> > + return -EOPNOTSUPP;
> > + }
> > +
> > /*
> > * Reject irqs-disabled contexts including NMI. Operations used
> > - * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async)
> > + * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
> > * can take spinlocks with IRQs disabled (pi_lock, pool->lock).
> > * Running from NMI or from a tracepoint that fires with those
> > * locks held could deadlock.
> > @@ -868,23 +873,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> > goto err_cleanup_iter;
> > }
> >
> > - /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> > - irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> > - if (irq_work_busy) {
> > - err = -EBUSY;
> > - goto err_cleanup_mmget;
> > - }
> > -
> > - if (!mmap_read_trylock(kit->data->mm)) {
> > - err = -EBUSY;
> > - goto err_cleanup_mmget;
> > - }
> > -
> > - vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> > + kit->data->locked_vma = NULL;
> > + kit->data->next_addr = addr;
> > return 0;
> >
> > -err_cleanup_mmget:
> > - bpf_iter_mmput_async(kit->data->mm);
> > err_cleanup_iter:
> > put_task_struct(kit->data->task);
> > bpf_mem_free(&bpf_global_ma, kit->data);
> > @@ -893,13 +885,76 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> > return err;
> > }
> >
> > +/*
> > + * Find and lock the next VMA at or after data->next_addr.
> > + *
> > + * lock_vma_under_rcu() is a point lookup (mas_walk): it finds the VMA
> > + * containing a given address but cannot iterate. An RCU-protected
> > + * maple tree walk with vma_next() (mas_find) is needed first to locate
> > + * the next VMA's vm_start across any gap.
> > + *
> > + * Between the RCU walk and the lock, the VMA may be removed, shrunk,
> > + * or write-locked. On failure, advance past it using vm_end from the
> > + * RCU walk. SLAB_TYPESAFE_BY_RCU can make vm_end stale, so fall back
> > + * to PAGE_SIZE advancement to guarantee forward progress.
> > + */
> > +static struct vm_area_struct *
> > +bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
> > +{
> > + struct vm_area_struct *vma;
> > + struct vma_iterator vmi;
> > + unsigned long start, end;
> > +
> > +retry:
> > + rcu_read_lock();
> > + vma_iter_init(&vmi, data->mm, data->next_addr);
> > + vma = vma_next(&vmi);
> > + if (!vma) {
> > + rcu_read_unlock();
> > + return NULL;
> > + }
> > + start = vma->vm_start;
> > + end = vma->vm_end;
> > + rcu_read_unlock();
> > +
> > + vma = lock_vma_under_rcu(data->mm, start);
> > + if (!vma) {
> > + if (end > data->next_addr)
>
> here you have vma->vm_end on the left side
>
> > + data->next_addr = end;
> > + else
> > + data->next_addr += PAGE_SIZE;
> > + goto retry;
> > + }
> > +
> > + if (unlikely(data->next_addr >= vma->vm_end)) {
>
> here you have it on the right side... which just makes it harder to
> compare two cases for no good reason...
>
> here you actually do the "else branch" but for already locked VMA, so
> make it easy to see this analogy:
>
>
> vma = lock_vma_under_rcu(data->mm, start);
> if (!vma) {
> if (end <= data->next_addr)
> data->next_addr += PAGE_SIZE;
> else
> data->next_addr = end;
> goto retry;
> }
>
> if (unlikely(vma->vm_end <= data->next_addr)) {
> data->next_addr += PAGE_SIZE;
> vma_end_read(vma);
> goto retry;
> }
>
> isn't it easier to follow?
Yes, I agree. Will use this approach in the next version.
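As an aside, the two arrangements compute the same cursor update; a small user-space model (hypothetical helper names, not the kernel functions) makes the equivalence checkable:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Advancement rule as posted in the patch: jump to the (possibly
 * stale) vm_end when it is ahead of the cursor, else take a
 * one-page step to guarantee forward progress. */
static unsigned long advance_v1(unsigned long end, unsigned long next_addr)
{
	if (end > next_addr)
		return end;
	return next_addr + PAGE_SIZE;
}

/* Same rule with the branches flipped, as suggested in the review,
 * so the comparison reads the same way as the locked-VMA check. */
static unsigned long advance_v2(unsigned long end, unsigned long next_addr)
{
	if (end <= next_addr)
		return next_addr + PAGE_SIZE;
	return end;
}
```

Either form retries from vm_end when it moves the cursor forward and from next_addr + PAGE_SIZE otherwise; only the readability differs.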
* [RESEND PATCH bpf v5 3/3] bpf: return VMA snapshot from task_vma iterator
From: Puranjay Mohan @ 2026-03-26 15:11 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Mykyta Yatsenko,
kernel-team
Holding the per-VMA lock across the BPF program body creates a lock
ordering problem when helpers acquire locks that depend on mmap_lock:
vm_lock -> i_rwsem -> mmap_lock -> vm_lock
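The chain above is the classic lock-ordering hazard. As an illustrative aside (toy model, not kernel code), a lockdep-style checker flags such a chain by recording "A held while acquiring B" edges and looking for a cycle:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy lock-order checker. Indices are illustrative:
 * 0 = vm_lock, 1 = i_rwsem, 2 = mmap_lock. */
#define NLOCKS 3

static bool edge[NLOCKS][NLOCKS];

/* Record that lock 'a' has been observed held while acquiring 'b'. */
static void taken_before(int a, int b)
{
	edge[a][b] = true;
}

/* A cycle in the ordering graph means some interleaving can deadlock.
 * Compute the transitive closure, then check for self-reachability. */
static bool has_cycle(void)
{
	bool reach[NLOCKS][NLOCKS];
	int i, j, k;

	memcpy(reach, edge, sizeof(reach));
	for (k = 0; k < NLOCKS; k++)
		for (i = 0; i < NLOCKS; i++)
			for (j = 0; j < NLOCKS; j++)
				if (reach[i][k] && reach[k][j])
					reach[i][j] = true;
	for (i = 0; i < NLOCKS; i++)
		if (reach[i][i])
			return true;
	return false;
}
```

With the edges vm_lock -> i_rwsem and i_rwsem -> mmap_lock alone there is no cycle; adding mmap_lock -> vm_lock closes it, which is exactly what holding the per-VMA lock across the program body would do.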
Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
drop the lock before returning. The BPF program accesses only the
snapshot.
The verifier only trusts vm_mm and vm_file pointers (see
BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
counted with get_file() under the lock and released via fput() on the
next iteration or in _destroy(). vm_mm is already correct because
lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
are left as-is by memcpy() since the verifier treats them as untrusted.
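The vm_file handoff described above (pin under the per-VMA lock in _next(), release the previous snapshot's reference on the next iteration or in _destroy()) can be sketched as a user-space refcount model; all names here are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for struct file and get_file()/fput(). */
struct file_model {
	int refcount;
};

static void get_file_model(struct file_model *f)
{
	f->refcount++;
}

static void fput_model(struct file_model *f)
{
	f->refcount--;
}

/* The snapshot keeps at most one reference, on its vm_file. */
struct snapshot_model {
	struct file_model *vm_file;
};

/* Mirrors bpf_iter_task_vma_snapshot_reset(): drop the previous
 * iteration's reference; used by both _next() and _destroy(). */
static void snapshot_reset(struct snapshot_model *snap)
{
	if (snap->vm_file) {
		fput_model(snap->vm_file);
		snap->vm_file = NULL;
	}
}

/* Mirrors the _next() path: release the old snapshot's reference,
 * then pin the new vm_file before the per-VMA lock is dropped. */
static void snapshot_take(struct snapshot_model *snap, struct file_model *f)
{
	snapshot_reset(snap);
	snap->vm_file = f;
	if (snap->vm_file)
		get_file_model(snap->vm_file);
}
```

Each iteration thus holds exactly one vm_file reference at a time, and _destroy() leaves none behind.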
Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
kernel/bpf/task_iter.c | 42 ++++++++++++++++++++++++++++++------------
1 file changed, 30 insertions(+), 12 deletions(-)
diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 46120745ee84..29aceac4c696 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -808,7 +808,7 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm)
struct bpf_iter_task_vma_kern_data {
struct task_struct *task;
struct mm_struct *mm;
- struct vm_area_struct *locked_vma;
+ struct vm_area_struct snapshot;
u64 next_addr;
};
@@ -842,7 +842,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
/*
* Reject irqs-disabled contexts including NMI. Operations used
- * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async)
+ * by _next() and _destroy() (vma_end_read, fput, bpf_iter_mmput_async)
* can take spinlocks with IRQs disabled (pi_lock, pool->lock).
* Running from NMI or from a tracepoint that fires with those
* locks held could deadlock.
@@ -873,7 +873,7 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
goto err_cleanup_iter;
}
- kit->data->locked_vma = NULL;
+ kit->data->snapshot.vm_file = NULL;
kit->data->next_addr = addr;
return 0;
@@ -935,26 +935,45 @@ bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data)
return vma;
}
+static void bpf_iter_task_vma_snapshot_reset(struct vm_area_struct *snap)
+{
+ if (snap->vm_file) {
+ fput(snap->vm_file);
+ snap->vm_file = NULL;
+ }
+}
+
__bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
{
struct bpf_iter_task_vma_kern *kit = (void *)it;
- struct vm_area_struct *vma;
+ struct vm_area_struct *snap, *vma;
if (!kit->data) /* bpf_iter_task_vma_new failed */
return NULL;
- if (kit->data->locked_vma) {
- vma_end_read(kit->data->locked_vma);
- kit->data->locked_vma = NULL;
- }
+ snap = &kit->data->snapshot;
+
+ bpf_iter_task_vma_snapshot_reset(snap);
vma = bpf_iter_task_vma_find_next(kit->data);
if (!vma)
return NULL;
- kit->data->locked_vma = vma;
+ memcpy(snap, vma, sizeof(*snap));
+
+ /*
+ * The verifier only trusts vm_mm and vm_file (see
+ * BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). Take a reference
+ * on vm_file; vm_mm is already correct because lock_vma_under_rcu()
+ * verifies vma->vm_mm == mm. All other pointers are untrusted by
+ * the verifier and left as-is.
+ */
+ if (snap->vm_file)
+ get_file(snap->vm_file);
+
kit->data->next_addr = vma->vm_end;
- return vma;
+ vma_end_read(vma);
+ return snap;
}
__bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
@@ -962,8 +981,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
struct bpf_iter_task_vma_kern *kit = (void *)it;
if (kit->data) {
- if (kit->data->locked_vma)
- vma_end_read(kit->data->locked_vma);
+ bpf_iter_task_vma_snapshot_reset(&kit->data->snapshot);
put_task_struct(kit->data->task);
bpf_iter_mmput_async(kit->data->mm);
bpf_mem_free(&bpf_global_ma, kit->data);
--
2.52.0
* Re: [RESEND PATCH bpf v5 3/3] bpf: return VMA snapshot from task_vma iterator
From: Andrii Nakryiko @ 2026-03-30 23:34 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Mykyta Yatsenko, kernel-team
On Thu, Mar 26, 2026 at 8:11 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Holding the per-VMA lock across the BPF program body creates a lock
> ordering problem when helpers acquire locks that depend on mmap_lock:
>
> vm_lock -> i_rwsem -> mmap_lock -> vm_lock
>
> Snapshot the VMA under the per-VMA lock in _next() via memcpy(), then
> drop the lock before returning. The BPF program accesses only the
> snapshot.
>
> The verifier only trusts vm_mm and vm_file pointers (see
> BTF_TYPE_SAFE_TRUSTED_OR_NULL in verifier.c). vm_file is reference-
> counted with get_file() under the lock and released via fput() on the
> next iteration or in _destroy(). vm_mm is already correct because
> lock_vma_under_rcu() verifies vma->vm_mm == mm. All other pointers
> are left as-is by memcpy() since the verifier treats them as untrusted.
>
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 42 ++++++++++++++++++++++++++++++------------
> 1 file changed, 30 insertions(+), 12 deletions(-)
>
looks nice and clean
Acked-by: Andrii Nakryiko <andrii@kernel.org>
[...]