From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Puranjay Mohan <puranjay@kernel.org>, bpf@vger.kernel.org
Cc: Puranjay Mohan <puranjay@kernel.org>,
Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
kernel-team@meta.com
Subject: Re: [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
Date: Thu, 05 Mar 2026 18:47:58 +0000
Message-ID: <87jyvqxgdd.fsf@gmail.com>
In-Reply-To: <20260304142026.1443666-3-puranjay@kernel.org>

Puranjay Mohan <puranjay@kernel.org> writes:

> The open-coded task_vma BPF iterator holds mmap_lock for the entire
> duration of the iteration, increasing contention on this highly
> contended lock.
>
> Switch to per-VMA locking. In _next(), the next VMA is found via an
> RCU-protected maple tree walk, then locked with lock_vma_under_rcu()
> at its vm_start address. lock_next_vma() is not used because its
> fallback path takes mmap_read_lock(), and the iterator must work in
> non-sleepable contexts.
>
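
To make sure I read the lifetime rules right: from the BPF program side
the usage is unchanged, roughly (my sketch using the kfuncs touched by
this patch):

	struct bpf_iter_task_vma vma_it;
	struct vm_area_struct *vma;

	if (!bpf_iter_task_vma_new(&vma_it, task, 0)) {
		while ((vma = bpf_iter_task_vma_next(&vma_it))) {
			/* vma is read-locked only while it is the
			 * current element; _next() drops that lock
			 * before taking the next one
			 */
		}
	}
	bpf_iter_task_vma_destroy(&vma_it);

so after this change each _next() call pins a single VMA instead of the
whole address space staying read-locked across the loop.
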
> Between the RCU walk and the lock attempt, the VMA may be removed,
> shrunk, or write-locked. When lock_vma_under_rcu() fails or the locked
> VMA was modified, the iterator advances past it and retries using vm_end
> saved from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU,
> individual objects can be freed and immediately reused within an RCU
> critical section. A VMA found by the maple tree walk may be recycled for
> a different mm before its fields are read, making the captured vm_end
> stale. When vm_end is stale and no longer ahead of the iteration
> position, the iterator falls back to PAGE_SIZE advancement to guarantee
> forward progress. VMAs inserted in gaps between iterations cannot be
> detected without mmap_lock speculation.
>
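
To double-check my understanding of the recycling case: suppose the RCU
walk returns a VMA spanning [A, B), we save next_end = B, and the object
is then freed and reused for another mm before lock_vma_under_rcu()
runs. The fallback below only guarantees progress because of the
PAGE_SIZE bump:

	/* after a failed lock attempt; next_end may be stale */
	if (next_end > kit->data->last_addr)
		kit->data->last_addr = next_end;	/* skip the dead VMA */
	else
		kit->data->last_addr += PAGE_SIZE;	/* stale: creep forward */

i.e. a recycled VMA whose new vm_end happens to be <= last_addr cannot
stall the iterator. Looks correct to me.
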
> The mm_struct is kept alive with mmget()/bpf_iter_mmput(). The
> bpf_mmap_unlock_get_irq_work() check is no longer needed since
> mmap_lock is no longer held; bpf_iter_mmput_busy() remains to guard the
> mmput irq_work slot. CONFIG_PER_VMA_LOCK is required; -EOPNOTSUPP is
> returned without it.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 72 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 53 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index d3fa8ba0a896..ff29d4da0267 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
> #include <linux/bpf_mem_alloc.h>
> #include <linux/btf_ids.h>
> #include <linux/mm_types.h>
> +#include <linux/mmap_lock.h>
> #include "mmap_unlock_work.h"
>
> static const char * const iter_task_type_names[] = {
> @@ -797,8 +798,8 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> struct bpf_iter_task_vma_kern_data {
> struct task_struct *task;
> struct mm_struct *mm;
> - struct mmap_unlock_irq_work *work;
> - struct vma_iterator vmi;
> + struct vm_area_struct *locked_vma;
> + u64 last_addr;
> };
>
> struct bpf_iter_task_vma {
> @@ -868,12 +869,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> struct task_struct *task, u64 addr)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> - bool irq_work_busy = false;
> int err;
>
> BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>
> + if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> + kit->data = NULL;
> + return -EOPNOTSUPP;
> + }
> +
> /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
> * before, so non-NULL kit->data doesn't point to previously
> * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> @@ -890,13 +895,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> }
>
> /*
> - * Check irq_work availability for both mmap_lock release and mmput.
> - * Both use separate per-CPU irq_work slots, and both must be free
> - * to guarantee _destroy() can complete from NMI context.
> - * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> + * Ensure the mmput irq_work slot is available so _destroy() can
> + * safely drop the mm reference from NMI context.
> */
> - irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> - if (irq_work_busy || bpf_iter_mmput_busy()) {
> + if (bpf_iter_mmput_busy()) {
> err = -EBUSY;
> goto err_cleanup_iter;
> }
> @@ -906,16 +908,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> goto err_cleanup_iter;
> }
>
> - if (!mmap_read_trylock(kit->data->mm)) {
> - err = -EBUSY;
> - goto err_cleanup_mmget;
> - }
> -
> - vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> + kit->data->locked_vma = NULL;
> + kit->data->last_addr = addr;
> return 0;
>
> -err_cleanup_mmget:
> - bpf_iter_mmput(kit->data->mm);
> err_cleanup_iter:
> put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> @@ -927,10 +923,47 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> + struct vm_area_struct *vma;
> + struct vma_iterator vmi;
> + unsigned long next_addr, next_end;
>
> if (!kit->data) /* bpf_iter_task_vma_new failed */
> return NULL;
> - return vma_next(&kit->data->vmi);
> +
> + if (kit->data->locked_vma)
> + vma_end_read(kit->data->locked_vma);
> +
> +retry:
> + rcu_read_lock();
> + vma_iter_init(&vmi, kit->data->mm, kit->data->last_addr);
> + vma = vma_next(&vmi);
> + if (!vma) {
> + rcu_read_unlock();
> + kit->data->locked_vma = NULL;
> + return NULL;
> + }
> + next_addr = vma->vm_start;
> + next_end = vma->vm_end;
> + rcu_read_unlock();
> +
> + vma = lock_vma_under_rcu(kit->data->mm, next_addr);
> + if (!vma) {
> + if (next_end > kit->data->last_addr)
> + kit->data->last_addr = next_end;
> + else
> + kit->data->last_addr += PAGE_SIZE;
> + goto retry;
> + }
> +
> + if (unlikely(kit->data->last_addr >= vma->vm_end)) {
> + kit->data->last_addr = vma->vm_end;
> + vma_end_read(vma);
> + goto retry;
> + }
nit: maybe we can move this next vma lookup (the retry block) into a
separate function; this code looks a bit intimidating inside _vma_next().
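Something like the below, perhaps (untested, helper name made up, just
to show the shape):

	static struct vm_area_struct *
	task_vma_lock_next(struct bpf_iter_task_vma_kern_data *data)
	{
		struct vm_area_struct *vma;
		struct vma_iterator vmi;
		unsigned long next_addr, next_end;

	retry:
		rcu_read_lock();
		vma_iter_init(&vmi, data->mm, data->last_addr);
		vma = vma_next(&vmi);
		if (!vma) {
			rcu_read_unlock();
			return NULL;
		}
		next_addr = vma->vm_start;
		next_end = vma->vm_end;
		rcu_read_unlock();

		vma = lock_vma_under_rcu(data->mm, next_addr);
		if (!vma) {
			/* removed, shrunk, or write-locked: skip it */
			if (next_end > data->last_addr)
				data->last_addr = next_end;
			else
				data->last_addr += PAGE_SIZE;
			goto retry;
		}

		if (unlikely(data->last_addr >= vma->vm_end)) {
			/* recycled VMA we have already covered */
			data->last_addr = vma->vm_end;
			vma_end_read(vma);
			goto retry;
		}

		return vma;
	}

and then _next() just does the bookkeeping around it.
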
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
> +
> + kit->data->locked_vma = vma;
> + kit->data->last_addr = vma->vm_end;
> + return vma;
> }
>
> __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -938,7 +971,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> struct bpf_iter_task_vma_kern *kit = (void *)it;
>
> if (kit->data) {
> - bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> + if (kit->data->locked_vma)
> + vma_end_read(kit->data->locked_vma);
> bpf_iter_mmput(kit->data->mm);
> put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> --
> 2.47.3