From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Puranjay Mohan <puranjay@kernel.org>, bpf@vger.kernel.org
Cc: Puranjay Mohan <puranjay@kernel.org>,
Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
kernel-team@meta.com
Subject: Re: [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
Date: Thu, 05 Mar 2026 18:47:58 +0000 [thread overview]
Message-ID: <87jyvqxgdd.fsf@gmail.com> (raw)
In-Reply-To: <20260304142026.1443666-3-puranjay@kernel.org>

Puranjay Mohan <puranjay@kernel.org> writes:
> The open-coded task_vma BPF iterator holds mmap_lock for the entire
> duration of the iteration, increasing contention on this highly
> contended lock.
>
> Switch to per-VMA locking. In _next(), the next VMA is found via an
> RCU-protected maple tree walk, then locked with lock_vma_under_rcu()
> at its vm_start address. lock_next_vma() is not used because its
> fallback path takes mmap_read_lock(), and the iterator must work in
> non-sleepable contexts.
>
> Between the RCU walk and the lock attempt, the VMA may be removed,
> shrunk, or write-locked. When lock_vma_under_rcu() fails or the locked
> VMA was modified, the iterator advances past it and retries using vm_end
> saved from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU,
> individual objects can be freed and immediately reused within an RCU
> critical section. A VMA found by the maple tree walk may be recycled for
> a different mm before its fields are read, making the captured vm_end
> stale. When vm_end is stale and no longer ahead of the iteration
> position, the iterator falls back to PAGE_SIZE advancement to guarantee
> forward progress. VMAs inserted in gaps between iterations cannot be
> detected without mmap_lock speculation.
>
> The mm_struct is kept alive with mmget()/bpf_iter_mmput(). The
> bpf_mmap_unlock_get_irq_work() check is no longer needed since
> mmap_lock is no longer held; bpf_iter_mmput_busy() remains to guard the
> mmput irq_work slot. CONFIG_PER_VMA_LOCK is required; -EOPNOTSUPP is
> returned without it.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 72 +++++++++++++++++++++++++++++++-----------
> 1 file changed, 53 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index d3fa8ba0a896..ff29d4da0267 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
> #include <linux/bpf_mem_alloc.h>
> #include <linux/btf_ids.h>
> #include <linux/mm_types.h>
> +#include <linux/mmap_lock.h>
> #include "mmap_unlock_work.h"
>
> static const char * const iter_task_type_names[] = {
> @@ -797,8 +798,8 @@ const struct bpf_func_proto bpf_find_vma_proto = {
> struct bpf_iter_task_vma_kern_data {
> struct task_struct *task;
> struct mm_struct *mm;
> - struct mmap_unlock_irq_work *work;
> - struct vma_iterator vmi;
> + struct vm_area_struct *locked_vma;
> + u64 last_addr;
> };
>
> struct bpf_iter_task_vma {
> @@ -868,12 +869,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> struct task_struct *task, u64 addr)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> - bool irq_work_busy = false;
> int err;
>
> BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
> BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>
> + if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> + kit->data = NULL;
> + return -EOPNOTSUPP;
> + }
> +
> /* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
> * before, so non-NULL kit->data doesn't point to previously
> * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> @@ -890,13 +895,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> }
>
> /*
> - * Check irq_work availability for both mmap_lock release and mmput.
> - * Both use separate per-CPU irq_work slots, and both must be free
> - * to guarantee _destroy() can complete from NMI context.
> - * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> + * Ensure the mmput irq_work slot is available so _destroy() can
> + * safely drop the mm reference from NMI context.
> */
> - irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> - if (irq_work_busy || bpf_iter_mmput_busy()) {
> + if (bpf_iter_mmput_busy()) {
> err = -EBUSY;
> goto err_cleanup_iter;
> }
> @@ -906,16 +908,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> goto err_cleanup_iter;
> }
>
> - if (!mmap_read_trylock(kit->data->mm)) {
> - err = -EBUSY;
> - goto err_cleanup_mmget;
> - }
> -
> - vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> + kit->data->locked_vma = NULL;
> + kit->data->last_addr = addr;
> return 0;
>
> -err_cleanup_mmget:
> - bpf_iter_mmput(kit->data->mm);
> err_cleanup_iter:
> put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> @@ -927,10 +923,47 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
> {
> struct bpf_iter_task_vma_kern *kit = (void *)it;
> + struct vm_area_struct *vma;
> + struct vma_iterator vmi;
> + unsigned long next_addr, next_end;
>
> if (!kit->data) /* bpf_iter_task_vma_new failed */
> return NULL;
> - return vma_next(&kit->data->vmi);
> +
> + if (kit->data->locked_vma)
> + vma_end_read(kit->data->locked_vma);
> +
> +retry:
> + rcu_read_lock();
> + vma_iter_init(&vmi, kit->data->mm, kit->data->last_addr);
> + vma = vma_next(&vmi);
> + if (!vma) {
> + rcu_read_unlock();
> + kit->data->locked_vma = NULL;
> + return NULL;
> + }
> + next_addr = vma->vm_start;
> + next_end = vma->vm_end;
> + rcu_read_unlock();
> +
> + vma = lock_vma_under_rcu(kit->data->mm, next_addr);
> + if (!vma) {
> + if (next_end > kit->data->last_addr)
> + kit->data->last_addr = next_end;
> + else
> + kit->data->last_addr += PAGE_SIZE;
> + goto retry;
> + }
> +
> + if (unlikely(kit->data->last_addr >= vma->vm_end)) {
> + kit->data->last_addr = vma->vm_end;
> + vma_end_read(vma);
> + goto retry;
> + }
nit: maybe we can move this next-VMA lookup (the retry block) into a
separate function; this code looks a bit intimidating in _vma_next().
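Roughly something like this (untested pseudo-code, purely a sketch of
the shape, and the helper name is just a suggestion):

```
/* Sketch: find and lock the next VMA at or after data->last_addr,
 * skipping VMAs that disappear or change between the RCU walk and
 * the lock attempt. Body is the retry block from above, moved as-is.
 */
static struct vm_area_struct *
task_vma_lock_next_vma(struct bpf_iter_task_vma_kern_data *data)
{
	struct vm_area_struct *vma;
	struct vma_iterator vmi;
	unsigned long next_addr, next_end;

retry:
	rcu_read_lock();
	vma_iter_init(&vmi, data->mm, data->last_addr);
	vma = vma_next(&vmi);
	if (!vma) {
		rcu_read_unlock();
		return NULL;
	}
	next_addr = vma->vm_start;
	next_end = vma->vm_end;
	rcu_read_unlock();

	vma = lock_vma_under_rcu(data->mm, next_addr);
	if (!vma) {
		data->last_addr = next_end > data->last_addr ?
				  next_end : data->last_addr + PAGE_SIZE;
		goto retry;
	}
	if (unlikely(data->last_addr >= vma->vm_end)) {
		data->last_addr = vma->vm_end;
		vma_end_read(vma);
		goto retry;
	}
	return vma;
}
```

Then _vma_next() is left with just the state bookkeeping around the
helper call.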
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
> +
> + kit->data->locked_vma = vma;
> + kit->data->last_addr = vma->vm_end;
> + return vma;
> }
>
> __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -938,7 +971,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> struct bpf_iter_task_vma_kern *kit = (void *)it;
>
> if (kit->data) {
> - bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> + if (kit->data->locked_vma)
> + vma_end_read(kit->data->locked_vma);
> bpf_iter_mmput(kit->data->mm);
> put_task_struct(kit->data->task);
> bpf_mem_free(&bpf_global_ma, kit->data);
> --
> 2.47.3
Thread overview: 13+ messages
2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
2026-03-05 8:55 ` kernel test robot
2026-03-05 11:58 ` kernel test robot
2026-03-05 16:34 ` Mykyta Yatsenko
2026-03-05 16:48 ` Puranjay Mohan
2026-03-05 17:36 ` Mykyta Yatsenko
2026-03-06 1:11 ` Alexei Starovoitov
2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
2026-03-05 18:47 ` Mykyta Yatsenko [this message]
2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2026-03-05 18:53 ` Mykyta Yatsenko
2026-03-05 19:03 ` Puranjay Mohan