public inbox for bpf@vger.kernel.org
From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Puranjay Mohan <puranjay@kernel.org>, bpf@vger.kernel.org
Cc: Puranjay Mohan <puranjay@kernel.org>,
	Puranjay Mohan <puranjay12@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <martin.lau@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Kumar Kartikeya Dwivedi <memxor@gmail.com>,
	kernel-team@meta.com
Subject: Re: [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks
Date: Thu, 05 Mar 2026 18:47:58 +0000	[thread overview]
Message-ID: <87jyvqxgdd.fsf@gmail.com> (raw)
In-Reply-To: <20260304142026.1443666-3-puranjay@kernel.org>

Puranjay Mohan <puranjay@kernel.org> writes:

> The open-coded task_vma BPF iterator holds mmap_lock for the entire
> duration of the iteration, increasing contention on this highly
> contended lock.
>
> Switch to per-VMA locking. In _next(), the next VMA is found via an
> RCU-protected maple tree walk, then locked with lock_vma_under_rcu()
> at its vm_start address. lock_next_vma() is not used because its
> fallback path takes mmap_read_lock(), and the iterator must work in
> non-sleepable contexts.
>
> Between the RCU walk and the lock attempt, the VMA may be removed,
> shrunk, or write-locked. When lock_vma_under_rcu() fails or the locked
> VMA was modified, the iterator advances past it and retries using vm_end
> saved from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU,
> individual objects can be freed and immediately reused within an RCU
> critical section. A VMA found by the maple tree walk may be recycled for
> a different mm before its fields are read, making the captured vm_end
> stale. When vm_end is stale and no longer ahead of the iteration
> position, the iterator falls back to PAGE_SIZE advancement to guarantee
> forward progress. VMAs inserted in gaps between iterations cannot be
> detected without mmap_lock speculation.
>
> The mm_struct is kept alive with mmget()/bpf_iter_mmput(). The
> bpf_mmap_unlock_get_irq_work() check is no longer needed since
> mmap_lock is no longer held; bpf_iter_mmput_busy() remains to guard the
> mmput irq_work slot. CONFIG_PER_VMA_LOCK is required; -EOPNOTSUPP is
> returned without it.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>  kernel/bpf/task_iter.c | 72 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 53 insertions(+), 19 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index d3fa8ba0a896..ff29d4da0267 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -9,6 +9,7 @@
>  #include <linux/bpf_mem_alloc.h>
>  #include <linux/btf_ids.h>
>  #include <linux/mm_types.h>
> +#include <linux/mmap_lock.h>
>  #include "mmap_unlock_work.h"
>  
>  static const char * const iter_task_type_names[] = {
> @@ -797,8 +798,8 @@ const struct bpf_func_proto bpf_find_vma_proto = {
>  struct bpf_iter_task_vma_kern_data {
>  	struct task_struct *task;
>  	struct mm_struct *mm;
> -	struct mmap_unlock_irq_work *work;
> -	struct vma_iterator vmi;
> +	struct vm_area_struct *locked_vma;
> +	u64 last_addr;
>  };
>  
>  struct bpf_iter_task_vma {
> @@ -868,12 +869,16 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  				      struct task_struct *task, u64 addr)
>  {
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
> -	bool irq_work_busy = false;
>  	int err;
>  
>  	BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma));
>  	BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma));
>  
> +	if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) {
> +		kit->data = NULL;
> +		return -EOPNOTSUPP;
> +	}
> +
>  	/* is_iter_reg_valid_uninit guarantees that kit hasn't been initialized
>  	 * before, so non-NULL kit->data doesn't point to previously
>  	 * bpf_mem_alloc'd bpf_iter_task_vma_kern_data
> @@ -890,13 +895,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  	}
>  
>  	/*
> -	 * Check irq_work availability for both mmap_lock release and mmput.
> -	 * Both use separate per-CPU irq_work slots, and both must be free
> -	 * to guarantee _destroy() can complete from NMI context.
> -	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> +	 * Ensure the mmput irq_work slot is available so _destroy() can
> +	 * safely drop the mm reference from NMI context.
>  	 */
> -	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> -	if (irq_work_busy || bpf_iter_mmput_busy()) {
> +	if (bpf_iter_mmput_busy()) {
>  		err = -EBUSY;
>  		goto err_cleanup_iter;
>  	}
> @@ -906,16 +908,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  		goto err_cleanup_iter;
>  	}
>  
> -	if (!mmap_read_trylock(kit->data->mm)) {
> -		err = -EBUSY;
> -		goto err_cleanup_mmget;
> -	}
> -
> -	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
> +	kit->data->locked_vma = NULL;
> +	kit->data->last_addr = addr;
>  	return 0;
>  
> -err_cleanup_mmget:
> -	bpf_iter_mmput(kit->data->mm);
>  err_cleanup_iter:
>  	put_task_struct(kit->data->task);
>  	bpf_mem_free(&bpf_global_ma, kit->data);
> @@ -927,10 +923,47 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it)
>  {
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
> +	struct vm_area_struct *vma;
> +	struct vma_iterator vmi;
> +	unsigned long next_addr, next_end;
>  
>  	if (!kit->data) /* bpf_iter_task_vma_new failed */
>  		return NULL;
> -	return vma_next(&kit->data->vmi);
> +
> +	if (kit->data->locked_vma)
> +		vma_end_read(kit->data->locked_vma);
> +
> +retry:
> +	rcu_read_lock();
> +	vma_iter_init(&vmi, kit->data->mm, kit->data->last_addr);
> +	vma = vma_next(&vmi);
> +	if (!vma) {
> +		rcu_read_unlock();
> +		kit->data->locked_vma = NULL;
> +		return NULL;
> +	}
> +	next_addr = vma->vm_start;
> +	next_end = vma->vm_end;
> +	rcu_read_unlock();
> +
> +	vma = lock_vma_under_rcu(kit->data->mm, next_addr);
> +	if (!vma) {
> +		if (next_end > kit->data->last_addr)
> +			kit->data->last_addr = next_end;
> +		else
> +			kit->data->last_addr += PAGE_SIZE;
> +		goto retry;
> +	}
> +
> +	if (unlikely(kit->data->last_addr >= vma->vm_end)) {
> +		kit->data->last_addr = vma->vm_end;
> +		vma_end_read(vma);
> +		goto retry;
> +	}
nit: maybe we can move this next-VMA lookup (the retry block) into a
separate function; this code looks a bit intimidating inline in _vma_next().
Acked-by: Mykyta Yatsenko <yatsenko@meta.com>
> +
> +	kit->data->locked_vma = vma;
> +	kit->data->last_addr = vma->vm_end;
> +	return vma;
>  }
>  
>  __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
> @@ -938,7 +971,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>  	struct bpf_iter_task_vma_kern *kit = (void *)it;
>  
>  	if (kit->data) {
> -		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> +		if (kit->data->locked_vma)
> +			vma_end_read(kit->data->locked_vma);
>  		bpf_iter_mmput(kit->data->mm);
>  		put_task_struct(kit->data->task);
>  		bpf_mem_free(&bpf_global_ma, kit->data);
> -- 
> 2.47.3


Thread overview: 13+ messages
2026-03-04 14:20 [PATCH bpf 0/3] bpf: fix and improve open-coded task_vma iterator Puranjay Mohan
2026-03-04 14:20 ` [PATCH bpf 1/3] bpf: fix mm lifecycle in " Puranjay Mohan
2026-03-05  8:55   ` kernel test robot
2026-03-05 11:58   ` kernel test robot
2026-03-05 16:34   ` Mykyta Yatsenko
2026-03-05 16:48     ` Puranjay Mohan
2026-03-05 17:36       ` Mykyta Yatsenko
2026-03-06  1:11   ` Alexei Starovoitov
2026-03-04 14:20 ` [PATCH bpf 2/3] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Puranjay Mohan
2026-03-05 18:47   ` Mykyta Yatsenko [this message]
2026-03-04 14:20 ` [PATCH bpf 3/3] bpf: return VMA snapshot from task_vma iterator Puranjay Mohan
2026-03-05 18:53   ` Mykyta Yatsenko
2026-03-05 19:03     ` Puranjay Mohan
