From: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com>
To: Puranjay Mohan <puranjay@kernel.org>, bpf@vger.kernel.org
Cc: Puranjay Mohan <puranjay@kernel.org>,
Puranjay Mohan <puranjay12@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Martin KaFai Lau <martin.lau@kernel.org>,
Eduard Zingerman <eddyz87@gmail.com>,
Kumar Kartikeya Dwivedi <memxor@gmail.com>,
kernel-team@meta.com
Subject: Re: [PATCH bpf 1/3] bpf: fix mm lifecycle in open-coded task_vma iterator
Date: Thu, 05 Mar 2026 16:34:30 +0000
Message-ID: <87pl5ixmjt.fsf@gmail.com>
In-Reply-To: <20260304142026.1443666-2-puranjay@kernel.org>
Puranjay Mohan <puranjay@kernel.org> writes:
> The open-coded task_vma BPF iterator reads task->mm and acquires
> mmap_read_trylock() but never calls mmget(). This violates refcount
> discipline: the mm can reach mm_users == 0 if the task exits while the
> iterator holds the lock.
>
> Add mmget_not_zero() before mmap_read_trylock(). On the error path
> after mmget succeeds, the mm reference must be dropped. mmput() can
> sleep (exit_mmap, etc.) so it is unsuitable from BPF context.
> mmput_async() is safe from hardirq but not from NMI, because
> schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> which can deadlock if the NMI interrupted code holding that lock.
>
> Add a dedicated per-CPU irq_work (bpf_iter_mmput_work) and a helper
> bpf_iter_mmput() that calls mmput_async() directly when not in NMI,
> or defers to the irq_work callback when in NMI context. Use it in
> both the _new() error path and _destroy(). Add bpf_iter_mmput_busy()
> to check irq_work slot availability, and use it alongside
> bpf_mmap_unlock_get_irq_work() in _new() to verify both slots are
> free before acquiring references.
>
> Fixes: 4ac454682158 ("bpf: Introduce task_vma open-coded iterator kfuncs")
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
> kernel/bpf/task_iter.c | 77 +++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 73 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
> index 98d9b4c0daff..d3fa8ba0a896 100644
> --- a/kernel/bpf/task_iter.c
> +++ b/kernel/bpf/task_iter.c
> @@ -813,6 +813,55 @@ struct bpf_iter_task_vma_kern {
>  	struct bpf_iter_task_vma_kern_data *data;
>  } __attribute__((aligned(8)));
>
> +/*
> + * Per-CPU irq_work for NMI-safe mmput.
> + *
> + * mmput_async() is safe from hardirq context but not from NMI, because
> + * schedule_work() -> queue_work_on() takes raw_spin_lock(&pool->lock)
> + * which can deadlock if the NMI interrupted code holding that lock.
> + *
> + * This dedicated irq_work defers mmput to hardirq context where
> + * mmput_async() is safe. BPF programs are non-preemptible, so one
> + * slot per CPU is sufficient.
> + */
> +struct bpf_iter_mmput_irq_work {
> +	struct irq_work irq_work;
> +	struct mm_struct *mm;
> +};
struct mmap_unlock_irq_work is exactly the same struct. Perhaps an
additional patch could rename it to something like
struct bpf_iter_mm_irq_work, so that it can be reused for mmput.
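To illustrate the idea, here is a rough userspace mock (not kernel code,
untested against the tree). struct irq_work and struct mm_struct are
stand-ins, bpf_iter_mm_irq_work is only the name suggested above, the two
instances are plain statics rather than per-CPU variables, and the
callbacks flip fields instead of calling the real unlock and
mmput_async() paths:

```c
#include <stddef.h>

/* Stand-ins for the kernel types; assumptions for illustration only. */
struct mm_struct { int users; int locked; };
struct irq_work { void (*func)(struct irq_work *); };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* One shared type (hypothetical name) instead of two identical structs. */
struct bpf_iter_mm_irq_work {
	struct irq_work irq_work;
	struct mm_struct *mm;
};

/* Mock of the existing unlock callback: releases the "lock". */
static void do_mmap_read_unlock(struct irq_work *entry)
{
	struct bpf_iter_mm_irq_work *work;

	work = container_of(entry, struct bpf_iter_mm_irq_work, irq_work);
	work->mm->locked = 0;
}

/* Mock of the new mmput callback: drops one reference, clears the slot. */
static void do_bpf_iter_mmput(struct irq_work *entry)
{
	struct bpf_iter_mm_irq_work *work;

	work = container_of(entry, struct bpf_iter_mm_irq_work, irq_work);
	if (work->mm) {
		work->mm->users--;
		work->mm = NULL;
	}
}

/* Two slots of the same type; only the callback differs. */
static struct bpf_iter_mm_irq_work mmap_unlock_work = {
	.irq_work = { .func = do_mmap_read_unlock },
};
static struct bpf_iter_mm_irq_work bpf_iter_mmput_work = {
	.irq_work = { .func = do_bpf_iter_mmput },
};
```

The only point is that both slots share one definition and differ in the
callback passed at init time, so the second struct would not be needed.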
> +
> +static DEFINE_PER_CPU(struct bpf_iter_mmput_irq_work, bpf_iter_mmput_work);
> +
> +static void do_bpf_iter_mmput(struct irq_work *entry)
> +{
> +	struct bpf_iter_mmput_irq_work *work;
> +
> +	work = container_of(entry, struct bpf_iter_mmput_irq_work, irq_work);
> +	if (work->mm) {
> +		mmput_async(work->mm);
> +		work->mm = NULL;
> +	}
> +}
> +
> +static void bpf_iter_mmput(struct mm_struct *mm)
> +{
> +	if (!in_nmi()) {
> +		mmput_async(mm);
> +	} else {
> +		struct bpf_iter_mmput_irq_work *work;
> +
> +		work = this_cpu_ptr(&bpf_iter_mmput_work);
> +		work->mm = mm;
> +		irq_work_queue(&work->irq_work);
> +	}
> +}
> +
> +static bool bpf_iter_mmput_busy(void)
> +{
> +	if (!in_nmi())
> +		return false;
> +	return irq_work_is_busy(&this_cpu_ptr(&bpf_iter_mmput_work)->irq_work);
> +}
> +
> __bpf_kfunc_start_defs();
>
> __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
> @@ -840,19 +889,35 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it,
>  		goto err_cleanup_iter;
>  	}
>
> -	/* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */
> +	/*
> +	 * Check irq_work availability for both mmap_lock release and mmput.
> +	 * Both use separate per-CPU irq_work slots, and both must be free
> +	 * to guarantee _destroy() can complete from NMI context.
> +	 * kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work
> +	 */
>  	irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work);
> -	if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) {
> +	if (irq_work_busy || bpf_iter_mmput_busy()) {
>  		err = -EBUSY;
>  		goto err_cleanup_iter;
>  	}
>
> +	if (!mmget_not_zero(kit->data->mm)) {
> +		err = -ENOENT;
> +		goto err_cleanup_iter;
> +	}
> +
> +	if (!mmap_read_trylock(kit->data->mm)) {
> +		err = -EBUSY;
> +		goto err_cleanup_mmget;
> +	}
> +
>  	vma_iter_init(&kit->data->vmi, kit->data->mm, addr);
>  	return 0;
>
> +err_cleanup_mmget:
> +	bpf_iter_mmput(kit->data->mm);
>  err_cleanup_iter:
> -	if (kit->data->task)
> -		put_task_struct(kit->data->task);
> +	put_task_struct(kit->data->task);
>  	bpf_mem_free(&bpf_global_ma, kit->data);
>  	/* NULL kit->data signals failed bpf_iter_task_vma initialization */
>  	kit->data = NULL;
> @@ -874,6 +939,7 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
>
>  	if (kit->data) {
>  		bpf_mmap_unlock_mm(kit->data->work, kit->data->mm);
> +		bpf_iter_mmput(kit->data->mm);
>  		put_task_struct(kit->data->task);
>  		bpf_mem_free(&bpf_global_ma, kit->data);
>  	}
> @@ -1044,12 +1110,15 @@ static void do_mmap_read_unlock(struct irq_work *entry)
>
>  static int __init task_iter_init(void)
>  {
> +	struct bpf_iter_mmput_irq_work *mmput_work;
>  	struct mmap_unlock_irq_work *work;
>  	int ret, cpu;
>
>  	for_each_possible_cpu(cpu) {
>  		work = per_cpu_ptr(&mmap_unlock_work, cpu);
>  		init_irq_work(&work->irq_work, do_mmap_read_unlock);
> +		mmput_work = per_cpu_ptr(&bpf_iter_mmput_work, cpu);
> +		init_irq_work(&mmput_work->irq_work, do_bpf_iter_mmput);
>  	}
>
>  	task_reg_info.ctx_arg_info[0].btf_id = btf_tracing_ids[BTF_TRACING_TYPE_TASK];
> --
> 2.47.3