From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6A1C33F7ABC; Wed, 20 May 2026 18:05:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779300307; cv=none; b=Fa3PN2Obf33SAXkg5NTRBeqpN7V55dss1EJHyviGrWkrQTJx7EFtJKTFE7ErD9cw/LpaovvmQCUq8Tczb9ZagL7Haja/4cUawCDie8HakGDMWeW0Xuw86w+hrkps/Hdxvp5V3kSgYmyKsgG7OsZlBuw2+pSJ4Nyj3QN1+CwxDYI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779300307; c=relaxed/simple; bh=oH+pA7O+KNKKuZqgQ0n123jkE184oRTZJFYpzJ2qqh4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=hsbIKW2BBvPPP8AZxwk4RAQNwOzR8X8LeWl+5ejF7tjR/dTMu/2OLY0nb/IBmywE+HKWa7znbxPuKT80we8uCaJnR6kaNoJUqiGFwdNlrevMG4y7D7jwTEPVNvuLUSeaWOsnbY6dzSB3ntpk6vWz+KGGxdhJOGGgXZ6wXZgh87o= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b=oBjKiMFH; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linuxfoundation.org header.i=@linuxfoundation.org header.b="oBjKiMFH" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CF4301F000E9; Wed, 20 May 2026 18:05:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linuxfoundation.org; s=korg; t=1779300306; bh=0iUPToYUYx2EA/Fx8OHVhFvo4uHayj1OOTTu7Fxwx3Y=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=oBjKiMFHaHuqnYYYwM7V91o0qomnX/dMmC6mHUjulDqXk96hpxJIr/1zgkXdK2XiU b4p79JOeBCriwCEDMlH8iW+P+/NuC+mHRdOIq84hGJQm+ub5G2aYn3fusGWtmu16+c w7XhFAJccqqVkyYZHHMY6CmpWOtx43MbB6N4V/wc= From: Greg Kroah-Hartman To: stable@vger.kernel.org Cc: Greg Kroah-Hartman , patches@lists.linux.dev, Puranjay Mohan , Alexei Starovoitov , Sasha Levin Subject: [PATCH 6.12 083/666] bpf: switch task_vma iterator from mmap_lock to per-VMA locks Date: Wed, 20 May 2026 18:14:54 +0200 Message-ID: <20260520162113.029160438@linuxfoundation.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260520162111.222830634@linuxfoundation.org> References: <20260520162111.222830634@linuxfoundation.org> User-Agent: quilt/0.69 X-stable: review X-Patchwork-Hint: ignore Precedence: bulk X-Mailing-List: stable@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit 6.12-stable review patch. If anyone has any objections, please let me know. ------------------ From: Puranjay Mohan [ Upstream commit bee9ef4a40a277bf401be43d39ba7f7f063cf39c ] The open-coded task_vma iterator holds mmap_lock for the entire duration of iteration, increasing contention on this highly contended lock. Switch to per-VMA locking. Find the next VMA via an RCU-protected maple tree walk and lock it with lock_vma_under_rcu(). lock_next_vma() is not used because its fallback takes mmap_read_lock(), and the iterator must work in non-sleepable contexts. lock_vma_under_rcu() is a point lookup (mas_walk) that finds the VMA containing a given address but cannot iterate across gaps. An RCU-protected vma_next() walk (mas_find) first locates the next VMA's vm_start to pass to lock_vma_under_rcu(). Between the RCU walk and the lock, the VMA may be removed, shrunk, or write-locked. On failure, advance past it using vm_end from the RCU walk. Because the VMA slab is SLAB_TYPESAFE_BY_RCU, vm_end may be stale; fall back to PAGE_SIZE advancement when it does not make forward progress. Concurrent VMA insertions at addresses already passed by the iterator are not detected. CONFIG_PER_VMA_LOCK is required; return -EOPNOTSUPP without it. Signed-off-by: Puranjay Mohan Link: https://lore.kernel.org/r/20260408154539.3832150-3-puranjay@kernel.org Signed-off-by: Alexei Starovoitov Stable-dep-of: 4cbee026db54 ("bpf: return VMA snapshot from task_vma iterator") Signed-off-by: Sasha Levin --- kernel/bpf/task_iter.c | 91 +++++++++++++++++++++++++++++++++--------- 1 file changed, 73 insertions(+), 18 deletions(-) diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c index c37ae44bd0a53..aee03c55602a0 100644 --- a/kernel/bpf/task_iter.c +++ b/kernel/bpf/task_iter.c @@ -10,6 +10,7 @@ #include #include #include +#include #include #include "mmap_unlock_work.h" @@ -811,8 +812,8 @@ static inline void bpf_iter_mmput_async(struct mm_struct *mm) struct bpf_iter_task_vma_kern_data { struct task_struct *task; struct mm_struct *mm; - struct mmap_unlock_irq_work *work; - struct vma_iterator vmi; + struct vm_area_struct *locked_vma; + u64 next_addr; }; struct bpf_iter_task_vma { @@ -833,21 +834,19 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, struct task_struct *task, u64 addr) { struct bpf_iter_task_vma_kern *kit = (void *)it; - bool irq_work_busy = false; int err; BUILD_BUG_ON(sizeof(struct bpf_iter_task_vma_kern) != sizeof(struct bpf_iter_task_vma)); BUILD_BUG_ON(__alignof__(struct bpf_iter_task_vma_kern) != __alignof__(struct bpf_iter_task_vma)); - /* bpf_iter_mmput_async() needs mmput_async() which requires CONFIG_MMU */ - if (!IS_ENABLED(CONFIG_MMU)) { + if (!IS_ENABLED(CONFIG_PER_VMA_LOCK)) { kit->data = NULL; return -EOPNOTSUPP; } /* * Reject irqs-disabled contexts including NMI. Operations used - * by _next() and _destroy() (mmap_read_unlock, bpf_iter_mmput_async) + * by _next() and _destroy() (vma_end_read, bpf_iter_mmput_async) * can take spinlocks with IRQs disabled (pi_lock, pool->lock). * Running from NMI or from a tracepoint that fires with those * locks held could deadlock. @@ -890,18 +889,10 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, goto err_cleanup_iter; } - /* kit->data->work == NULL is valid after bpf_mmap_unlock_get_irq_work */ - irq_work_busy = bpf_mmap_unlock_get_irq_work(&kit->data->work); - if (irq_work_busy || !mmap_read_trylock(kit->data->mm)) { - err = -EBUSY; - goto err_cleanup_mmget; - } - - vma_iter_init(&kit->data->vmi, kit->data->mm, addr); + kit->data->locked_vma = NULL; + kit->data->next_addr = addr; return 0; -err_cleanup_mmget: - bpf_iter_mmput_async(kit->data->mm); err_cleanup_iter: put_task_struct(kit->data->task); bpf_mem_free(&bpf_global_ma, kit->data); @@ -910,13 +901,76 @@ __bpf_kfunc int bpf_iter_task_vma_new(struct bpf_iter_task_vma *it, return err; } +/* + * Find and lock the next VMA at or after data->next_addr. + * + * lock_vma_under_rcu() is a point lookup (mas_walk): it finds the VMA + * containing a given address but cannot iterate. An RCU-protected + * maple tree walk with vma_next() (mas_find) is needed first to locate + * the next VMA's vm_start across any gap. + * + * Between the RCU walk and the lock, the VMA may be removed, shrunk, + * or write-locked. On failure, advance past it using vm_end from the + * RCU walk. SLAB_TYPESAFE_BY_RCU can make vm_end stale, so fall back + * to PAGE_SIZE advancement to guarantee forward progress. + */ +static struct vm_area_struct * +bpf_iter_task_vma_find_next(struct bpf_iter_task_vma_kern_data *data) +{ + struct vm_area_struct *vma; + struct vma_iterator vmi; + unsigned long start, end; + +retry: + rcu_read_lock(); + vma_iter_init(&vmi, data->mm, data->next_addr); + vma = vma_next(&vmi); + if (!vma) { + rcu_read_unlock(); + return NULL; + } + start = vma->vm_start; + end = vma->vm_end; + rcu_read_unlock(); + + vma = lock_vma_under_rcu(data->mm, start); + if (!vma) { + if (end <= data->next_addr) + data->next_addr += PAGE_SIZE; + else + data->next_addr = end; + goto retry; + } + + if (unlikely(vma->vm_end <= data->next_addr)) { + data->next_addr += PAGE_SIZE; + vma_end_read(vma); + goto retry; + } + + return vma; +} + __bpf_kfunc struct vm_area_struct *bpf_iter_task_vma_next(struct bpf_iter_task_vma *it) { struct bpf_iter_task_vma_kern *kit = (void *)it; + struct vm_area_struct *vma; if (!kit->data) /* bpf_iter_task_vma_new failed */ return NULL; - return vma_next(&kit->data->vmi); + + if (kit->data->locked_vma) { + vma_end_read(kit->data->locked_vma); + kit->data->locked_vma = NULL; + } + + vma = bpf_iter_task_vma_find_next(kit->data); + if (!vma) + return NULL; + + kit->data->locked_vma = vma; + kit->data->next_addr = vma->vm_end; + return vma; } __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) @@ -924,7 +978,8 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it) struct bpf_iter_task_vma_kern *kit = (void *)it; if (kit->data) { - bpf_mmap_unlock_mm(kit->data->work, kit->data->mm); + if (kit->data->locked_vma) + vma_end_read(kit->data->locked_vma); put_task_struct(kit->data->task); bpf_iter_mmput_async(kit->data->mm); bpf_mem_free(&bpf_global_ma, kit->data); -- 2.53.0