* [PATCH 1/6] mm: Make per-VMA locks available universally
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
@ 2026-04-29 18:19 ` Dave Hansen
2026-05-08 10:12 ` David Hildenbrand (Arm)
2026-04-29 18:19 ` [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock Dave Hansen
` (8 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:19 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
The per-VMA locks have been around for several years. They've had some
bugs worked out of them and have seen quite wide use. However, they
are still only available when architectures explicitly enable them.
Remove the conditional compilation around the per-VMA locks, making
them available on all architectures and configs.
The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
when the architecture started using per-VMA locks in the fault
handler. But, contrary to the naming, the Kconfig option does not
really indicate whether the architecture supports per-VMA locks or
not. It is more of a marker for whether the architecture is likely to
benefit from per-VMA locks.
To me, the most important side-effect of universal availability
is letting per-VMA locks be used in SMP=n configs. This lets us use
per-VMA locking in all x86 code without fallbacks.
Overall, this just generally makes the kernel simpler. Just look at
the diffstat. It also opens the door to users that want to use the
per-VMA locks in common code. Doing *that* can bring additional
simplifications.
The downside of this is adding some fields to vm_area_struct and
mm_struct. I suspect there are some very simple ways to implement the
per-VMA locks that don't require any additional fields, especially if
such an approach was limited to SMP=n configs*. For now, do the
simplest thing: use the same implementation everywhere.
* For example, since SMP=n configs don't care much about scalability or
false sharing, there could be a single, global VMA seqcount that is
bumped when any VMA is modified instead of having space in each VMA
for a seqcount.
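(To illustrate that idea only - it is not something this series
implements, and every name below is invented. VMA writers are already
serialized by mmap_lock held for write, so a plain seqcount_t would do
for an SMP=n-only sketch:

	/* Hypothetical SMP=n-only sketch; names made up for illustration. */
	#include <linux/seqlock.h>

	static seqcount_t any_vma_seq = SEQCNT_ZERO(any_vma_seq);

	/* Bump the single, global count around *any* VMA modification: */
	static inline void any_vma_modification_begin(void)
	{
		write_seqcount_begin(&any_vma_seq);
	}

	static inline void any_vma_modification_end(void)
	{
		write_seqcount_end(&any_vma_seq);
	}

	/* Readers: sample once, do the lockless walk, then recheck: */
	static inline unsigned int any_vma_read_begin(void)
	{
		return read_seqcount_begin(&any_vma_seq);
	}

	static inline bool any_vma_read_retry(unsigned int seq)
	{
		return read_seqcount_retry(&any_vma_seq, seq);
	}
)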
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/arch/arm/Kconfig | 1
b/arch/arm64/Kconfig | 1
b/arch/loongarch/Kconfig | 1
b/arch/powerpc/platforms/powernv/Kconfig | 1
b/arch/powerpc/platforms/pseries/Kconfig | 1
b/arch/riscv/Kconfig | 1
b/arch/s390/Kconfig | 1
b/arch/x86/Kconfig | 2 -
b/fs/proc/internal.h | 2 -
b/fs/proc/task_mmu.c | 51 -------------------------------
b/include/linux/mm.h | 12 -------
b/include/linux/mm_types.h | 7 ----
b/include/linux/mmap_lock.h | 48 -----------------------------
b/kernel/fork.c | 2 -
b/mm/Kconfig | 13 -------
b/mm/mmap_lock.c | 2 -
16 files changed, 1 insertion(+), 145 deletions(-)
diff -puN arch/arm64/Kconfig~unconditional-vma-locks arch/arm64/Kconfig
--- a/arch/arm64/Kconfig~unconditional-vma-locks 2026-04-29 11:18:47.795519653 -0700
+++ b/arch/arm64/Kconfig 2026-04-29 11:18:49.088569421 -0700
@@ -80,7 +80,6 @@ config ARM64
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
- select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SCHED_SMT
diff -puN arch/arm/Kconfig~unconditional-vma-locks arch/arm/Kconfig
--- a/arch/arm/Kconfig~unconditional-vma-locks 2026-04-29 11:18:47.915524272 -0700
+++ b/arch/arm/Kconfig 2026-04-29 11:18:49.088569421 -0700
@@ -41,7 +41,6 @@ config ARM
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_SUPPORTS_CFI
select ARCH_SUPPORTS_HUGETLBFS if ARM_LPAE
- select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_RT
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
diff -puN arch/loongarch/Kconfig~unconditional-vma-locks arch/loongarch/Kconfig
--- a/arch/loongarch/Kconfig~unconditional-vma-locks 2026-04-29 11:18:47.956525850 -0700
+++ b/arch/loongarch/Kconfig 2026-04-29 11:18:49.088569421 -0700
@@ -68,7 +68,6 @@ config LOONGARCH
select ARCH_SUPPORTS_LTO_CLANG_THIN
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
select ARCH_SUPPORTS_NUMA_BALANCING if NUMA
- select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SCHED_SMT if SMP
select ARCH_SUPPORTS_SCHED_MC if SMP
diff -puN arch/powerpc/platforms/powernv/Kconfig~unconditional-vma-locks arch/powerpc/platforms/powernv/Kconfig
--- a/arch/powerpc/platforms/powernv/Kconfig~unconditional-vma-locks 2026-04-29 11:18:47.969526350 -0700
+++ b/arch/powerpc/platforms/powernv/Kconfig 2026-04-29 11:18:49.089569460 -0700
@@ -17,7 +17,6 @@ config PPC_POWERNV
select PPC_DOORBELL
select MMU_NOTIFIER
select FORCE_SMP
- select ARCH_SUPPORTS_PER_VMA_LOCK
select PPC_RADIX_BROADCAST_TLBIE if PPC_RADIX_MMU
default y
diff -puN arch/powerpc/platforms/pseries/Kconfig~unconditional-vma-locks arch/powerpc/platforms/pseries/Kconfig
--- a/arch/powerpc/platforms/pseries/Kconfig~unconditional-vma-locks 2026-04-29 11:18:47.972526466 -0700
+++ b/arch/powerpc/platforms/pseries/Kconfig 2026-04-29 11:18:49.089569460 -0700
@@ -23,7 +23,6 @@ config PPC_PSERIES
select HOTPLUG_CPU
select FORCE_SMP
select SWIOTLB
- select ARCH_SUPPORTS_PER_VMA_LOCK
select PPC_RADIX_BROADCAST_TLBIE if PPC_RADIX_MMU
default y
diff -puN arch/riscv/Kconfig~unconditional-vma-locks arch/riscv/Kconfig
--- a/arch/riscv/Kconfig~unconditional-vma-locks 2026-04-29 11:18:48.060529854 -0700
+++ b/arch/riscv/Kconfig 2026-04-29 11:18:49.089569460 -0700
@@ -70,7 +70,6 @@ config RISCV
select ARCH_SUPPORTS_LTO_CLANG_THIN
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS if 64BIT && MMU
select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU
- select ARCH_SUPPORTS_PER_VMA_LOCK if MMU
select ARCH_SUPPORTS_RT
select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_SCHED_MC if SMP
diff -puN arch/s390/Kconfig~unconditional-vma-locks arch/s390/Kconfig
--- a/arch/s390/Kconfig~unconditional-vma-locks 2026-04-29 11:18:48.125532357 -0700
+++ b/arch/s390/Kconfig 2026-04-29 11:18:49.089569460 -0700
@@ -153,7 +153,6 @@ config S390
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
select ARCH_SUPPORTS_NUMA_BALANCING
select ARCH_SUPPORTS_PAGE_TABLE_CHECK
- select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_USE_SYM_ANNOTATIONS
diff -puN arch/x86/Kconfig~unconditional-vma-locks arch/x86/Kconfig
--- a/arch/x86/Kconfig~unconditional-vma-locks 2026-04-29 11:18:48.128532472 -0700
+++ b/arch/x86/Kconfig 2026-04-29 11:18:49.090569499 -0700
@@ -27,7 +27,6 @@ config X86_64
select ARCH_HAS_GIGANTIC_PAGE
select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
- select ARCH_SUPPORTS_PER_VMA_LOCK
select ARCH_SUPPORTS_HUGE_PFNMAP if TRANSPARENT_HUGEPAGE
select HAVE_ARCH_SOFT_DIRTY
select MODULES_USE_ELF_RELA
@@ -1885,7 +1884,6 @@ config X86_USER_SHADOW_STACK
bool "X86 userspace shadow stack"
depends on AS_WRUSS
depends on X86_64
- depends on PER_VMA_LOCK
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_USER_SHADOW_STACK
select X86_CET
diff -puN fs/proc/internal.h~unconditional-vma-locks fs/proc/internal.h
--- a/fs/proc/internal.h~unconditional-vma-locks 2026-04-29 11:18:48.305539283 -0700
+++ b/fs/proc/internal.h 2026-04-29 11:18:49.090569499 -0700
@@ -382,10 +382,8 @@ struct mem_size_stats;
struct proc_maps_locking_ctx {
struct mm_struct *mm;
-#ifdef CONFIG_PER_VMA_LOCK
bool mmap_locked;
struct vm_area_struct *locked_vma;
-#endif
};
struct proc_maps_private {
diff -puN fs/proc/task_mmu.c~unconditional-vma-locks fs/proc/task_mmu.c
--- a/fs/proc/task_mmu.c~unconditional-vma-locks 2026-04-29 11:18:48.346540861 -0700
+++ b/fs/proc/task_mmu.c 2026-04-29 11:18:49.090569499 -0700
@@ -130,8 +130,6 @@ static void release_task_mempolicy(struc
}
#endif
-#ifdef CONFIG_PER_VMA_LOCK
-
static void reset_lock_ctx(struct proc_maps_locking_ctx *lock_ctx)
{
lock_ctx->locked_vma = NULL;
@@ -213,33 +211,6 @@ static inline bool fallback_to_mmap_lock
return true;
}
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline bool lock_vma_range(struct seq_file *m,
- struct proc_maps_locking_ctx *lock_ctx)
-{
- return mmap_read_lock_killable(lock_ctx->mm) == 0;
-}
-
-static inline void unlock_vma_range(struct proc_maps_locking_ctx *lock_ctx)
-{
- mmap_read_unlock(lock_ctx->mm);
-}
-
-static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv,
- loff_t last_pos)
-{
- return vma_next(&priv->iter);
-}
-
-static inline bool fallback_to_mmap_lock(struct proc_maps_private *priv,
- loff_t pos)
-{
- return false;
-}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos)
{
struct proc_maps_private *priv = m->private;
@@ -527,8 +498,6 @@ static int pid_maps_open(struct inode *i
PROCMAP_QUERY_VMA_FLAGS \
)
-#ifdef CONFIG_PER_VMA_LOCK
-
static int query_vma_setup(struct proc_maps_locking_ctx *lock_ctx)
{
reset_lock_ctx(lock_ctx);
@@ -581,26 +550,6 @@ static struct vm_area_struct *query_vma_
return vma;
}
-#else /* CONFIG_PER_VMA_LOCK */
-
-static int query_vma_setup(struct proc_maps_locking_ctx *lock_ctx)
-{
- return mmap_read_lock_killable(lock_ctx->mm);
-}
-
-static void query_vma_teardown(struct proc_maps_locking_ctx *lock_ctx)
-{
- mmap_read_unlock(lock_ctx->mm);
-}
-
-static struct vm_area_struct *query_vma_find_by_addr(struct proc_maps_locking_ctx *lock_ctx,
- unsigned long addr)
-{
- return find_vma(lock_ctx->mm, addr);
-}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
static struct vm_area_struct *query_matching_vma(struct proc_maps_locking_ctx *lock_ctx,
unsigned long addr, u32 flags)
{
diff -puN include/linux/mmap_lock.h~unconditional-vma-locks include/linux/mmap_lock.h
--- a/include/linux/mmap_lock.h~unconditional-vma-locks 2026-04-29 11:18:48.700554487 -0700
+++ b/include/linux/mmap_lock.h 2026-04-29 11:18:49.091569537 -0700
@@ -76,8 +76,6 @@ static inline void mmap_assert_write_loc
rwsem_assert_held_write(&mm->mmap_lock);
}
-#ifdef CONFIG_PER_VMA_LOCK
-
#ifdef CONFIG_LOCKDEP
#define __vma_lockdep_map(vma) (&vma->vmlock_dep_map)
#else
@@ -484,52 +482,6 @@ struct vm_area_struct *lock_next_vma(str
struct vma_iterator *iter,
unsigned long address);
-#else /* CONFIG_PER_VMA_LOCK */
-
-static inline void mm_lock_seqcount_init(struct mm_struct *mm) {}
-static inline void mm_lock_seqcount_begin(struct mm_struct *mm) {}
-static inline void mm_lock_seqcount_end(struct mm_struct *mm) {}
-
-static inline bool mmap_lock_speculate_try_begin(struct mm_struct *mm, unsigned int *seq)
-{
- return false;
-}
-
-static inline bool mmap_lock_speculate_retry(struct mm_struct *mm, unsigned int seq)
-{
- return true;
-}
-static inline void vma_lock_init(struct vm_area_struct *vma, bool reset_refcnt) {}
-static inline void vma_end_read(struct vm_area_struct *vma) {}
-static inline void vma_start_write(struct vm_area_struct *vma) {}
-static inline __must_check
-int vma_start_write_killable(struct vm_area_struct *vma) { return 0; }
-static inline void vma_assert_write_locked(struct vm_area_struct *vma)
- { mmap_assert_write_locked(vma->vm_mm); }
-static inline void vma_assert_attached(struct vm_area_struct *vma) {}
-static inline void vma_assert_detached(struct vm_area_struct *vma) {}
-static inline void vma_mark_attached(struct vm_area_struct *vma) {}
-static inline void vma_mark_detached(struct vm_area_struct *vma) {}
-
-static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
- unsigned long address)
-{
- return NULL;
-}
-
-static inline void vma_assert_locked(struct vm_area_struct *vma)
-{
- mmap_assert_locked(vma->vm_mm);
-}
-
-static inline void vma_assert_stabilised(struct vm_area_struct *vma)
-{
- /* If no VMA locks, then either mmap lock suffices to stabilise. */
- mmap_assert_locked(vma->vm_mm);
-}
-
-#endif /* CONFIG_PER_VMA_LOCK */
-
static inline void mmap_write_lock(struct mm_struct *mm)
{
__mmap_lock_trace_start_locking(mm, true);
diff -puN include/linux/mm.h~unconditional-vma-locks include/linux/mm.h
--- a/include/linux/mm.h~unconditional-vma-locks 2026-04-29 11:18:48.714555026 -0700
+++ b/include/linux/mm.h 2026-04-29 11:18:49.091569537 -0700
@@ -890,7 +890,6 @@ static inline void vma_numab_state_free(
* These must be here rather than mmap_lock.h as dependent on vm_fault type,
* declared in this header.
*/
-#ifdef CONFIG_PER_VMA_LOCK
static inline void release_fault_lock(struct vm_fault *vmf)
{
if (vmf->flags & FAULT_FLAG_VMA_LOCK)
@@ -906,17 +905,6 @@ static inline void assert_fault_locked(c
else
mmap_assert_locked(vmf->vma->vm_mm);
}
-#else
-static inline void release_fault_lock(struct vm_fault *vmf)
-{
- mmap_read_unlock(vmf->vma->vm_mm);
-}
-
-static inline void assert_fault_locked(const struct vm_fault *vmf)
-{
- mmap_assert_locked(vmf->vma->vm_mm);
-}
-#endif /* CONFIG_PER_VMA_LOCK */
static inline bool mm_flags_test(int flag, const struct mm_struct *mm)
{
diff -puN include/linux/mm_types.h~unconditional-vma-locks include/linux/mm_types.h
--- a/include/linux/mm_types.h~unconditional-vma-locks 2026-04-29 11:18:48.761556836 -0700
+++ b/include/linux/mm_types.h 2026-04-29 11:18:49.092569576 -0700
@@ -959,7 +959,6 @@ struct vm_area_struct {
vma_flags_t flags;
};
-#ifdef CONFIG_PER_VMA_LOCK
/*
* Can only be written (using WRITE_ONCE()) while holding both:
* - mmap_lock (in write mode)
@@ -975,7 +974,7 @@ struct vm_area_struct {
* slowpath.
*/
unsigned int vm_lock_seq;
-#endif
+
/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages. A MAP_SHARED vma
@@ -1007,7 +1006,6 @@ struct vm_area_struct {
#ifdef CONFIG_NUMA_BALANCING
struct vma_numab_state *numab_state; /* NUMA Balancing state */
#endif
-#ifdef CONFIG_PER_VMA_LOCK
/*
* Used to keep track of firstly, whether the VMA is attached, secondly,
* if attached, how many read locks are taken, and thirdly, if the
@@ -1050,7 +1048,6 @@ struct vm_area_struct {
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map vmlock_dep_map;
#endif
-#endif
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree.
@@ -1249,7 +1246,6 @@ struct mm_struct {
* init_mm.mmlist, and are protected
* by mmlist_lock
*/
-#ifdef CONFIG_PER_VMA_LOCK
struct rcuwait vma_writer_wait;
/*
* This field has lock-like semantics, meaning it is sometimes
@@ -1269,7 +1265,6 @@ struct mm_struct {
* mmap_lock.
*/
seqcount_t mm_lock_seq;
-#endif
#ifdef CONFIG_FUTEX_PRIVATE_HASH
struct mutex futex_hash_lock;
struct futex_private_hash __rcu *futex_phash;
diff -puN kernel/fork.c~unconditional-vma-locks kernel/fork.c
--- a/kernel/fork.c~unconditional-vma-locks 2026-04-29 11:18:48.774557336 -0700
+++ b/kernel/fork.c 2026-04-29 11:18:49.092569576 -0700
@@ -1067,9 +1067,7 @@ static void mmap_init_lock(struct mm_str
{
init_rwsem(&mm->mmap_lock);
mm_lock_seqcount_init(mm);
-#ifdef CONFIG_PER_VMA_LOCK
rcuwait_init(&mm->vma_writer_wait);
-#endif
}
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
diff -puN mm/Kconfig~unconditional-vma-locks mm/Kconfig
--- a/mm/Kconfig~unconditional-vma-locks 2026-04-29 11:18:48.838559801 -0700
+++ b/mm/Kconfig 2026-04-29 11:18:49.093569614 -0700
@@ -1394,19 +1394,6 @@ config LRU_GEN_STATS
config LRU_GEN_WALKS_MMU
def_bool y
depends on LRU_GEN && ARCH_HAS_HW_PTE_YOUNG
-# }
-
-config ARCH_SUPPORTS_PER_VMA_LOCK
- def_bool n
-
-config PER_VMA_LOCK
- def_bool y
- depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
- help
- Allow per-vma locking during page fault handling.
-
- This feature allows locking each virtual memory area separately when
- handling page faults instead of taking mmap_lock.
config LOCK_MM_AND_FIND_VMA
bool
diff -puN mm/mmap_lock.c~unconditional-vma-locks mm/mmap_lock.c
--- a/mm/mmap_lock.c~unconditional-vma-locks 2026-04-29 11:18:49.084569267 -0700
+++ b/mm/mmap_lock.c 2026-04-29 11:18:49.093569614 -0700
@@ -44,7 +44,6 @@ EXPORT_SYMBOL(__mmap_lock_do_trace_relea
#endif /* CONFIG_TRACING */
#ifdef CONFIG_MMU
-#ifdef CONFIG_PER_VMA_LOCK
/* State shared across __vma_[start, end]_exclude_readers. */
struct vma_exclude_readers_state {
@@ -431,7 +430,6 @@ fallback:
return vma;
}
-#endif /* CONFIG_PER_VMA_LOCK */
#ifdef CONFIG_LOCK_MM_AND_FIND_VMA
#include <linux/extable.h>
_
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
2026-04-29 18:19 ` [PATCH 1/6] mm: Make per-VMA locks available universally Dave Hansen
@ 2026-05-08 10:12 ` David Hildenbrand (Arm)
2026-05-08 10:58 ` David Hildenbrand (Arm)
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 10:12 UTC (permalink / raw)
To: Dave Hansen, linux-kernel
Cc: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On 4/29/26 20:19, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The per-VMA locks have been around for several years. They've had some
> bugs worked out of them and have seen quite wide use. However, they
> are still only available when architectures explicitly enable them.
> Remove the conditional compilation around the per-VMA locks, making
> them available on all architectures and configs.
Yes, we should really just make it a fixed part of the kernel design now.
>
> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
> when the architecture started using per-VMA locks in the fault
> handler. But, contrary to the naming, the Kconfig option does not
> really indicate whether the architecture supports per-VMA locks or
> not. It is more of a marker for whether the architecture is likely to
> benefit from per-VMA locks.
>
> To me, the most important side-effect of universal availability
> is letting per-VMA locks be used in SMP=n configs. This lets us use
> per-VMA locking in all x86 code without fallbacks.
>
> Overall, this just generally makes the kernel simpler. Just look at
> the diffstat. It also opens the door to users that want to use the
> per-VMA locks in common code. Doing *that* can bring additional
> simplifications.
>
> The downside of this is adding some fields to vm_area_struct and
> mm_struct.
I'd assume most distributions would already enable it.
mm_struct is very likely not a problem.
On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.
I'd expect that on most 64bit configs we usually end up with 192 bytes today.
Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
lucky to jump between sizes on most configs.
Maybe on 32bit? Not sure if anybody would really notice.
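(For anyone who wants to double-check a particular config: assuming the
kernel was built with debug info, something like

	pahole -C vm_area_struct vmlinux

prints the structure layout and total size.)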
> I suspect there are some very simple ways to implement the
> per-VMA locks that don't require any additional fields, especially if
> such an approach was limited to SMP=n configs*. For now, do the
> simplest thing: use the same implementation everywhere.
>
> * For example, since SMP=n configs don't care much about scalability or
> false sharing, there could be a single, global VMA seqcount that is
> bumped when any VMA is modified instead of having space in each VMA
> for a seqcount.
I'd do that only if we actually determine this to be a problem.
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
2026-05-08 10:12 ` David Hildenbrand (Arm)
@ 2026-05-08 10:58 ` David Hildenbrand (Arm)
2026-05-08 16:55 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 10:58 UTC (permalink / raw)
To: Dave Hansen, linux-kernel
Cc: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On 5/8/26 12:12, David Hildenbrand (Arm) wrote:
> On 4/29/26 20:19, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> The per-VMA locks have been around for several years. They've had some
>> bugs worked out of them and have seen quite wide use. However, they
>> are still only available when architectures explicitly enable them.
>> Remove the conditional compilation around the per-VMA locks, making
>> them available on all architectures and configs.
>
> Yes, we should really just make it a fixed part of the kernel design now.
>
>>
>> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
>> when the architecture started using per-VMA locks in the fault
>> handler. But, contrary to the naming, the Kconfig option does not
>> really indicate whether the architecture supports per-VMA locks or
>> not. It is more of a marker for whether the architecture is likely to
>> benefit from per-VMA locks.
>>
>> To me, the most important side-effect of universal availability
>> is letting per-VMA locks be used in SMP=n configs. This lets us use
>> per-VMA locking in all x86 code without fallbacks.
>>
>> Overall, this just generally makes the kernel simpler. Just look at
>> the diffstat. It also opens the door to users that want to use the
>> per-VMA locks in common code. Doing *that* can bring additional
>> simplifications.
>>
>> The downside of this is adding some fields to vm_area_struct and
>> mm_struct.
>
> I'd assume most distributions would already enable it.
>
> mm_struct is very likely not a problem.
>
> > On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
> > seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
> > that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.
>
> I'd expect that on most 64bit configs we usually end up with 192 bytes today.
>
> Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
> lucky to jump between sizes on most configs.
As Vlastimil reminded me, they have separate slab caches, so they are better
packed. So I don't think a small increase there would really be a problem.
--
Cheers,
David
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
2026-05-08 10:58 ` David Hildenbrand (Arm)
@ 2026-05-08 16:55 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:55 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: Dave Hansen, linux-kernel, Andrew Morton, Liam R. Howlett,
linux-mm, Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Fri, May 08, 2026 at 12:58:35PM +0200, David Hildenbrand (Arm) wrote:
> On 5/8/26 12:12, David Hildenbrand (Arm) wrote:
> > On 4/29/26 20:19, Dave Hansen wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> The per-VMA locks have been around for several years. They've had some
> >> bugs worked out of them and have seen quite wide use. However, they
> >> are still only available when architectures explicitly enable them.
> >> Remove the conditional compilation around the per-VMA locks, making
> >> them available on all architectures and configs.
> >
> > Yes, we should really just make it a fixed part of the kernel design now.
Agreed
> >
> >>
> >> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
> >> when the architecture started using per-VMA locks in the fault
> >> handler. But, contrary to the naming, the Kconfig option does not
> >> really indicate whether the architecture supports per-VMA locks or
> >> not. It is more of a marker for whether the architecture is likely to
> >> benefit from per-VMA locks.
> >>
> >> To me, the most important side-effect of universal availability
> >> is letting per-VMA locks be used in SMP=n configs. This lets us use
> >> per-VMA locking in all x86 code without fallbacks.
> >>
> >> Overall, this just generally makes the kernel simpler. Just look at
> >> the diffstat. It also opens the door to users that want to use the
> >> per-VMA locks in common code. Doing *that* can bring additional
> >> simplifications.
> >>
> >> The downside of this is adding some fields to vm_area_struct and
> >> mm_struct.
> >
> > I'd assume most distributions would already enable it.
Yes, and I think any modern system will treat it as a necessity!
> >
> > mm_struct is very likely not a problem.
No, not at all, that's a lost cause :)) But it is also less important as it is
less propagated, obviously.
> >
> > On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
> > seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
> > that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.
> >
> > I'd expect that on most 64bit configs we usually end up with 192 bytes today.
> >
> > Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
> > lucky to jump between sizes on most configs.
>
> As Vlastimil reminded me, they have separate slab caches, so they are better
> packed. So I don't think a small increase there would really be a problem.
Good to have this information also!
>
> --
> Cheers,
>
> David
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
2026-04-29 18:19 ` [PATCH 1/6] mm: Make per-VMA locks available universally Dave Hansen
@ 2026-04-29 18:19 ` Dave Hansen
2026-05-08 17:06 ` Lorenzo Stoakes
2026-04-29 18:19 ` [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers Dave Hansen
` (7 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:19 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
tl;dr: lock_vma_under_rcu() is already a trylock. No need to do both
it and mmap_read_trylock().
Long Version:
== Background ==
Historically, binder used an mmap_read_trylock() in its shrinker code.
This ensures that reclaim is not blocked on an mmap_lock. Commit
95bc2d4a9020 ("binder: use per-vma lock in page reclaiming") added
support for the per-VMA lock, but but left mmap_read_trylock() as a
fallback.
This was presumably because the per-VMA locking can fail for several
reasons and most (all?) lock_vma_under_rcu() callers have a fallback
to mmap_read_trylock().
== Problem ==
The fallback is not worth the complexity here. lock_vma_under_rcu() is
essentially already a non-blocking trylock. The main reason it fails
is also the reason mmap_read_trylock() fails: something is holding
mmap_write_lock().
The only remedy for a collision with mmap_write_lock() is to wait,
which this code can not do. So the "fallback" after
lock_vma_under_rcu() failure is not really a fallback: it is really
likely to just be retrying in vain. That retry in and of itself isn't
horrible. But it adds complexity.
== Solution ==
Now that per-VMA locks are universally available, lock_vma_under_rcu()
will not persistently fail. Rely on it alone and simplify the code.
Full disclosure: I originally tried to do this with
lock_vma_under_rcu_wait(), but it did not fit well with the mmap_lock
trylock semantics. Claude caught this in a review and suggested the
approach in this patch. It seemed sane to me. So, Suggested-by: Claude,
I guess.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/drivers/android/binder_alloc.c | 22 +++++-----------------
1 file changed, 5 insertions(+), 17 deletions(-)
diff -puN drivers/android/binder_alloc.c~binder-try-vma-lock drivers/android/binder_alloc.c
--- a/drivers/android/binder_alloc.c~binder-try-vma-lock 2026-04-29 11:18:50.066607065 -0700
+++ b/drivers/android/binder_alloc.c 2026-04-29 11:18:50.069607180 -0700
@@ -1142,7 +1142,6 @@ enum lru_status binder_alloc_free_page(s
struct vm_area_struct *vma;
struct page *page_to_free;
unsigned long page_addr;
- int mm_locked = 0;
size_t index;
if (!mmget_not_zero(mm))
@@ -1151,15 +1150,10 @@ enum lru_status binder_alloc_free_page(s
index = mdata->page_index;
page_addr = alloc->vm_start + index * PAGE_SIZE;
- /* attempt per-vma lock first */
+ /* attempt per-vma lock */
vma = lock_vma_under_rcu(mm, page_addr);
- if (!vma) {
- /* fall back to mmap_lock */
- if (!mmap_read_trylock(mm))
- goto err_mmap_read_lock_failed;
- mm_locked = 1;
- vma = vma_lookup(mm, page_addr);
- }
+ if (!vma)
+ goto err_mmap_read_lock_failed;
if (!mutex_trylock(&alloc->mutex))
goto err_get_alloc_mutex_failed;
@@ -1191,10 +1185,7 @@ enum lru_status binder_alloc_free_page(s
}
mutex_unlock(&alloc->mutex);
- if (mm_locked)
- mmap_read_unlock(mm);
- else
- vma_end_read(vma);
+ vma_end_read(vma);
mmput_async(mm);
binder_free_page(page_to_free);
@@ -1203,10 +1194,7 @@ enum lru_status binder_alloc_free_page(s
err_invalid_vma:
mutex_unlock(&alloc->mutex);
err_get_alloc_mutex_failed:
- if (mm_locked)
- mmap_read_unlock(mm);
- else
- vma_end_read(vma);
+ vma_end_read(vma);
err_mmap_read_lock_failed:
mmput_async(mm);
err_mmget:
_
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock
2026-04-29 18:19 ` [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock Dave Hansen
@ 2026-05-08 17:06 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:06 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Wed, Apr 29, 2026 at 11:19:57AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> tl;dr: lock_vma_under_rcu() is already a trylock. No need to do both
> it and mmap_read_trylock().
>
> Long Version:
>
> == Background ==
>
> Historically, binder used an mmap_read_trylock() in its shrinker code.
> This ensures that reclaim is not blocked on an mmap_lock. Commit
> 95bc2d4a9020 ("binder: use per-vma lock in page reclaiming") added
> support for the per-VMA lock, but left mmap_read_trylock() as a
> fallback.
>
> This was presumably because the per-VMA locking can fail for several
> reasons and most (all?) lock_vma_under_rcu() callers have a fallback
> to mmap_read_trylock().
>
> == Problem ==
>
> The fallback is not worth the complexity here. lock_vma_under_rcu() is
> essentially already a non-blocking trylock. The main reason it fails
> is also the reason mmap_read_trylock() fails: something is holding
> mmap_write_lock().
>
> The only remedy for a collision with mmap_write_lock() is to wait,
> which this code can not do. So the "fallback" after
> lock_vma_under_rcu() failure is not really a fallback: it is really
> likely to just be retrying in vain. That retry in and of itself isn't
> horrible. But it adds complexity.
>
> == Solution ==
>
> Now that per-VMA locks are universally available, lock_vma_under_rcu()
> will not persistently fail. Rely on it alone and simplify the code.
>
> Full disclosure: I originally tried to do this with
> lock_vma_under_rcu_wait(), but it did not fit well with the mmap_lock
> trylock semantics. Claude caught this in a review and suggested the
> approach in this patch. It seemed sane to me. So, Suggested-by: Claude,
> I guess.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
I mean this seems reasonable to me, I don't really understand why we'd fall back
to mmap... try lock :)
If semantically you're trylocking then as you say, lock_vma_under_rcu() already
does that.
Honestly I feel this could be submitted separately from the series.
I'm not a binder guy, but this looks right to me so:
Acked-by: Lorenzo Stoakes <ljs@kernel.org> with nit below addressed.
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
> b/drivers/android/binder_alloc.c | 22 +++++-----------------
> 1 file changed, 5 insertions(+), 17 deletions(-)
>
> diff -puN drivers/android/binder_alloc.c~binder-try-vma-lock drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-try-vma-lock 2026-04-29 11:18:50.066607065 -0700
> +++ b/drivers/android/binder_alloc.c 2026-04-29 11:18:50.069607180 -0700
> @@ -1142,7 +1142,6 @@ enum lru_status binder_alloc_free_page(s
> struct vm_area_struct *vma;
> struct page *page_to_free;
> unsigned long page_addr;
> - int mm_locked = 0;
Man why are we using int instead of a bool in 2026 in the first place :P
> size_t index;
>
> if (!mmget_not_zero(mm))
> @@ -1151,15 +1150,10 @@ enum lru_status binder_alloc_free_page(s
> index = mdata->page_index;
> page_addr = alloc->vm_start + index * PAGE_SIZE;
>
> - /* attempt per-vma lock first */
> + /* attempt per-vma lock */
> vma = lock_vma_under_rcu(mm, page_addr);
> - if (!vma) {
> - /* fall back to mmap_lock */
> - if (!mmap_read_trylock(mm))
> - goto err_mmap_read_lock_failed;
> - mm_locked = 1;
> - vma = vma_lookup(mm, page_addr);
> - }
> + if (!vma)
> + goto err_mmap_read_lock_failed;
Nit, but we probably want to rename that to err_vma_lock_failed or something!
>
> if (!mutex_trylock(&alloc->mutex))
> goto err_get_alloc_mutex_failed;
> @@ -1191,10 +1185,7 @@ enum lru_status binder_alloc_free_page(s
> }
>
> mutex_unlock(&alloc->mutex);
> - if (mm_locked)
> - mmap_read_unlock(mm);
> - else
> - vma_end_read(vma);
> + vma_end_read(vma);
> mmput_async(mm);
> binder_free_page(page_to_free);
>
> @@ -1203,10 +1194,7 @@ enum lru_status binder_alloc_free_page(s
> err_invalid_vma:
> mutex_unlock(&alloc->mutex);
> err_get_alloc_mutex_failed:
> - if (mm_locked)
> - mmap_read_unlock(mm);
> - else
> - vma_end_read(vma);
> + vma_end_read(vma);
> err_mmap_read_lock_failed:
> mmput_async(mm);
> err_mmget:
> _
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
2026-04-29 18:19 ` [PATCH 1/6] mm: Make per-VMA locks available universally Dave Hansen
2026-04-29 18:19 ` [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock Dave Hansen
@ 2026-04-29 18:19 ` Dave Hansen
2026-05-08 17:26 ` Lorenzo Stoakes
2026-04-29 18:20 ` [PATCH 4/6] binder: Remove mmap_lock fallback Dave Hansen
` (6 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:19 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
== Background ==
There are basically two parallel ways to look up a VMA: the
traditional way, which is protected by mmap_lock, and the RCU-based
per-VMA lock way which is based on RCU and refcounts.
== Problems ==
The mmap_lock one is more straightforward to use but it has a big
disadvantage in that it can not be mixed with page faults since those
can take mmap_lock for read.
The RCU one can be mixed with faults, but it is not available in all
configs, so all RCU users need to be able to fall back to the
traditional way.
== Solution ==
Add a variant of the RCU-based lookup that waits for writers. This is
basically the same as the existing RCU-based lookup, but it also takes
mmap_lock for read and waits for writers to finish before returning
the VMA. This has two big advantages:
1. Callers do not need to have a fallback path for when they
collide with writers.
2. It can be used in contexts where page faults can happen because
it can take the mmap_lock for read but never *holds* it.
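For illustration, a caller then follows roughly this pattern (this is the
shape the binder conversion later in the series ends up with; the error
code and comments are just examples):

	vma = lock_vma_under_rcu_wait(mm, addr);
	if (!vma)
		return -ESRCH;	/* no VMA covers 'addr' at all */

	/* ... use 'vma' under its per-VMA read lock ... */

	vma_end_read(vma);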
== Discussion ==
I am not married to the naming here at all. Naming suggestions would
be much appreciated.
This basically uses mmap_lock to wait for writers, nothing else. The
VMA is obviously stable under mmap_read_lock() and the code _can_
likely take advantage of that and possibly even remove the goto. For
instance, it could (probably) bump the VMA refcount and exclude future
writers. That would eliminate the goto.
But the approach as-is is probably the smallest line count and
arguably the simplest approach. It is a good place to start a
conversation if nothing else.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/include/linux/mmap_lock.h | 2 ++
b/mm/mmap_lock.c | 43 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 45 insertions(+)
diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
--- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait 2026-04-29 11:18:50.633628887 -0700
+++ b/include/linux/mmap_lock.h 2026-04-29 11:18:50.707631737 -0700
@@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address);
+struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
+ unsigned long address);
/*
* Locks next vma pointed by the iterator. Confirms the locked vma has not
diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
--- a/mm/mmap_lock.c~lock-vma-under-rcu-wait 2026-04-29 11:18:50.704631622 -0700
+++ b/mm/mmap_lock.c 2026-04-29 11:18:50.707631737 -0700
@@ -340,6 +340,49 @@ inval:
return NULL;
}
+/*
+ * Find the VMA covering 'address' and lock it for reading. Waits for writers to
+ * finish if the VMA is being modified. Returns NULL if there is no VMA covering
+ * 'address'.
+ *
+ * The fast path does not take mmap lock.
+ */
+struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
+ unsigned long address)
+{
+ struct vm_area_struct *vma;
+
+retry:
+ vma = lock_vma_under_rcu(mm, address);
+ /* Fast path: return stable VMA covering 'address': */
+ if (vma)
+ return vma;
+
+ /*
+ * Slow path: the VMA covering 'address' is being modified.
+ * or there is no VMA covering 'address'. Rule out the
+ * possibility that the VMA is being modified:
+ */
+ mmap_read_lock(mm);
+ vma = vma_lookup(mm, address);
+ mmap_read_unlock(mm);
+
+ /* There was for sure no VMA covering 'address': */
+ if (!vma)
+ return NULL;
+
+ /*
+ * VMA was likely being modified during RCU lookup. Try again.
+ * mmap_read_lock() waited for the writer to complete and the
+ * writer is now done.
+ *
+ * There is no guarantee that any single retry will succeed,
+ * and it is possible but highly unlikely this will loop
+ * forever.
+ */
+ goto retry;
+}
+
static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
struct vma_iterator *vmi,
unsigned long from_addr)
_
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
2026-04-29 18:19 ` [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers Dave Hansen
@ 2026-05-08 17:26 ` Lorenzo Stoakes
2026-05-08 20:15 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:26 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> == Background ==
>
> There are basically two parallel ways to look up a VMA: the
> traditional way, which is protected by mmap_lock, and the RCU-based
> per-VMA lock way which is based on RCU and refcounts.
>
> == Problems ==
>
> The mmap_lock one is more straightforward to use but it has a big
> disadvantage in that it can not be mixed with page faults since those
> can take mmap_lock for read.
>
> The RCU one can be mixed with faults, but it is not available in all
> configs, so all RCU users need to be able to fall back to the
> traditional way.
>
> == Solution ==
>
> Add a variant of the RCU-based lookup that waits for writers. This is
> basically the same as the existing RCU-based lookup, but it also takes
> mmap_lock for read and waits for writers to finish before returning
> the VMA. This has two big advantages:
>
> 1. Callers do not need to have a fallback path for when they
> collide with writers.
> 2. It can be used in contexts where page faults can happen because
> it can take the mmap_lock for read but never *holds* it.
>
> == Discussion ==
>
> I am not married to the naming here at all. Naming suggestions would
> be much appreciated.
>
> This basically uses mmap_lock to wait for writers, nothing else. The
> VMA is obviously stable under mmap_read_lock() and the code _can_
> likely take advantage of that and possibly even remove the goto. For
> instance, it could (probably) bump the VMA refcount and exclude future
> writers. That would eliminate the goto.
>
> But the approach as-is is probably the smallest line count and
> arguably the simplest approach. It is a good place to start a
> conversation if nothing else.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
> b/include/linux/mmap_lock.h | 2 ++
> b/mm/mmap_lock.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 45 insertions(+)
>
> diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait 2026-04-29 11:18:50.633628887 -0700
> +++ b/include/linux/mmap_lock.h 2026-04-29 11:18:50.707631737 -0700
> @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
>
> struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> unsigned long address);
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> + unsigned long address);
>
> /*
> * Locks next vma pointed by the iterator. Confirms the locked vma has not
> diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait 2026-04-29 11:18:50.704631622 -0700
> +++ b/mm/mmap_lock.c 2026-04-29 11:18:50.707631737 -0700
> @@ -340,6 +340,49 @@ inval:
> return NULL;
> }
>
> +/*
> + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> + * 'address'.
> + *
> + * The fast path does not take mmap lock.
> + */
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> + unsigned long address)
> +{
> + struct vm_area_struct *vma;
> +
> +retry:
> + vma = lock_vma_under_rcu(mm, address);
> + /* Fast path: return stable VMA covering 'address': */
> + if (vma)
> + return vma;
> +
> + /*
> + * Slow path: the VMA covering 'address' is being modified.
> + * or there is no VMA covering 'address'. Rule out the
> + * possibility that the VMA is being modified:
> + */
> + mmap_read_lock(mm);
> + vma = vma_lookup(mm, address);
> + mmap_read_unlock(mm);
> +
> + /* There was for sure no VMA covering 'address': */
> + if (!vma)
> + return NULL;
> +
> + /*
> + * VMA was likely being modified during RCU lookup. Try again.
> + * mmap_read_lock() waited for the writer to complete and the
> + * writer is now done.
> + *
> + * There is no guarantee that any single retry will succeed,
> + * and it is possible but highly unlikely this will loop
> + * forever.
> + */
> + goto retry;
> +}
Hmm yeah this is not ideal :)
You don't have to do any of this; we already have logic to help out here -
vma_start_read_locked().
That uses the fact that the mmap read lock is held to pin the VMA lock, because
VMA write locks require an mmap write lock, and the mmap read lock prevents them.
That way you can eliminate the retry.
So instead:
/*
* Tries to lock under RCU, failing that it acquires the VMA lock with the mmap
* read lock held.
*/
struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
unsigned long address)
{
struct vm_area_struct *vma;
might_sleep();
vma = lock_vma_under_rcu(mm, address);
if (vma)
return vma;
/* Slow path: preclude VMA writers by getting mmap read lock. */
guard(rwsem_read)(&mm->mmap_lock);
vma = vma_lookup(mm, address);
/* VMA isn't there. */
if (!vma)
return NULL;
return vma_start_read_locked(vma);
}
(Untested, not even build tested code)
Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
also does some trace stuff, but the guard makes it WAY nicer so (when it goes
out of scope the mmap read lock is dropped).
Maybe we could add a custom mmap lock guard to cover that too?
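(Something like this, maybe - untested, and assuming nothing equivalent
exists already:

	DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
		     mmap_read_lock(_T), mmap_read_unlock(_T))

then the slow path above could just do guard(mmap_read_lock)(mm); and the
mmap_lock tracepoints would still fire.)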
Suren - I actually wonder if vma_start_read_locked() actually needs to return a
boolean? The failure cases for __refcount_inc_not_zero_limited_acquire() are -
detached or excluding readers on write/detach.
But in both of those cases, vma_lookup() would surely not find the VMA, and
since we're precluding writers (so no write lock, no detach possible) that means
we should never hit it?
Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
> +
> static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
> struct vma_iterator *vmi,
> unsigned long from_addr)
> _
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
2026-05-08 17:26 ` Lorenzo Stoakes
@ 2026-05-08 20:15 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 20:15 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Fri, May 08, 2026 at 06:26:57PM +0100, Lorenzo Stoakes wrote:
> On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > == Background ==
> >
> > There are basically two parallel ways to look up a VMA: the
> > traditional way, which is protected by mmap_lock, and the RCU-based
> > per-VMA lock way which is based on RCU and refcounts.
> >
> > == Problems ==
> >
> > The mmap_lock one is more straightforward to use but it has a big
> > disadvantage in that it can not be mixed with page faults since those
> > can take mmap_lock for read.
> >
> > The RCU one can be mixed with faults, but it is not available in all
> > configs, so all RCU users need to be able to fall back to the
> > traditional way.
> >
> > == Solution ==
> >
> > Add a variant of the RCU-based lookup that waits for writers. This is
> > basically the same as the existing RCU-based lookup, but it also takes
> > mmap_lock for read and waits for writers to finish before returning
> > the VMA. This has two big advantages:
> >
> > 1. Callers do not need to have a fallback path for when they
> > collide with writers.
> > 2. It can be used in contexts where page faults can happen because
> > it can take the mmap_lock for read but never *holds* it.
> >
> > == Discussion ==
> >
> > I am not married to the naming here at all. Naming suggestions would
> > be much appreciated.
> >
> > This basically uses mmap_lock to wait for writers, nothing else. The
> > VMA is obviously stable under mmap_read_lock() and the code _can_
> > likely take advantage of that and possibly even remove the goto. For
> > instance, it could (probably) bump the VMA refcount and exclude future
> > writers. That would eliminate the goto.
> >
> > But the approach as-is is probably the smallest line count and
> > arguably the simplest approach. It is a good place to start a
> > conversation if nothing else.
> >
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: Vlastimil Babka <vbabka@kernel.org>
> > Cc: Shakeel Butt <shakeel.butt@linux.dev>
> > Cc: linux-mm@kvack.org
> > ---
> >
> > b/include/linux/mmap_lock.h | 2 ++
> > b/mm/mmap_lock.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 45 insertions(+)
> >
> > diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> > --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait 2026-04-29 11:18:50.633628887 -0700
> > +++ b/include/linux/mmap_lock.h 2026-04-29 11:18:50.707631737 -0700
> > @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
> >
> > struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> > unsigned long address);
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > + unsigned long address);
> >
> > /*
> > * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> > --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait 2026-04-29 11:18:50.704631622 -0700
> > +++ b/mm/mmap_lock.c 2026-04-29 11:18:50.707631737 -0700
> > @@ -340,6 +340,49 @@ inval:
> > return NULL;
> > }
> >
> > +/*
> > + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> > + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> > + * 'address'.
> > + *
> > + * The fast path does not take mmap lock.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > + struct vm_area_struct *vma;
> > +
> > +retry:
> > + vma = lock_vma_under_rcu(mm, address);
> > + /* Fast path: return stable VMA covering 'address': */
> > + if (vma)
> > + return vma;
> > +
> > + /*
> > + * Slow path: the VMA covering 'address' is being modified.
> > + * or there is no VMA covering 'address'. Rule out the
> > + * possibility that the VMA is being modified:
> > + */
> > + mmap_read_lock(mm);
> > + vma = vma_lookup(mm, address);
> > + mmap_read_unlock(mm);
> > +
> > + /* There was for sure no VMA covering 'address': */
> > + if (!vma)
> > + return NULL;
> > +
> > + /*
> > + * VMA was likely being modified during RCU lookup. Try again.
> > + * mmap_read_lock() waited for the writer to complete and the
> > + * writer is now done.
> > + *
> > + * There is no guarantee that any single retry will succeed,
> > + * and it is possible but highly unlikely this will loop
> > + * forever.
> > + */
> > + goto retry;
> > +}
>
> Hmm yeah this is not ideal :)
>
> You don't have to do any of this we already have logic to help out here -
> vma_start_read_locked().
>
> That uses the fact the mmap read lock is held to pin the VMA lock, because VMA
> write locks require an mmap write lock, and the mmap read lock prevents them.
>
> That way you can eliminate the retry.
>
> So instead:
>
> /*
> * Tries to lock under RCU, failing that it acquires the VMA lock with the mmap
> * read lock held.
> */
> struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> unsigned long address)
> {
> struct vm_area_struct *vma;
>
> might_sleep();
>
> vma = lock_vma_under_rcu(mm, address);
> if (vma)
> return vma;
>
> /* Slow path: preclude VMA writers by getting mmap read lock. */
> guard(rwsem_read)(&mm->mmap_lock);
Actually we'd possibly want the killable version of this? So a fatal signal can
interrupt and not indefinitely block others.
> vma = vma_lookup(mm, address);
> /* VMA isn't there. */
> if (!vma)
> return NULL;
>
> return vma_start_read_locked(vma);
Sorry this should be:
if (!vma_start_read_locked(vma))
return NULL;
return vma;
But as per below, I'm not sure we will actually see vma_start_read_locked() fail; it
doesn't touch the write side sequence number, so overflow shouldn't be a problem.
> }
>
> (Untested, not even build tested code)
>
> Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
> also does some trace stuff, but the guard makes it WAY nicer so (when it goes
> out of scope the mmap read lock is dropped).
>
> Maybe we could add a custom mmap lock guard to cover that too?
>
> Suren - I actually wonder if vma_start_read_locked() actually needs to return a
> boolean? The failure cases for __refcount_inc_not_zero_limited_acquire() are -
> detached or excluding readers on write/detach.
>
> But in both of those cases, vma_lookup() would surely not find the VMA, and
> since we're precluding writers (so no write lock, no detach possible) that means
> we should never hit it?
>
> Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
>
>
> > +
> > static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
> > struct vma_iterator *vmi,
> > unsigned long from_addr)
> > _
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 4/6] binder: Remove mmap_lock fallback
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (2 preceding siblings ...)
2026-04-29 18:19 ` [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers Dave Hansen
@ 2026-04-29 18:20 ` Dave Hansen
2026-05-08 17:29 ` Lorenzo Stoakes
2026-04-29 18:20 ` [PATCH 5/6] tcp: Remove mmap_lock fallback path Dave Hansen
` (5 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:20 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
Previously, the per-VMA locking could fail in the face of writers,
which necessitated a fallback to mmap_lock. The new
lock_vma_under_rcu_wait() will wait for writers instead of failing.
Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/drivers/android/binder_alloc.c | 17 +++++------------
1 file changed, 5 insertions(+), 12 deletions(-)
diff -puN drivers/android/binder_alloc.c~binder-vma-waiter drivers/android/binder_alloc.c
--- a/drivers/android/binder_alloc.c~binder-vma-waiter 2026-04-29 11:18:51.307654829 -0700
+++ b/drivers/android/binder_alloc.c 2026-04-29 11:18:51.310654944 -0700
@@ -259,21 +259,14 @@ static int binder_page_insert(struct bin
struct vm_area_struct *vma;
int ret = -ESRCH;
- /* attempt per-vma lock first */
- vma = lock_vma_under_rcu(mm, addr);
- if (vma) {
- if (binder_alloc_is_mapped(alloc))
- ret = vm_insert_page(vma, addr, page);
- vma_end_read(vma);
+ vma = lock_vma_under_rcu_wait(mm, addr);
+ if (!vma)
return ret;
- }
- /* fall back to mmap_lock */
- mmap_read_lock(mm);
- vma = vma_lookup(mm, addr);
- if (vma && binder_alloc_is_mapped(alloc))
+ if (binder_alloc_is_mapped(alloc))
ret = vm_insert_page(vma, addr, page);
- mmap_read_unlock(mm);
+
+ vma_end_read(vma);
return ret;
}
_
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 4/6] binder: Remove mmap_lock fallback
2026-04-29 18:20 ` [PATCH 4/6] binder: Remove mmap_lock fallback Dave Hansen
@ 2026-05-08 17:29 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:29 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Wed, Apr 29, 2026 at 11:20:00AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Previously, the per-VMA locking could fail in the face of writers,
> which necessitated a fallback to mmap_lock. The new
> lock_vma_under_rcu_wait() will wait for writers instead of failing.
>
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
LGTM in principle, though again not a binder dev so just an A-b :)
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
> b/drivers/android/binder_alloc.c | 17 +++++------------
> 1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff -puN drivers/android/binder_alloc.c~binder-vma-waiter drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-vma-waiter 2026-04-29 11:18:51.307654829 -0700
> +++ b/drivers/android/binder_alloc.c 2026-04-29 11:18:51.310654944 -0700
> @@ -259,21 +259,14 @@ static int binder_page_insert(struct bin
> struct vm_area_struct *vma;
> int ret = -ESRCH;
>
> - /* attempt per-vma lock first */
> - vma = lock_vma_under_rcu(mm, addr);
> - if (vma) {
> - if (binder_alloc_is_mapped(alloc))
> - ret = vm_insert_page(vma, addr, page);
> - vma_end_read(vma);
> + vma = lock_vma_under_rcu_wait(mm, addr);
Yeah this name is definitely iffy haha!
> + if (!vma)
> return ret;
> - }
>
> - /* fall back to mmap_lock */
> - mmap_read_lock(mm);
> - vma = vma_lookup(mm, addr);
> - if (vma && binder_alloc_is_mapped(alloc))
> + if (binder_alloc_is_mapped(alloc))
> ret = vm_insert_page(vma, addr, page);
> - mmap_read_unlock(mm);
> +
> + vma_end_read(vma);
>
> return ret;
> }
> _
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 5/6] tcp: Remove mmap_lock fallback path
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (3 preceding siblings ...)
2026-04-29 18:20 ` [PATCH 4/6] binder: Remove mmap_lock fallback Dave Hansen
@ 2026-04-29 18:20 ` Dave Hansen
2026-05-08 17:32 ` Lorenzo Stoakes
2026-04-29 18:20 ` [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path Dave Hansen
` (4 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:20 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
Previously, the per-VMA locking could fail in the face of writers,
which necessitated a fallback to mmap_lock. The new
lock_vma_under_rcu_wait() will wait for writers instead of failing.
Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
This really is a nice cleanup. It removes the need to pass the lock
state back and forth to find_tcp_vma().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/net/ipv4/tcp.c | 31 +++++++++----------------------
1 file changed, 9 insertions(+), 22 deletions(-)
diff -puN net/ipv4/tcp.c~ipv4-tcp-vma-waiter net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~ipv4-tcp-vma-waiter 2026-04-29 11:18:51.870676498 -0700
+++ b/net/ipv4/tcp.c 2026-04-29 11:18:51.874676652 -0700
@@ -2171,27 +2171,18 @@ static void tcp_zc_finalize_rx_tstamp(st
}
static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
- unsigned long address,
- bool *mmap_locked)
+ unsigned long address)
{
- struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
+ struct vm_area_struct *vma = lock_vma_under_rcu_wait(mm, address);
- if (vma) {
- if (vma->vm_ops != &tcp_vm_ops) {
- vma_end_read(vma);
- return NULL;
- }
- *mmap_locked = false;
- return vma;
- }
+ if (!vma)
+ return NULL;
- mmap_read_lock(mm);
- vma = vma_lookup(mm, address);
- if (!vma || vma->vm_ops != &tcp_vm_ops) {
- mmap_read_unlock(mm);
+ if (vma->vm_ops != &tcp_vm_ops) {
+ vma_end_read(vma);
return NULL;
}
- *mmap_locked = true;
+
return vma;
}
@@ -2212,7 +2203,6 @@ static int tcp_zerocopy_receive(struct s
u32 seq = tp->copied_seq;
u32 total_bytes_to_map;
int inq = tcp_inq(sk);
- bool mmap_locked;
int ret;
zc->copybuf_len = 0;
@@ -2237,7 +2227,7 @@ static int tcp_zerocopy_receive(struct s
return 0;
}
- vma = find_tcp_vma(current->mm, address, &mmap_locked);
+ vma = find_tcp_vma(current->mm, address);
if (!vma)
return -EINVAL;
@@ -2319,10 +2309,7 @@ static int tcp_zerocopy_receive(struct s
zc, total_bytes_to_map);
}
out:
- if (mmap_locked)
- mmap_read_unlock(current->mm);
- else
- vma_end_read(vma);
+ vma_end_read(vma);
/* Try to copy straggler data. */
if (!ret)
copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
_
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 5/6] tcp: Remove mmap_lock fallback path
2026-04-29 18:20 ` [PATCH 5/6] tcp: Remove mmap_lock fallback path Dave Hansen
@ 2026-05-08 17:32 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:32 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Wed, Apr 29, 2026 at 11:20:02AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Previously, the per-VMA locking could fail in the face of writers,
> which necessitated a fallback to mmap_lock. The new
> lock_vma_under_rcu_wait() will wait for writers instead of failing.
>
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
>
> This really is a nice cleanup. It removes the need to pass the lock
> state back and forth to find_tcp_vma().
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Yeah, LGTM again, though am not a networking guy so:
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
> b/net/ipv4/tcp.c | 31 +++++++++----------------------
> 1 file changed, 9 insertions(+), 22 deletions(-)
>
> diff -puN net/ipv4/tcp.c~ipv4-tcp-vma-waiter net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c~ipv4-tcp-vma-waiter 2026-04-29 11:18:51.870676498 -0700
> +++ b/net/ipv4/tcp.c 2026-04-29 11:18:51.874676652 -0700
> @@ -2171,27 +2171,18 @@ static void tcp_zc_finalize_rx_tstamp(st
> }
>
> static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> - unsigned long address,
> - bool *mmap_locked)
> + unsigned long address)
> {
> - struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> + struct vm_area_struct *vma = lock_vma_under_rcu_wait(mm, address);
>
> - if (vma) {
> - if (vma->vm_ops != &tcp_vm_ops) {
> - vma_end_read(vma);
> - return NULL;
> - }
> - *mmap_locked = false;
> - return vma;
> - }
> + if (!vma)
> + return NULL;
>
> - mmap_read_lock(mm);
> - vma = vma_lookup(mm, address);
> - if (!vma || vma->vm_ops != &tcp_vm_ops) {
> - mmap_read_unlock(mm);
> + if (vma->vm_ops != &tcp_vm_ops) {
> + vma_end_read(vma);
> return NULL;
> }
> - *mmap_locked = true;
> +
> return vma;
> }
>
> @@ -2212,7 +2203,6 @@ static int tcp_zerocopy_receive(struct s
> u32 seq = tp->copied_seq;
> u32 total_bytes_to_map;
> int inq = tcp_inq(sk);
> - bool mmap_locked;
> int ret;
>
> zc->copybuf_len = 0;
> @@ -2237,7 +2227,7 @@ static int tcp_zerocopy_receive(struct s
> return 0;
> }
>
> - vma = find_tcp_vma(current->mm, address, &mmap_locked);
> + vma = find_tcp_vma(current->mm, address);
> if (!vma)
> return -EINVAL;
>
> @@ -2319,10 +2309,7 @@ static int tcp_zerocopy_receive(struct s
> zc, total_bytes_to_map);
> }
> out:
> - if (mmap_locked)
> - mmap_read_unlock(current->mm);
> - else
> - vma_end_read(vma);
> + vma_end_read(vma);
> /* Try to copy straggler data. */
> if (!ret)
> copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> _
^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (4 preceding siblings ...)
2026-04-29 18:20 ` [PATCH 5/6] tcp: Remove mmap_lock fallback path Dave Hansen
@ 2026-04-29 18:20 ` Dave Hansen
2026-05-04 23:15 ` Edgecombe, Rick P
2026-04-29 18:22 ` [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (3 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:20 UTC (permalink / raw)
To: linux-kernel
Cc: Dave Hansen, Andrew Morton, Liam R. Howlett, linux-mm,
Lorenzo Stoakes, Shakeel Butt, Suren Baghdasaryan,
Vlastimil Babka
From: Dave Hansen <dave.hansen@linux.intel.com>
The shadow stack code needs to look at the VMA from which it is
reading a userspace "token" to ensure that the memory is shadow stack
memory. If it did not do this, it might read the token from
non-shadow-stack memory, which could result in a control flow hijack.
But that lookup requires two things:
* Looking at a VMA, which must be locked
* Touching userspace
That's a bit of a pain because mmap_lock cannot be held while
touching userspace. So the code has to drop the lock, touch userspace,
then re-acquire the lock and check if the VMA might have changed.
The current implementation does this with a combination of holding
mmap_lock and looping if the VMA might have changed. It works great.
But the lock_vma_under_rcu_wait() API is a little simpler and also
does not use mmap_lock in its fast path.
Switch to lock_vma_under_rcu_wait().
BTW, this does swap in a mmap_read_lock() for
mmap_read_lock_killable(). That obviously isn't ideal, but it's
trivially fixable with another variant of the helper. I'd appreciate
it if we could handwave that away for the moment. :)
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org
---
b/arch/x86/kernel/shstk.c | 47 ++++++++++++++++------------------------------
1 file changed, 17 insertions(+), 30 deletions(-)
diff -puN arch/x86/kernel/shstk.c~shstk-pop-rcu arch/x86/kernel/shstk.c
--- a/arch/x86/kernel/shstk.c~shstk-pop-rcu 2026-04-29 11:18:52.425697858 -0700
+++ b/arch/x86/kernel/shstk.c 2026-04-29 11:18:52.428697973 -0700
@@ -326,8 +326,9 @@ static int shstk_push_sigframe(unsigned
static int shstk_pop_sigframe(unsigned long *ssp)
{
+ struct vm_area_struct *vma;
unsigned long token_addr;
- unsigned int seq;
+ int err;
/*
* It is possible for the SSP to be off the end of a shadow stack by 4
@@ -338,35 +339,21 @@ static int shstk_pop_sigframe(unsigned l
if (!IS_ALIGNED(*ssp, 8))
return -EINVAL;
- do {
- struct vm_area_struct *vma;
- bool valid_vma;
- int err;
-
- if (mmap_read_lock_killable(current->mm))
- return -EINTR;
-
- vma = find_vma(current->mm, *ssp);
- valid_vma = vma && (vma->vm_flags & VM_SHADOW_STACK);
-
- /*
- * VMAs can change between get_shstk_data() and find_vma().
- * Watch for changes and ensure that 'token_addr' comes from
- * 'vma' by recording a seqcount.
- *
- * Ignore the return value of mmap_lock_speculate_try_begin()
- * because the mmap lock excludes the possibility of writers.
- */
- mmap_lock_speculate_try_begin(current->mm, &seq);
- mmap_read_unlock(current->mm);
-
- if (!valid_vma)
- return -EINVAL;
-
- err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
- if (err)
- return err;
- } while (mmap_lock_speculate_retry(current->mm, seq));
+ vma = lock_vma_under_rcu_wait(current->mm, *ssp);
+ if (!vma)
+ return -EINVAL;
+
+ if (!(vma->vm_flags & VM_SHADOW_STACK)) {
+ vma_end_read(vma);
+ return -EINVAL;
+ }
+
+ err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
+
+ vma_end_read(vma);
+
+ if (err)
+ return err;
/* Restore SSP aligned? */
if (unlikely(!IS_ALIGNED(token_addr, 8)))
_
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
2026-04-29 18:20 ` [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path Dave Hansen
@ 2026-05-04 23:15 ` Edgecombe, Rick P
2026-05-05 16:39 ` Dave Hansen
0 siblings, 1 reply; 27+ messages in thread
From: Edgecombe, Rick P @ 2026-05-04 23:15 UTC (permalink / raw)
To: linux-kernel@vger.kernel.org, dave.hansen@linux.intel.com
Cc: Liam.Howlett@oracle.com, linux-mm@kvack.org, ljs@kernel.org,
surenb@google.com, vbabka@kernel.org, shakeel.butt@linux.dev,
akpm@linux-foundation.org
On Wed, 2026-04-29 at 11:20 -0700, Dave Hansen wrote:
> + vma = lock_vma_under_rcu_wait(current->mm, *ssp);
> + if (!vma)
> + return -EINVAL;
> +
> + if (!(vma->vm_flags & VM_SHADOW_STACK)) {
> + vma_end_read(vma);
> + return -EINVAL;
> + }
> +
> + err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
Unfortunately, I think it won't work for the shadow stack case because of
the user access. I get this splat from the shadow stack selftests:
======================================================
WARNING: possible circular locking dependency detected
7.1.0-rc1+ #2936 Not tainted
------------------------------------------------------
test_shadow_sta/930 is trying to acquire lock:
ff32a05fbc6a1008 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x3c/0x80
but task is already holding lock:
ff32a05f4caf3c48 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xaf/0x2e0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (vm_lock){++++}-{0:0}:
lock_acquire+0xbd/0x2f0
__vma_start_exclude_readers+0x8d/0x1e0
__vma_start_write+0x56/0xe0
vma_expand+0x7e/0x390
relocate_vma_down+0x126/0x220
setup_arg_pages+0x269/0x430
load_elf_binary+0x3d1/0x1840
bprm_execve+0x2cf/0x730
kernel_execve+0xf6/0x160
kernel_init+0xb9/0x1c0
ret_from_fork+0x2eb/0x340
ret_from_fork_asm+0x1a/0x30
-> #0 (&mm->mmap_lock){++++}-{4:4}:
check_prev_add+0xf1/0xd00
__lock_acquire+0x14a8/0x1ac0
lock_acquire+0xbd/0x2f0
__might_fault+0x5b/0x80
restore_signal_shadow_stack+0xd6/0x270
__do_sys_rt_sigreturn+0xdf/0xf0
do_syscall_64+0x11c/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
other info that might help us debug this:
Possible unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  rlock(vm_lock);
                               lock(&mm->mmap_lock);
                               lock(vm_lock);
  rlock(&mm->mmap_lock);
*** DEADLOCK ***
1 lock held by test_shadow_sta/930:
#0: ff32a05f4caf3c48 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xaf/0x2e0
stack backtrace:
CPU: 18 UID: 0 PID: 930 Comm: test_shadow_sta Not tainted 7.1.0-rc1+ #2936
PREEMPT(full)
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl+0x68/0xa0
print_circular_bug+0x2ca/0x400
check_noncircular+0x12f/0x150
? __lock_acquire+0x49c/0x1ac0
check_prev_add+0xf1/0xd00
? reacquire_held_locks+0xe4/0x200
__lock_acquire+0x14a8/0x1ac0
lock_acquire+0xbd/0x2f0
? __might_fault+0x3c/0x80
? lock_is_held_type+0xa0/0x120
? __might_fault+0x3c/0x80
__might_fault+0x5b/0x80
? __might_fault+0x3c/0x80
restore_signal_shadow_stack+0xd6/0x270
__do_sys_rt_sigreturn+0xdf/0xf0
do_syscall_64+0x11c/0xf80
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x40212f
Code: 61 00 00 e8 73 f1 ff ff 48 8b 05 4c 61 00 00 31 d2 48 0f 38 f6 10 48 8b
44 24 08 64 48 2b 08
RSP: 002b:00007ffc286fb208 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007ff628b187b0
RDX: 0000000000000000 RSI: 00000000066492a0 RDI: 0000000000000000
RBP: 00007ffc286fb360 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000001 R14: 00007ff628b6c000 R15: 0000000000406e18
I guess the problem is the lock ordering. Not sure if there are any slow-path
avoidance details that could make this splat a false positive. But how about
this simpler munmap() case:
Shadow stack signal                            munmap()
-------------------                            --------
vma_start_read() (VM_SHADOW_STACK check)
                                               mmap_write_lock()
mmap_read_lock() (user fault) <- deadlock
                                               vma_start_write() <- deadlock
> +
> + vma_end_read(vma);
> +
> + if (err)
> + return err;
>
> /* Restore SSP aligned? */
> if (unlikely(!IS_ALIGNED(token_addr, 8)))
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
2026-05-04 23:15 ` Edgecombe, Rick P
@ 2026-05-05 16:39 ` Dave Hansen
2026-05-08 20:39 ` Lorenzo Stoakes
0 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-05-05 16:39 UTC (permalink / raw)
To: Edgecombe, Rick P, linux-kernel@vger.kernel.org,
dave.hansen@linux.intel.com
Cc: Liam.Howlett@oracle.com, linux-mm@kvack.org, ljs@kernel.org,
surenb@google.com, vbabka@kernel.org, shakeel.butt@linux.dev,
akpm@linux-foundation.org
On 5/4/26 16:15, Edgecombe, Rick P wrote:
>
> I guess the problem is the lock ordering. Not sure if there are any slow-path
> avoidance details that could make this splat a false positive. But how about
> this simpler munmap() case:
>
> Shadow stack signal                            munmap()
> -------------------                            --------
> vma_start_read() (VM_SHADOW_STACK check)
>                                                mmap_write_lock()
> mmap_read_lock() (user fault) <- deadlock
>                                                vma_start_write() <- deadlock
It's a little more complicated than that in practice, but I think you're
right.
I'm not sure when this would happen in practice because the fault is
actually on the VMA that's being held for read. So I think another
writer would have had to sneak in there and zap the VMA.
The funny thing is that the fault handler is really just trying to find
the VMA. The thing causing the fault *has* the VMA. So it's as simple as
just passing the VMA down into the fault handler, right? How hard could
it be? ;)
There are still games to play, but they all involve dropping locks and
retrying, like:
retry:
vma = lock_vma_under_rcu()
// muck with VMA
pagefault_disable() // avoid deadlock
ret = copy_from_user()
pagefault_enable()
vma_end_read();
if (!ret)
return SUCCESS;
mmap_read_lock()
vma = vma_lookup()
mmap_read_unlock() // avoid deadlock before touching userspace
// check for valid VMA to avoid looping when there is no VMA
if (!vma)
return -ERRNO;
// uh oh, slow path, something faulted
get_user_pages()??
//or
copy_from_user() without the VMA??
goto retry;
This also needs some very careful thought, but something like this
should work, where we avoid fault handling (and lock taking) in the
actual #PF and do it in a context where the VMA lock is held:
vma = lock_vma_under_rcu();
pagefault_disable() // avoid deadlock
while (1) {
ret = copy_from_user()
if (!ret)
break;
handle_mm_fault(vma, address, FAULT_FLAG_VMA_LOCK...);
};
pagefault_enable()
vma_end_read();
That's effectively just short-circuiting the #PF code which does the:
vma = lock_vma_under_rcu(mm, address);
...
fault = handle_mm_fault(vma, address, ... FAULT_FLAG_VMA_LOCK)
sequence _itself_.
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
2026-05-05 16:39 ` Dave Hansen
@ 2026-05-08 20:39 ` Lorenzo Stoakes
0 siblings, 0 replies; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 20:39 UTC (permalink / raw)
To: Dave Hansen
Cc: Edgecombe, Rick P, linux-kernel@vger.kernel.org,
dave.hansen@linux.intel.com, Liam.Howlett@oracle.com,
linux-mm@kvack.org, surenb@google.com, vbabka@kernel.org,
shakeel.butt@linux.dev, akpm@linux-foundation.org
On Tue, May 05, 2026 at 09:39:09AM -0700, Dave Hansen wrote:
> On 5/4/26 16:15, Edgecombe, Rick P wrote:
> >
> > I guess the problem is the lock ordering. Not sure if there are any slow-path
> > avoidance details that could make this splat a false positive. But how about
> > this simpler munmap() case:
> >
> > Shadow stack signal                            munmap()
> > -------------------                            --------
> > vma_start_read() (VM_SHADOW_STACK check)
> >                                                mmap_write_lock()
> > mmap_read_lock() (user fault) <- deadlock
> >                                                vma_start_write() <- deadlock
>
> It's a little more complicated than that in practice, but I think you're
> right.
>
> I'm not sure when this would happen in practice because the fault is
> actually on the VMA that's being held for read. So I think another
> writer would have had to sneak in there and zap the VMA.
Honestly I think any workaround is just going to be more complicated than the
existing code, which sort of defeats the purpose of the series.
There's not really a way to speculate with a VMA seqnum because you'd have to be
able to observe its vma->vm_lock_seq, and to do that you'd have to find it again
immediately afterwards, so you'd have looked it up twice and taken the lock twice
only to confirm it's the same damn thing :)
The issue is that a page fault on the same thread is always going to risk an
mmap read lock being taken (possibly due to I/O waiting and fault retry, for
one). And faults/zaps are inherently racy and neither acquires the write lock, so
the read lock doesn't preclude them.
And you can't really disable page faults because you're potentially relying on
them to populate what you're touching...
Also there's some tricky stuff done when the stack is initially set up that can
cause a headache as well; see relocate_vma_down() to make life more painful.
So I think the existing code is simpler.
It doesn't mean it isn't still useful to move towards having VMA locks
everywhere though :) unless Suren or others can find a flaw with that...
>
> The funny thing is that the fault handler is really just trying to find
> the VMA. The thing causing the fault *has* the VMA. So it's as simple as
> just passing the VMA down into the fault handler, right? How hard could
> it be? ;)
>
> There are still games to play, but they all involve dropping locks and
> retrying, like:
>
> retry:
> vma = lock_vma_under_rcu()
> // muck with VMA
> pagefault_disable() // avoid deadlock
> ret = copy_from_user()
> pagefault_enable()
> vma_end_read();
>
> if (!ret)
> return SUCCESS;
>
> mmap_read_lock()
> vma = vma_lookup()
> mmap_read_unlock() // avoid deadlock before touching userspace
> // check for valid VMA to avoid looping when there is no VMA
> if (!vma)
> return -ERRNO;
>
> // uh oh, slow path, something faulted
> get_user_pages()??
> //or
> copy_from_user() without the VMA??
>
> goto retry;
>
>
> This also needs some very careful thought, but something like this
> should work, where we avoid fault handling (and lock taking) in the
> actual #PF and do it in a context where the VMA lock is held:
>
> vma = lock_vma_under_rcu();
> pagefault_disable() // avoid deadlock
>
> while (1) {
> ret = copy_from_user()
> if (!ret)
> break;
> handle_mm_fault(vma, address, FAULT_FLAG_VMA_LOCK...);
> };
>
> pagefault_enable()
> vma_end_read();
>
> That's effectively just short-circuiting the #PF code which does the:
>
> vma = lock_vma_under_rcu(mm, address);
> ...
> fault = handle_mm_fault(vma, address, ... FAULT_FLAG_VMA_LOCK)
>
> sequence _itself_.
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (5 preceding siblings ...)
2026-04-29 18:20 ` [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path Dave Hansen
@ 2026-04-29 18:22 ` Dave Hansen
2026-04-30 8:11 ` Lorenzo Stoakes
2026-04-30 7:55 ` [syzbot ci] " syzbot ci
` (2 subsequent siblings)
9 siblings, 1 reply; 27+ messages in thread
From: Dave Hansen @ 2026-04-29 18:22 UTC (permalink / raw)
To: Dave Hansen, linux-kernel
Cc: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
BTW, this is *ENTIRELY* an [RFC]. It's just not tagged properly.
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-04-29 18:22 ` [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
@ 2026-04-30 8:11 ` Lorenzo Stoakes
2026-04-30 17:17 ` Suren Baghdasaryan
0 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-04-30 8:11 UTC (permalink / raw)
To: Dave Hansen
Cc: Dave Hansen, linux-kernel, Andrew Morton, Liam R. Howlett,
linux-mm, Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On Wed, Apr 29, 2026 at 11:22:28AM -0700, Dave Hansen wrote:
> BTW, this is *ENTIRELY* an [RFC]. It's just not tagged properly.
Was going to say :)
Not going to be able to get to this until after LSF... :) Likely the same for
Suren also.
Cheers, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-04-30 8:11 ` Lorenzo Stoakes
@ 2026-04-30 17:17 ` Suren Baghdasaryan
2026-04-30 17:20 ` Dave Hansen
0 siblings, 1 reply; 27+ messages in thread
From: Suren Baghdasaryan @ 2026-04-30 17:17 UTC (permalink / raw)
To: Lorenzo Stoakes
Cc: Dave Hansen, Dave Hansen, linux-kernel, Andrew Morton,
Liam R. Howlett, linux-mm, Shakeel Butt, Vlastimil Babka
On Thu, Apr 30, 2026 at 1:11 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>
> On Wed, Apr 29, 2026 at 11:22:28AM -0700, Dave Hansen wrote:
> > BTW, this is *ENTIRELY* an [RFC]. It's just not tagged properly.
>
> Was going to say :)
>
> Not going to be able to get to this until after LSF... :) Likely the same for
> Suren also.
Yeah, sorry. Trying to wrap up all the urgent stuff before the trip. I
might be able to review the patches later this week before the
conference starts, but can't promise.
Thanks,
Suren.
>
> Cheers, Lorenzo
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-04-30 17:17 ` Suren Baghdasaryan
@ 2026-04-30 17:20 ` Dave Hansen
0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2026-04-30 17:20 UTC (permalink / raw)
To: Suren Baghdasaryan, Lorenzo Stoakes
Cc: Dave Hansen, linux-kernel, Andrew Morton, Liam R. Howlett,
linux-mm, Shakeel Butt, Vlastimil Babka
On 4/30/26 10:17, Suren Baghdasaryan wrote:
> On Thu, Apr 30, 2026 at 1:11 AM Lorenzo Stoakes <ljs@kernel.org> wrote:
>> On Wed, Apr 29, 2026 at 11:22:28AM -0700, Dave Hansen wrote:
>>> BTW, this is *ENTIRELY* an [RFC]. It's just not tagged properly.
>> Was going to say 🙂
>>
>> Not going to be able to get to this until after LSF... 🙂 Likely the same for
>> Suren also.
> Yeah, sorry. Trying to wrap up all the urgent stuff before the trip. I
> might be able to review the patches later this week before the
> conference starts, but can't promise.
Seriously, don't worry about rushing this. After the conference is
perfectly fine with me. There are a few things that Sashiko complained
about that need to get fixed up anyway.
^ permalink raw reply [flat|nested] 27+ messages in thread
* [syzbot ci] Re: mm: Make per-VMA locks available in all builds
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (6 preceding siblings ...)
2026-04-29 18:22 ` [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
@ 2026-04-30 7:55 ` syzbot ci
2026-04-30 16:59 ` Dave Hansen
[not found] ` <20260430072053.e0be1b431bcff02831f07e9d@linux-foundation.org>
2026-05-08 16:52 ` Lorenzo Stoakes
9 siblings, 1 reply; 27+ messages in thread
From: syzbot ci @ 2026-04-30 7:55 UTC (permalink / raw)
To: akpm, dave.hansen, liam.howlett, linux-kernel, linux-mm, ljs,
shakeel.butt, surenb, vbabka
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] mm: Make per-VMA locks available in all builds
https://lore.kernel.org/all/20260429181954.F50224AE@davehans-spike.ostc.intel.com
* [PATCH 1/6] mm: Make per-VMA locks available universally
* [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock
* [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
* [PATCH 4/6] binder: Remove mmap_lock fallback
* [PATCH 5/6] tcp: Remove mmap_lock fallback path
* [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
and found the following issue:
WARNING in mbind_range
Full report is available here:
https://ci.syzbot.org/series/374f338e-4b3b-4645-871c-78964f944bbd
***
WARNING in mbind_range
tree: torvalds
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux
base: 57b8e2d666a31fa201432d58f5fe3469a0dd83ba
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/786ac31f-0f5e-4ceb-88c7-45f4bee79d60/config
syz repro: https://ci.syzbot.org/findings/4b270176-3e48-4ac7-8ddb-f326d6883d93/syz_repro
pgoff 200000000 file 0000000000000000 private_data 0000000000000000
flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)
------------[ cut here ]------------
1
WARNING: ./include/linux/mmap_lock.h:332 at vma_assert_write_locked include/linux/mmap_lock.h:332 [inline], CPU#0: syz.2.19/5876
WARNING: ./include/linux/mmap_lock.h:332 at vma_replace_policy mm/mempolicy.c:1016 [inline], CPU#0: syz.2.19/5876
WARNING: ./include/linux/mmap_lock.h:332 at mbind_range+0x57a/0x810 mm/mempolicy.c:1063, CPU#0: syz.2.19/5876
Modules linked in:
CPU: 0 UID: 0 PID: 5876 Comm: syz.2.19 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:vma_assert_write_locked include/linux/mmap_lock.h:332 [inline]
RIP: 0010:vma_replace_policy mm/mempolicy.c:1016 [inline]
RIP: 0010:mbind_range+0x57a/0x810 mm/mempolicy.c:1063
Code: 97 ff e9 b2 fb ff ff e8 a4 4a 97 ff 90 0f 0b 90 e9 14 fe ff ff e8 96 4a 97 ff 4c 89 ff e8 0e b5 f9 fe c6 05 9c 2e ec 0d 01 90 <0f> 0b 90 4d 85 e4 0f 85 48 fe ff ff e8 75 4a 97 ff 31 db 4d 8d 77
RSP: 0018:ffffc90003bc7c78 EFLAGS: 00010292
RAX: 000000000000011f RBX: 000000000000000b RCX: a410c902b7e34800
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: 0000000000000009 R08: ffffc90003bc7967 R09: 1ffff92000778f2c
R10: dffffc0000000000 R11: fffff52000778f2d R12: ffff8881012c9e00
R13: dffffc0000000000 R14: ffff888115b99bf8 R15: ffff8881162fa300
FS: 00007fa09187c6c0(0000) GS:ffff88818dc93000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b33b63fff CR3: 0000000115bda000 CR4: 00000000000006f0
Call Trace:
<TASK>
do_mbind mm/mempolicy.c:1560 [inline]
kernel_mbind mm/mempolicy.c:1757 [inline]
__do_sys_mbind mm/mempolicy.c:1831 [inline]
__se_sys_mbind+0xad4/0x10f0 mm/mempolicy.c:1827
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0x15f/0xf80 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fa09099cdd9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fa09187c028 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed
RAX: ffffffffffffffda RBX: 00007fa090c15fa0 RCX: 00007fa09099cdd9
RDX: 0000000000000001 RSI: 0000000000600000 RDI: 0000200000000000
RBP: 00007fa090a32d69 R08: 0000000000000000 R09: 0000000000000003
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fa090c16038 R14: 00007fa090c15fa0 R15: 00007ffff596d0f8
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
To test a patch for this bug, please reply with `#syz test`
(should be on a separate line).
The patch should be attached to the email.
Note: arguments like custom git repos and branches are not supported.
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [syzbot ci] Re: mm: Make per-VMA locks available in all builds
2026-04-30 7:55 ` [syzbot ci] " syzbot ci
@ 2026-04-30 16:59 ` Dave Hansen
0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2026-04-30 16:59 UTC (permalink / raw)
To: syzbot ci, akpm, dave.hansen, liam.howlett, linux-kernel,
linux-mm, ljs, shakeel.butt, surenb, vbabka
Cc: syzbot, syzkaller-bugs
On 4/30/26 00:55, syzbot ci wrote:
> and found the following issue:
> WARNING in mbind_range
>
> Full report is available here:
> https://ci.syzbot.org/series/374f338e-4b3b-4645-871c-78964f944bbd
>
> ***
>
> WARNING in mbind_range
I left a whole bunch of #ifdef CONFIG_PER_VMA_LOCK cruft around in v1. I
suspect some of the debugging code ended up in a weird state in a config
that I didn't test.
I'll try to reproduce this splat, though. It looks straightforward enough.
^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <20260430072053.e0be1b431bcff02831f07e9d@linux-foundation.org>]
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
[not found] ` <20260430072053.e0be1b431bcff02831f07e9d@linux-foundation.org>
@ 2026-04-30 16:52 ` Dave Hansen
0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2026-04-30 16:52 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka, LKML
... adding all the cc's back.
On 4/30/26 07:20, Andrew Morton wrote:
> On Wed, 29 Apr 2026 11:19:54 -0700 Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
>> When working on some x86 shadow stack code, it was a real pain to
>> avoid causing recursive locking problems with mmap_lock. One way
>> to avoid those was to avoid mmap_lock and use per-VMA locks instead.
>> They are great, but they are not available in all configs which
>> makes them unusable in generic code, or if you want to completely
>> avoid mmap_lock.
>
> Did you see the AI review?
> https://sashiko.dev/#/patchset/20260429181954.F50224AE@davehans-spike.ostc.intel.com
I just went through it. There was some absolutely valid stuff in there
like a bunch of CONFIG_PER_VMA_LOCK references needing to be cleaned up.
It also made some good points about the binder shrinker sites that I
think I cleaned up for v2.
There are three very valid structural problems that it's concerned about.
First is that lock_vma_under_rcu_wait() doesn't use
mmap_read_lock_killable(). It probably needs to, or at least there would
need to be killable and non-killable variants. That's easy enough to do
if folks agree that this is overall something that should go forward.
I'd prefer to hand wave it away for the moment, though.
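Just to illustrate the shape of it (the helper name and calling convention
here are made up, and this is completely untested), the killable flavor
would only differ in the slow path:

	struct vm_area_struct *
	lock_vma_under_rcu_wait_killable(struct mm_struct *mm,
					 unsigned long address)
	{
		struct vm_area_struct *vma;

	retry:
		/* Fast path: no mmap_lock involvement at all: */
		vma = lock_vma_under_rcu(mm, address);
		if (vma)
			return vma;

		/* Slow path: wait for the writer to finish, killably: */
		if (mmap_read_lock_killable(mm))
			return ERR_PTR(-EINTR);
		vma = vma_lookup(mm, address);
		mmap_read_unlock(mm);

		/* No VMA at 'address', so nothing to wait for: */
		if (!vma)
			return NULL;

		goto retry;
	}

(Mixing NULL and ERR_PTR() returns like that is just for brevity here.)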
Second, it brings up concerns about lock_vma_under_rcu_wait() deadlocks
in the face of other per-VMA or mmap_lock holders. This is very valid,
but it's inherent for users of mmap_lock. I think it's just a
documentation issue.
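Concretely, I'm thinking of little more than a comment on the helper along
these lines (the wording is just a strawman):

	/*
	 * lock_vma_under_rcu_wait() can block on mmap_lock in its slow
	 * path.  Callers must not already hold mmap_lock or any per-VMA
	 * lock, and must not call this from the page fault path itself,
	 * or they risk deadlocking against VMA writers.
	 */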
Third, Sashiko is _very_ peeved about the lock_vma_under_rcu_wait() goto
loop. Broadly, it's concerned about fairness and livelocks in the face
of userspace being able to compel 'goto retry' to happen forever. It's a
valid theoretical concern for sure. I'm less convinced that it will be a
problem in practice, and I should probably hack together a torture test
to see how many retries actually happen.
The other way to fix it more robustly would be to acquire the vma
reference under the existing mmap_read_lock(). I _think_ it's just a
couple extra lines of code, but I haven't done the legwork to flesh out
how that would look.
But the key questions for this series remain:
1. Should/can per-VMA locking be available in all configs?
2. Is *a* lock_vma_under_rcu_wait() implementation feasible and useful?
The implementation in this series is highly imperfect. Is there a
chance of a better one or is it just impossible?
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-04-29 18:19 [PATCH 0/6] mm: Make per-VMA locks available in all builds Dave Hansen
` (8 preceding siblings ...)
[not found] ` <20260430072053.e0be1b431bcff02831f07e9d@linux-foundation.org>
@ 2026-05-08 16:52 ` Lorenzo Stoakes
2026-05-08 17:01 ` Dave Hansen
9 siblings, 1 reply; 27+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:52 UTC (permalink / raw)
To: Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
I'm guessing this is kinda an RFC? :P
On Wed, Apr 29, 2026 at 11:19:54AM -0700, Dave Hansen wrote:
> tl;dr: I hope I'm not missing something big here. The basic
> observation here is that forcing code to account for per-VMA lock
> failure adds a lot of complexity. This series theorizes that with
> some Kconfig changes and a new helper, many callers can avoid writing
> code that falls back to mmap_lock.
In general very much in support of this!
It'd be great to just know that this is available and frankly I think it's a
critical part of the kernel.
Obviously Suren needs to have a look through, most important of all :)
>
> --
>
> When working on some x86 shadow stack code, it was a real pain to
> avoid causing recursive locking problems with mmap_lock. One way
> to avoid those was to avoid mmap_lock and use per-VMA locks instead.
> They are great, but they are not available in all configs which
> makes them unusable in generic code, or if you want to completely
> avoid mmap_lock.
Yeah, lock ordering is a pain.
>
> Make per-VMA locks available in all configs. Right now, they are
> only available on select architectures when SMP and MMU are enabled.
> But all of the primitives that per-VMA locks are built on (RCU, maple
> trees, refcounts) work just fine without SMP or MMU.
>
> Their only real downside is that they make VMAs a wee bit bigger
> on !MMU and !SMP builds.
>
> The upside is much cleaner code, lower complexity and less #ifdeffery.
>
> Building on top of universally-available per-VMA locks, introduce a
> new helper. Since the new API does not require callers to have a
> fallback to mmap_lock, it's much easier to use. Callers could
> potentially replace this very common kernel idiom:
>
> mmap_read_lock(mm);
> vma = vma_lookup()
> // fiddle with vma
> mmap_read_unlock(mm);
>
> with:
>
> vma = lock_vma_under_rcu_wait(mm, address);
I will look at what you're proposing but this seems a bit like something I
proposed at LSF (but was probably not the right solution for what was under
discussion).
Doing this 'right' would require quite a bit of engineering effort. The VMA
locks are pretty bloody complicated :) so we have to be careful not to spread
the complexity around too much.
But I guess you could 'wait' by doing it in the slow path and then using
vma_start_read_locked()...
Of course that'd not help you with any lock inversions though!
Anyway need to read the code :)
> // fiddle with vma
> vma_end_read(vma);
>
> Which avoids mmap_lock entirely in the fast path.
Yeah it's nice!
>
> Things I think needs more discussion:
> * The new helper has a horrible name. Suggestions are very welcome.
> * I'm not very confident that this approach completely avoids the
> deadlock issues that arise from touching userspace while holding
> mm-related locks.
Yeah we have to be careful...
> * Can the helper avoid the goto, maybe by taking the VMA refcount
> while holding mmap_lock?
Surely that'd defeat the purpose of VMA locks though? You'd hold the mmap lock
for less time but you're still contending vs. _any_ VMA write locks whilst
trying to get a VMA read lock?
Unless it's on a slow path... hmm :)
> * mm_struct and vm_area_struct "bloat"
Probably not a problem really. For any modern system you're using the fields.
>
> I've included a couple patches where I think the new helper really
> makes the code nicer.
>
> I'm keeping the cc list on the short side for now because I'm not
> actually proposing that we go ahead and do the ipv4 changes, for
> example.
Ack!
>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
>
> arch/arm/Kconfig | 1
> arch/arm64/Kconfig | 1
> arch/loongarch/Kconfig | 1
> arch/powerpc/platforms/powernv/Kconfig | 1
> arch/powerpc/platforms/pseries/Kconfig | 1
> arch/riscv/Kconfig | 1
> arch/s390/Kconfig | 1
> arch/x86/Kconfig | 2 -
> arch/x86/kernel/shstk.c | 47 +++++++++++-------------------
> drivers/android/binder_alloc.c | 39 ++++++-------------------
> fs/proc/internal.h | 2 -
> fs/proc/task_mmu.c | 51 ---------------------------------
> include/linux/mm.h | 12 -------
> include/linux/mm_types.h | 7 ----
> include/linux/mmap_lock.h | 50 +-------------------------------
> kernel/fork.c | 2 -
> mm/Kconfig | 13 --------
> mm/mmap_lock.c | 45 +++++++++++++++++++++++++++--
> net/ipv4/tcp.c | 31 +++++---------------
> 19 files changed, 82 insertions(+), 226 deletions(-)
^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
2026-05-08 16:52 ` Lorenzo Stoakes
@ 2026-05-08 17:01 ` Dave Hansen
0 siblings, 0 replies; 27+ messages in thread
From: Dave Hansen @ 2026-05-08 17:01 UTC (permalink / raw)
To: Lorenzo Stoakes, Dave Hansen
Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
On 5/8/26 09:52, Lorenzo Stoakes wrote:
...
>> * Can the helper avoid the goto, maybe by taking the VMA refcount
>> while holding mmap_lock?
>
> Surely that'd defeat the purpose of VMA locks though? you'd hold the mmap lock
> for less time but you're still contending vs. _any_ VMA write locks whilst
> trying to get a VMA read lock?
>
> Unless it's on a slow path... hmm :)
Yup. It's in a slow path. The example helper I had here does:
retry:
vma = lock_vma_under_rcu(mm, address);
if (vma)
return vma;
mmap_read_lock(mm);
vma = vma_lookup(mm, address);
mmap_read_unlock(mm);
goto retry;
It avoids mmap_lock in the common, fast case. I was hoping to replace it
with something like:
vma = lock_vma_under_rcu(mm, address);
if (vma)
return vma;
mmap_read_lock(mm);
vma = vma_lookup(mm, address);
vma_start_read(vma->vm_mm, vma); // Is this safe?
mmap_read_unlock(mm);
return vma;
Which still uses mmap_lock, but avoids the goto. I'm pretty sure the
first one doesn't have any locking problems. The second one, I need to
think about a _lot_ more.
^ permalink raw reply [flat|nested] 27+ messages in thread