The Linux Kernel Mailing List
* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
       [not found] ` <20260429182005.00BF70D8@davehans-spike.ostc.intel.com>
@ 2026-05-04 23:15   ` Edgecombe, Rick P
  2026-05-05 16:39     ` Dave Hansen
  0 siblings, 1 reply; 13+ messages in thread
From: Edgecombe, Rick P @ 2026-05-04 23:15 UTC (permalink / raw)
  To: linux-kernel@vger.kernel.org, dave.hansen@linux.intel.com
  Cc: Liam.Howlett@oracle.com, linux-mm@kvack.org, ljs@kernel.org,
	surenb@google.com, vbabka@kernel.org, shakeel.butt@linux.dev,
	akpm@linux-foundation.org

On Wed, 2026-04-29 at 11:20 -0700, Dave Hansen wrote:
> +	vma = lock_vma_under_rcu_wait(current->mm, *ssp);
> +	if (!vma)
> +		return -EINVAL;
> +
> +	if (!(vma->vm_flags & VM_SHADOW_STACK)) {
> +		vma_end_read(vma);
> +		return -EINVAL;
> +	}
> +
> +	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);

Unfortunately, I don't think it will work for the shadow stack case because of
the user access. I get this splat from the shadow stack selftests:

======================================================
 WARNING: possible circular locking dependency detected
 7.1.0-rc1+ #2936 Not tainted
 ------------------------------------------------------
 test_shadow_sta/930 is trying to acquire lock:
 ff32a05fbc6a1008 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x3c/0x80
 
               but task is already holding lock:
 ff32a05f4caf3c48 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xaf/0x2e0
 
               which lock already depends on the new lock.

 
               the existing dependency chain (in reverse order) is:
 
               -> #1 (vm_lock){++++}-{0:0}:
        lock_acquire+0xbd/0x2f0
        __vma_start_exclude_readers+0x8d/0x1e0
        __vma_start_write+0x56/0xe0
        vma_expand+0x7e/0x390
        relocate_vma_down+0x126/0x220
        setup_arg_pages+0x269/0x430
        load_elf_binary+0x3d1/0x1840
        bprm_execve+0x2cf/0x730
        kernel_execve+0xf6/0x160
        kernel_init+0xb9/0x1c0
        ret_from_fork+0x2eb/0x340
        ret_from_fork_asm+0x1a/0x30
 
               -> #0 (&mm->mmap_lock){++++}-{4:4}:
        check_prev_add+0xf1/0xd00
        __lock_acquire+0x14a8/0x1ac0
        lock_acquire+0xbd/0x2f0
        __might_fault+0x5b/0x80
        restore_signal_shadow_stack+0xd6/0x270
        __do_sys_rt_sigreturn+0xdf/0xf0
        do_syscall_64+0x11c/0xf80
        entry_SYSCALL_64_after_hwframe+0x77/0x7f
 
               other info that might help us debug this:

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   rlock(vm_lock);
                                lock(&mm->mmap_lock);
                                lock(vm_lock);
   rlock(&mm->mmap_lock);
 
                *** DEADLOCK ***

 1 lock held by test_shadow_sta/930:
  #0: ff32a05f4caf3c48 (vm_lock){++++}-{0:0}, at: lock_vma_under_rcu+0xaf/0x2e0
 
               stack backtrace:
 CPU: 18 UID: 0 PID: 930 Comm: test_shadow_sta Not tainted 7.1.0-rc1+ #2936 PREEMPT(full)
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Call Trace:
  <TASK>
  dump_stack_lvl+0x68/0xa0
  print_circular_bug+0x2ca/0x400
  check_noncircular+0x12f/0x150
  ? __lock_acquire+0x49c/0x1ac0
  check_prev_add+0xf1/0xd00
  ? reacquire_held_locks+0xe4/0x200
  __lock_acquire+0x14a8/0x1ac0
  lock_acquire+0xbd/0x2f0
  ? __might_fault+0x3c/0x80
  ? lock_is_held_type+0xa0/0x120
  ? __might_fault+0x3c/0x80
  __might_fault+0x5b/0x80
  ? __might_fault+0x3c/0x80
  restore_signal_shadow_stack+0xd6/0x270
  __do_sys_rt_sigreturn+0xdf/0xf0
  do_syscall_64+0x11c/0xf80
  entry_SYSCALL_64_after_hwframe+0x77/0x7f
 RIP: 0033:0x40212f
 Code: 61 00 00 e8 73 f1 ff ff 48 8b 05 4c 61 00 00 31 d2 48 0f 38 f6 10 48 8b 44 24 08 64 48 2b 08
 RSP: 002b:00007ffc286fb208 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007ff628b187b0
 RDX: 0000000000000000 RSI: 00000000066492a0 RDI: 0000000000000000
 RBP: 00007ffc286fb360 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
 R13: 0000000000000001 R14: 00007ff628b6c000 R15: 0000000000406e18


I guess the problem is the lock ordering. Not sure if there are any slow path
avoidance details that could make this splat a false positive. But how about
this simpler munmap() case:

Shadow stack signal                          munmap()
-------------------                          --------
vma_start_read() (VM_SHADOW_STACK check)
                                             mmap_write_lock()
mmap_read_lock() (user fault) <- deadlock
                                             vma_start_write() <-deadlock
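
Spelled out against the hunk quoted above (simplified, error handling
trimmed):

	vma = lock_vma_under_rcu_wait(current->mm, *ssp); /* VMA read lock taken */
	/* ... VM_SHADOW_STACK check ... */
	err = get_shstk_data(&token_addr, (unsigned long __user *)*ssp);
		/* -> copy_from_user() -> #PF -> mmap_read_lock() */
	vma_end_read(vma);

Once munmap() is parked in mmap_write_lock(), the fault's mmap_read_lock()
queues behind that writer, while munmap()'s vma_start_write() waits for our
vma_end_read(). Classic ABBA.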



> +
> +	vma_end_read(vma);
> +
> +	if (err)
> +		return err;
>  
>  	/* Restore SSP aligned? */
>  	if (unlikely(!IS_ALIGNED(token_addr, 8)))


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
  2026-05-04 23:15   ` [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path Edgecombe, Rick P
@ 2026-05-05 16:39     ` Dave Hansen
  2026-05-08 20:39       ` Lorenzo Stoakes
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Hansen @ 2026-05-05 16:39 UTC (permalink / raw)
  To: Edgecombe, Rick P, linux-kernel@vger.kernel.org,
	dave.hansen@linux.intel.com
  Cc: Liam.Howlett@oracle.com, linux-mm@kvack.org, ljs@kernel.org,
	surenb@google.com, vbabka@kernel.org, shakeel.butt@linux.dev,
	akpm@linux-foundation.org

On 5/4/26 16:15, Edgecombe, Rick P wrote:
> 
> I guess the problem is the lock ordering. Not sure if there are any slow path
> avoidance details that could make this splat a false positive. But how about
> this simpler munmap() case:
> 
> Shadow stack signal                          munmap()
> -------------------                          --------
> vma_start_read() (VM_SHADOW_STACK check)
>                                              mmap_write_lock()
> mmap_read_lock() (user fault) <- deadlock
>                                              vma_start_write() <-deadlock

It's a little more complicated than that in practice, but I think you're
right.

I'm not sure when this would happen in practice because the fault is
actually on the VMA that's being held for read. So I think another
writer would have had to sneak in there and zap the VMA.

The funny thing is that the fault handler is really just trying to find
the VMA. The thing causing the fault *has* the VMA. So it's as simple as
just passing the VMA down into the fault handler, right? How hard could
it be? ;)

There are still games to play, but they all involve dropping locks and
retrying, like:

retry:
	vma = lock_vma_under_rcu()
	// muck with VMA
	pagefault_disable() // avoid deadlock
	ret = copy_from_user()
	pagefault_enable()
	vma_end_read();

	if (!ret)
		return SUCCESS;

	mmap_read_lock()
	vma = vma_lookup()
	mmap_read_unlock() // avoid deadlock before touching userspace
	// check for valid VMA to avoid looping when there is no VMA
	if (!vma)
		return -ERRNO;

	// uh oh, slow path, something faulted
	get_user_pages()??
	//or
	copy_from_user() without the VMA??

	goto retry;


This also needs some very careful thought, but something like this
should work, where we avoid fault handling (and lock taking) in the
actual #PF and do it in a context where the VMA lock is held:

	vma = lock_vma_under_rcu();
	pagefault_disable() // avoid deadlock

	while (1) {
		ret = copy_from_user()
		if (!ret)
			break;
		handle_mm_fault(vma, address, FAULT_FLAG_VMA_LOCK...);
	};

	pagefault_enable()
	vma_end_read();

That's effectively just short-circuiting the #PF code which does the:

        vma = lock_vma_under_rcu(mm, address);
	...
        fault = handle_mm_fault(vma, address, ... FAULT_FLAG_VMA_LOCK)

sequence _itself_.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
       [not found] ` <20260429181955.0C443845@davehans-spike.ostc.intel.com>
@ 2026-05-08 10:12   ` David Hildenbrand (Arm)
  2026-05-08 10:58     ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 10:12 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On 4/29/26 20:19, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The per-VMA locks have been around for several years. They've had some
> bugs worked out of them and have seen quite wide use. However, they
> are still only available when architectures explicitly enable them.
> Remove the conditional compilation around the per-VMA locks, making
> them available on all architectures and configs.

Yes, we should really just make it a fixed part of the kernel design now.

> 
> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
> when the architecture started using per-VMA locks in the fault
> handler. But, contrary to the naming, the Kconfig option does not
> really indicate whether the architecture supports per-VMA locks or
> not. It is more of a marker for whether the architecture is likely to
> benefit from per-VMA locks.
> 
> To me, the most important side-effect of universal availability
> is letting per-VMA locks be used in SMP=n configs. This lets us use
> per-VMA locking in all x86 code without fallbacks.
> 
> Overall, this just generally makes the kernel simpler. Just look at
> the diffstat. It also opens the door to users that want to use the
> per-VMA locks in common code. Doing *that* can bring additional
> simplifications.
> 
> The downside of this is adding some fields to vm_area_struct and
> mm_struct. 

I'd assume most distributions would already enable it.

mm_struct is very likely not a problem.

On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.

I'd expect that on most 64bit configs we usually end up with 192 bytes today.

Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
unlucky to jump between sizes on most configs.

Maybe on 32bit? Not sure if anybody would really notice.
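
(If anybody wants to check their own config: pahole on a vmlinux with
debug info shows the layout, or a throwaway one-liner prints it at
boot. Obviously not something to merge:)

	pr_info("sizeof(struct vm_area_struct) = %zu\n",
		sizeof(struct vm_area_struct));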

> I suspect there are some very simple ways to implement the
> per-VMA locks that don't require any additional fields, especially if
> such an approach was limited to SMP=n configs*. For now, do the
> simplest thing: use the same implementation everywhere.
> 
>  * For example, since SMP=n configs don't care much about scalability or
>    false sharing, there could be a single, global VMA seqcount that is
>    bumped when any VMA is modified instead of having space in each VMA
>    for a seqcount.


I'd do that only if we actually determine this to be a problem.
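
If we ever did, I'd expect the shape to be roughly this (untested
sketch, names invented, preemption/lockdep details ignored):

	/* One global seqcount shared by all VMAs instead of one per VMA: */
	static seqcount_t global_vma_seq = SEQCNT_ZERO(global_vma_seq);

	static inline void vma_modify_begin(void)
	{
		/* Invalidates every speculative VMA reader in the system: */
		write_seqcount_begin(&global_vma_seq);
	}

	static inline void vma_modify_end(void)
	{
		write_seqcount_end(&global_vma_seq);
	}

Readers would then sample read_seqcount_begin(&global_vma_seq) instead
of a per-VMA seqcount and bail out via read_seqcount_retry() whenever
any VMA in the system was modified in the meantime.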


-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
  2026-05-08 10:12   ` [PATCH 1/6] mm: Make per-VMA locks available universally David Hildenbrand (Arm)
@ 2026-05-08 10:58     ` David Hildenbrand (Arm)
  2026-05-08 16:55       ` Lorenzo Stoakes
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-08 10:58 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel
  Cc: Andrew Morton, Liam R. Howlett, linux-mm, Lorenzo Stoakes,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On 5/8/26 12:12, David Hildenbrand (Arm) wrote:
> On 4/29/26 20:19, Dave Hansen wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> The per-VMA locks have been around for several years. They've had some
>> bugs worked out of them and have seen quite wide use. However, they
>> are still only available when architectures explicitly enable them.
>> Remove the conditional compilation around the per-VMA locks, making
>> them available on all architectures and configs.
> 
> Yes, we should really just make it a fixed part of the kernel design now.
> 
>>
>> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
>> when the architecture started using per-VMA locks in the fault
>> handler. But, contrary to the naming, the Kconfig option does not
>> really indicate whether the architecture supports per-VMA locks or
>> not. It is more of a marker for whether the architecture is likely to
>> benefit from per-VMA locks.
>>
>> To me, the most important side-effect of universal availability
>> is letting per-VMA locks be used in SMP=n configs. This lets us use
>> per-VMA locking in all x86 code without fallbacks.
>>
>> Overall, this just generally makes the kernel simpler. Just look at
>> the diffstat. It also opens the door to users that want to use the
>> per-VMA locks in common code. Doing *that* can bring additional
>> simplifications.
>>
>> The downside of this is adding some fields to vm_area_struct and
>> mm_struct. 
> 
> I'd assume most distributions would already enable it.
> 
> mm_struct is very likely not a problem.
> 
> On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
> seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
> that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.
> 
> I'd expect that on most 64bit configs we usually end up with 192 bytes today.
> 
> Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
> unlucky to jump between sizes on most configs.

As Vlastimil reminded me, they have separate slab caches, so they are better
packed. So I don't think a small increase there would really be a problem.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
       [not found] <20260429181954.F50224AE@davehans-spike.ostc.intel.com>
       [not found] ` <20260429182005.00BF70D8@davehans-spike.ostc.intel.com>
       [not found] ` <20260429181955.0C443845@davehans-spike.ostc.intel.com>
@ 2026-05-08 16:52 ` Lorenzo Stoakes
  2026-05-08 17:01   ` Dave Hansen
       [not found] ` <20260429181957.7511C256@davehans-spike.ostc.intel.com>
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:52 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

I'm guessing this is kinda an RFC? :P

On Wed, Apr 29, 2026 at 11:19:54AM -0700, Dave Hansen wrote:
> tl;dr: I hope I'm not missing something big here. The basic
> observation here is that forcing code to account for per-VMA lock
> failure adds a lot of complexity. This series theorizes that with
> some Kconfig changes and a new helper, many callers can avoid writing
> code that falls back to mmap_lock.

In general very much in support of this!

It'd be great to just know that this is available and frankly I think it's a
critical part of the kernel.

Obviously Suren needs to have a look through, most important of all :)

>
> --
>
> When working on some x86 shadow stack code, it was a real pain to
> avoid causing recursive locking problems with mmap_lock. One way
> to avoid those was to avoid mmap_lock and use per-VMA locks instead.
> They are great, but they are not available in all configs which
> makes them unusable in generic code, or if you want to completely
> avoid mmap_lock.

Yeah, lock ordering is a pain.

>
> Make per-VMA locks available in all configs. Right now, they are
> only available on select architectures when SMP and MMU are enabled.
> But all of the primitives that per-VMA locks are built on (RCU, maple
> trees, refcounts) work just fine without SMP or MMU.
>
> Their only real downside is that they make VMAs a wee bit bigger
> on !MMU and !SMP builds.
>
> The upside is much cleaner code, lower complexity and less #ifdeffery.
>
> Building on top of universally-available per-VMA locks, introduce a
> new helper. Since the new API does not require callers to have a
> fallback to mmap_lock, it's much easier to use. Callers could
> potentially replace this very common kernel idiom:
>
> 	mmap_read_lock(mm);
> 	vma = vma_lookup()
> 	// fiddle with vma
> 	mmap_read_unlock(mm);
>
> with:
>
> 	vma = lock_vma_under_rcu_wait(mm, address);

I will look at what you're proposing but this seems a bit like something I
proposed at LSF (but was probably not the right solution for what was under
discussion).

Doing this 'right' would require quite a bit of engineering effort. The VMA
locks are pretty bloody complicated :) so we have to be careful not to spread
the complexity around too much.

But I guess you could 'wait' by doing it in the slow path and then using
vma_start_read_locked()...

Of course that'd not help you with any lock inversions though!

Anyway need to read the code :)

> 	// fiddle with vma
> 	vma_end_read(vma);
>
> Which avoids mmap_lock entirely in the fast path.

Yeah it's nice!

>
> Things I think need more discussion:
>  * The new helper has a horrible name. Suggestions are very welcome.
>  * I'm not very confident that this approach completely avoids the
>    deadlock issues that arise from touching userspace while holding
>    mm-related locks.

Yeah we have to be careful...

>  * Can the helper avoid the goto, maybe by taking the VMA refcount
>    while holding mmap_lock?

Surely that'd defeat the purpose of VMA locks though? you'd hold the mmap lock
for less time but you're still contending vs. _any_ VMA write locks whilst
trying to get a VMA read lock?

Unless it's on a slow path... hmm :)

>  * mm_struct and vm_area_struct "bloat"

Probably not a problem really. For any modern system you're using the fields.

>
> I've included a couple patches where I think the new helper really
> makes the code nicer.
>
> I'm keeping the cc list on the short side for now because I'm not
> actually proposing that we go ahead and do the ipv4 changes, for
> example.

Ack!

>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
>
>  arch/arm/Kconfig                       |    1
>  arch/arm64/Kconfig                     |    1
>  arch/loongarch/Kconfig                 |    1
>  arch/powerpc/platforms/powernv/Kconfig |    1
>  arch/powerpc/platforms/pseries/Kconfig |    1
>  arch/riscv/Kconfig                     |    1
>  arch/s390/Kconfig                      |    1
>  arch/x86/Kconfig                       |    2 -
>  arch/x86/kernel/shstk.c                |   47 +++++++++++-------------------
>  drivers/android/binder_alloc.c         |   39 ++++++-------------------
>  fs/proc/internal.h                     |    2 -
>  fs/proc/task_mmu.c                     |   51 ---------------------------------
>  include/linux/mm.h                     |   12 -------
>  include/linux/mm_types.h               |    7 ----
>  include/linux/mmap_lock.h              |   50 +-------------------------------
>  kernel/fork.c                          |    2 -
>  mm/Kconfig                             |   13 --------
>  mm/mmap_lock.c                         |   45 +++++++++++++++++++++++++++--
>  net/ipv4/tcp.c                         |   31 +++++---------------
>  19 files changed, 82 insertions(+), 226 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/6] mm: Make per-VMA locks available universally
  2026-05-08 10:58     ` David Hildenbrand (Arm)
@ 2026-05-08 16:55       ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 16:55 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Dave Hansen, linux-kernel, Andrew Morton, Liam R. Howlett,
	linux-mm, Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Fri, May 08, 2026 at 12:58:35PM +0200, David Hildenbrand (Arm) wrote:
> On 5/8/26 12:12, David Hildenbrand (Arm) wrote:
> > On 4/29/26 20:19, Dave Hansen wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> The per-VMA locks have been around for several years. They've had some
> >> bugs worked out of them and have seen quite wide use. However, they
> >> are still only available when architectures explicitly enable them.
> >> Remove the conditional compilation around the per-VMA locks, making
> >> them available on all architectures and configs.
> >
> > Yes, we should really just make it a fixed part of the kernel design now.

Agreed

> >
> >>
> >> The approach up to now seemed to be to add ARCH_SUPPORTS_PER_VMA_LOCK
> >> when the architecture started using per-VMA locks in the fault
> >> handler. But, contrary to the naming, the Kconfig option does not
> >> really indicate whether the architecture supports per-VMA locks or
> >> not. It is more of a marker for whether the architecture is likely to
> >> benefit from per-VMA locks.
> >>
> >> To me, the most important side-effect of universal availability
> >> is letting per-VMA locks be used in SMP=n configs. This lets us use
> >> per-VMA locking in all x86 code without fallbacks.
> >>
> >> Overall, this just generally makes the kernel simpler. Just look at
> >> the diffstat. It also opens the door to users that want to use the
> >> per-VMA locks in common code. Doing *that* can bring additional
> >> simplifications.
> >>
> >> The downside of this is adding some fields to vm_area_struct and
> >> mm_struct.
> >
> > I'd assume most distributions would already enable it.

Yes, and I think any modern system will treat it as a necessity!

> >
> > mm_struct is very likely not a problem.

No not at all, that's a lost cause :)) But also less important as it's less
propagated, obviously.

> >
> > On x86-64, the smallest possible size for vm_area_struct (make allnoconfig)
> > seems to be 68 bytes. The largest size (make allyesconfig) with lockdep and all
> > that is 256 bytes. Without lockdep we are at 192 bytes, independent of per-VMA locks.
> >
> > I'd expect that on most 64bit configs we usually end up with 192 bytes today.
> >
> > Given that our slab sizes are ...32/64/96/128/192/..., I guess we'd have to be
> > unlucky to jump between sizes on most configs.
>
> As Vlastimil reminded me, they have separate slab caches, so they are better
> packed. So I don't think a small increase there would really be a problem.

Good to have this information also!

>
> --
> Cheers,
>
> David

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/6] mm: Make per-VMA locks available in all builds
  2026-05-08 16:52 ` [PATCH 0/6] mm: Make per-VMA locks available in all builds Lorenzo Stoakes
@ 2026-05-08 17:01   ` Dave Hansen
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Hansen @ 2026-05-08 17:01 UTC (permalink / raw)
  To: Lorenzo Stoakes, Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On 5/8/26 09:52, Lorenzo Stoakes wrote:
...
>>  * Can the helper avoid the goto, maybe by taking the VMA refcount
>>    while holding mmap_lock?
> 
> Surely that'd defeat the purpose of VMA locks though? you'd hold the mmap lock
> for less time but you're still contending vs. _any_ VMA write locks whilst
> trying to get a VMA read lock?
> 
> Unless it's on a slow path... hmm :)

Yup. It's in a slow path. The example helper I had here does:

retry:
	vma = lock_vma_under_rcu(mm, address);
	if (vma)
		return vma;
	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	mmap_read_unlock(mm);
	/* Bail out rather than looping forever when there is no VMA at all: */
	if (!vma)
		return NULL;
	goto retry;

It avoids mmap_lock in the common, fast case. I was hoping to replace it
with something like:

	vma = lock_vma_under_rcu(mm, address);
	if (vma)
		return vma;
	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	if (!vma) {
		mmap_read_unlock(mm);
		return NULL;
	}
	vma_start_read(vma->vm_mm, vma); // Is this safe?
	mmap_read_unlock(mm);
	return vma;

Which still uses mmap_lock, but avoids the goto. I'm pretty sure the
first one doesn't have any locking problems. The second one, I need to
think about a _lot_ more.



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock
       [not found] ` <20260429181957.7511C256@davehans-spike.ostc.intel.com>
@ 2026-05-08 17:06   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Wed, Apr 29, 2026 at 11:19:57AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> tl;dr: lock_vma_under_rcu() is already a trylock. No need to do both
> it and mmap_read_trylock().
>
> Long Version:
>
> == Background ==
>
> Historically, binder used an mmap_read_trylock() in its shrinker code.
> This ensures that reclaim is not blocked on an mmap_lock. Commit
> 95bc2d4a9020 ("binder: use per-vma lock in page reclaiming") added
> support for the per-VMA lock, but left mmap_read_trylock() as a
> fallback.
>
> This was presumably because the per-VMA locking can fail for several
> reasons and most (all?) lock_vma_under_rcu() callers have a fallback
> to mmap_read_trylock().
>
> == Problem ==
>
> The fallback is not worth the complexity here. lock_vma_under_rcu() is
> essentially already a non-blocking trylock. The main reason it fails
> is also the reason mmap_read_trylock() fails: something is holding
> mmap_write_lock().
>
> The only remedy for a collision with mmap_write_lock() is to wait,
> which this code can not do. So the "fallback" after
> lock_vma_under_rcu() failure is not really a fallback: it is really
> likely to just be retrying in vain. That retry in an of itself isn't
> horrible. But it adds complexity.
>
> == Solution ==
>
> Now that per-VMA locks are universally available, lock_vma_under_rcu()
> will not persistently fail. Rely on it alone and simplify the code.
>
> Full disclosure: I originally tried to do this with
> lock_vma_under_rcu_wait(), but it did not fit well with the mmap_lock
> trylock semantics. Claude caught this in a review and suggested the
> approach in this path. It seemed sane to me. So, Suggesed-by: Claude,
> I guess.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

I mean this seems reasonable to me, I don't really understand why we'd fall back
to mmap... try lock :)

If semantically you're trylocking then as you say, lock_vma_under_rcu() already
does that.

Honestly I feel this could be submitted separately from the series.

I'm not a binder guy, but this looks right to me so:

Acked-by: Lorenzo Stoakes <ljs@kernel.org> with nit below addressed.

> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
>  b/drivers/android/binder_alloc.c |   22 +++++-----------------
>  1 file changed, 5 insertions(+), 17 deletions(-)
>
> diff -puN drivers/android/binder_alloc.c~binder-try-vma-lock drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-try-vma-lock	2026-04-29 11:18:50.066607065 -0700
> +++ b/drivers/android/binder_alloc.c	2026-04-29 11:18:50.069607180 -0700
> @@ -1142,7 +1142,6 @@ enum lru_status binder_alloc_free_page(s
>  	struct vm_area_struct *vma;
>  	struct page *page_to_free;
>  	unsigned long page_addr;
> -	int mm_locked = 0;

Man why are we using int instead of a bool in 2026 in the first place :P

>  	size_t index;
>
>  	if (!mmget_not_zero(mm))
> @@ -1151,15 +1150,10 @@ enum lru_status binder_alloc_free_page(s
>  	index = mdata->page_index;
>  	page_addr = alloc->vm_start + index * PAGE_SIZE;
>
> -	/* attempt per-vma lock first */
> +	/* attempt per-vma lock */
>  	vma = lock_vma_under_rcu(mm, page_addr);
> -	if (!vma) {
> -		/* fall back to mmap_lock */
> -		if (!mmap_read_trylock(mm))
> -			goto err_mmap_read_lock_failed;
> -		mm_locked = 1;
> -		vma = vma_lookup(mm, page_addr);
> -	}
> +	if (!vma)
> +		goto err_mmap_read_lock_failed;

Nit, but we probably want to rename that to err_vma_lock_failed or something!

>
>  	if (!mutex_trylock(&alloc->mutex))
>  		goto err_get_alloc_mutex_failed;
> @@ -1191,10 +1185,7 @@ enum lru_status binder_alloc_free_page(s
>  	}
>
>  	mutex_unlock(&alloc->mutex);
> -	if (mm_locked)
> -		mmap_read_unlock(mm);
> -	else
> -		vma_end_read(vma);
> +	vma_end_read(vma);
>  	mmput_async(mm);
>  	binder_free_page(page_to_free);
>
> @@ -1203,10 +1194,7 @@ enum lru_status binder_alloc_free_page(s
>  err_invalid_vma:
>  	mutex_unlock(&alloc->mutex);
>  err_get_alloc_mutex_failed:
> -	if (mm_locked)
> -		mmap_read_unlock(mm);
> -	else
> -		vma_end_read(vma);
> +	vma_end_read(vma);
>  err_mmap_read_lock_failed:
>  	mmput_async(mm);
>  err_mmget:
> _

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
       [not found] ` <20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>
@ 2026-05-08 17:26   ` Lorenzo Stoakes
  2026-05-08 20:15     ` Lorenzo Stoakes
  0 siblings, 1 reply; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> == Background ==
>
> There are basically two parallel ways to look up a VMA: the
> traditional way, which is protected by mmap_lock, and the RCU-based
> per-VMA lock way which is based on RCU and refcounts.
>
> == Problems ==
>
> The mmap_lock one is more straightforward to use but it has a big
> disadvantage in that it can not be mixed with page faults since those
> can take mmap_lock for read.
>
> The RCU one can be mixed with faults, but it is not available in all
> configs, so all RCU users need to be able to fall back to the
> traditional way.
>
> == Solution ==
>
> Add a variant of the RCU-based lookup that waits for writers. This is
> basically the same as the existing RCU-based lookup, but it also takes
> mmap_lock for read and waits for writers to finish before returning
> the VMA. This has two big advantages:
>
>  1. Callers do not need to have a fallback path for when they
>     collide with writers.
>  2. It can be used in contexts where page faults can happen because
>     it can take the mmap_lock for read but never *holds* it.
>
> == Discussion ==
>
> I am not married to the naming here at all. Naming suggestions would
> be much appreciated.
>
> This basically uses mmap_lock to wait for writers, nothing else. The
> VMA is obviously stable under mmap_read_lock() and the code _can_
> likely take advantage of that and possibly even remove the goto. For
> instance, it could (probably) bump the VMA refcount and exclude future
> writers. That would eliminate the goto.
>
> But the approach as-is is probably the smallest line count and
> arguably the simplest approach. It is a good place to start a
> conversation if nothing else.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
>  b/include/linux/mmap_lock.h |    2 ++
>  b/mm/mmap_lock.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
>
> diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-04-29 11:18:50.633628887 -0700
> +++ b/include/linux/mmap_lock.h	2026-04-29 11:18:50.707631737 -0700
> @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
>
>  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  					  unsigned long address);
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> +					  unsigned long address);
>
>  /*
>   * Locks next vma pointed by the iterator. Confirms the locked vma has not
> diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-04-29 11:18:50.704631622 -0700
> +++ b/mm/mmap_lock.c	2026-04-29 11:18:50.707631737 -0700
> @@ -340,6 +340,49 @@ inval:
>  	return NULL;
>  }
>
> +/*
> + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> + * 'address'.
> + *
> + * The fast path does not take mmap lock.
> + */
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> +					       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +
> +retry:
> +	vma = lock_vma_under_rcu(mm, address);
> +	/* Fast path: return stable VMA covering 'address': */
> +	if (vma)
> +		return vma;
> +
> +	/*
> +	 * Slow path: the VMA covering 'address' is being modified,
> +	 * or there is no VMA covering 'address'. Rule out the
> +	 * possibility that the VMA is being modified:
> +	 */
> +	mmap_read_lock(mm);
> +	vma = vma_lookup(mm, address);
> +	mmap_read_unlock(mm);
> +
> +	/* There was for sure no VMA covering 'address': */
> +	if (!vma)
> +		return NULL;
> +
> +	/*
> +	 * VMA was likely being modified during RCU lookup. Try again.
> +	 * mmap_read_lock() waited for the writer to complete and the
> +	 * writer is now done.
> +	 *
> +	 * There is no guarantee that any single retry will succeed,
> +	 * and it is possible but highly unlikely this will loop
> +	 * forever.
> +	 */
> +	goto retry;
> +}

Hmm yeah this is not ideal :)

You don't have to do any of this; we already have logic to help out here -
vma_start_read_locked().

That uses the fact the mmap read lock is held to pin the VMA lock, because VMA
write locks require an mmap write lock, and the mmap read lock prevents them.

That way you can eliminate the retry.

So instead:

/*
 * Tries to lock under RCU, failing that it acquires the VMA lock with the mmap
 * read lock held.
 */
struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
					       unsigned long address)
{
	struct vm_area_struct *vma;

	might_sleep();

	vma = lock_vma_under_rcu(mm, address);
	if (vma)
		return vma;

	/* Slow path: preclude VMA writers by getting mmap read lock. */
	guard(rwsem_read)(&mm->mmap_lock);
	vma = vma_lookup(mm, address);
	/* VMA isn't there. */
	if (!vma)
		return NULL;

	return vma_start_read_locked(vma);
}

(Untested, not even build tested code)

Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
also does some trace stuff, but the guard makes it WAY nicer (when it goes
out of scope the mmap read lock is dropped).

Maybe we could add a custom mmap lock guard to cover that too?
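
Something like this via the DEFINE_GUARD() machinery ought to do it
(untested):

	DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
		     mmap_read_lock(_T), mmap_read_unlock(_T))

Then the slow path just does guard(mmap_read_lock)(mm); and the
tracepoints are kept, since the guard wraps the existing helpers.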

Suren - I wonder if vma_start_read_locked() actually needs to return a
boolean? The failure cases for __refcount_inc_not_zero_limited_acquire() are -
detached or excluding readers on write/detach.

But in both of those cases, vma_lookup() would surely not find the VMA, and
since we're precluding writers (so no write lock, no detach possible) that means
we should never hit it?

Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
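
i.e. roughly this (sketch only, the bool-returning inner helper is
invented):

	static inline void vma_start_read_locked(struct vm_area_struct *vma)
	{
		mmap_assert_locked(vma->vm_mm);
		/* Writers are excluded by the mmap read lock; cannot fail: */
		VM_WARN_ON_ONCE(!__vma_start_read_locked(vma));
	}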


> +
>  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
>  							    struct vma_iterator *vmi,
>  							    unsigned long from_addr)
> _

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/6] binder: Remove mmap_lock fallback
       [not found] ` <20260429182000.93887DFB@davehans-spike.ostc.intel.com>
@ 2026-05-08 17:29   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Wed, Apr 29, 2026 at 11:20:00AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Previously, the per-VMA locking could fail in the face of writers
> which necessitates a fallback to mmap_lock. The new
> lock_vma_under_rcu_wait() will wait for writers instead of failing.
>
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

LGTM in principle, though again not a binder dev so just an A-b :)

Acked-by: Lorenzo Stoakes <ljs@kernel.org>

> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
>  b/drivers/android/binder_alloc.c |   17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
>
> diff -puN drivers/android/binder_alloc.c~binder-vma-waiter drivers/android/binder_alloc.c
> --- a/drivers/android/binder_alloc.c~binder-vma-waiter	2026-04-29 11:18:51.307654829 -0700
> +++ b/drivers/android/binder_alloc.c	2026-04-29 11:18:51.310654944 -0700
> @@ -259,21 +259,14 @@ static int binder_page_insert(struct bin
>  	struct vm_area_struct *vma;
>  	int ret = -ESRCH;
>
> -	/* attempt per-vma lock first */
> -	vma = lock_vma_under_rcu(mm, addr);
> -	if (vma) {
> -		if (binder_alloc_is_mapped(alloc))
> -			ret = vm_insert_page(vma, addr, page);
> -		vma_end_read(vma);
> +	vma = lock_vma_under_rcu_wait(mm, addr);

Yeah this name is definitely iffy haha!

> +	if (!vma)
>  		return ret;
> -	}
>
> -	/* fall back to mmap_lock */
> -	mmap_read_lock(mm);
> -	vma = vma_lookup(mm, addr);
> -	if (vma && binder_alloc_is_mapped(alloc))
> +	if (binder_alloc_is_mapped(alloc))
>  		ret = vm_insert_page(vma, addr, page);
> -	mmap_read_unlock(mm);
> +
> +	vma_end_read(vma);
>
>  	return ret;
>  }
> _

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/6] tcp: Remove mmap_lock fallback path
       [not found] ` <20260429182002.BB61C7BC@davehans-spike.ostc.intel.com>
@ 2026-05-08 17:32   ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 17:32 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Wed, Apr 29, 2026 at 11:20:02AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Previously, the per-VMA locking could fail in the face of writers
> which necessitates a fallback to mmap_lock. The new
> lock_vma_under_rcu_wait() will wait for writers instead of failing.
>
> Use the new helper. Wait for writers. Remove the fallback to mmap_lock.
>
> This really is a nice cleanup. It removes the need to pass the lock
> state back and forth to find_tcp_vma().
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>

Yeah, LGTM again, though am not a networking guy so:

Acked-by: Lorenzo Stoakes <ljs@kernel.org>

> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Lorenzo Stoakes <ljs@kernel.org>
> Cc: Vlastimil Babka <vbabka@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: linux-mm@kvack.org
> ---
>
>  b/net/ipv4/tcp.c |   31 +++++++++----------------------
>  1 file changed, 9 insertions(+), 22 deletions(-)
>
> diff -puN net/ipv4/tcp.c~ipv4-tcp-vma-waiter net/ipv4/tcp.c
> --- a/net/ipv4/tcp.c~ipv4-tcp-vma-waiter	2026-04-29 11:18:51.870676498 -0700
> +++ b/net/ipv4/tcp.c	2026-04-29 11:18:51.874676652 -0700
> @@ -2171,27 +2171,18 @@ static void tcp_zc_finalize_rx_tstamp(st
>  }
>
>  static struct vm_area_struct *find_tcp_vma(struct mm_struct *mm,
> -					   unsigned long address,
> -					   bool *mmap_locked)
> +					   unsigned long address)
>  {
> -	struct vm_area_struct *vma = lock_vma_under_rcu(mm, address);
> +	struct vm_area_struct *vma = lock_vma_under_rcu_wait(mm, address);
>
> -	if (vma) {
> -		if (vma->vm_ops != &tcp_vm_ops) {
> -			vma_end_read(vma);
> -			return NULL;
> -		}
> -		*mmap_locked = false;
> -		return vma;
> -	}
> +	if (!vma)
> +		return NULL;
>
> -	mmap_read_lock(mm);
> -	vma = vma_lookup(mm, address);
> -	if (!vma || vma->vm_ops != &tcp_vm_ops) {
> -		mmap_read_unlock(mm);
> +	if (vma->vm_ops != &tcp_vm_ops) {
> +		vma_end_read(vma);
>  		return NULL;
>  	}
> -	*mmap_locked = true;
> +
>  	return vma;
>  }
>
> @@ -2212,7 +2203,6 @@ static int tcp_zerocopy_receive(struct s
>  	u32 seq = tp->copied_seq;
>  	u32 total_bytes_to_map;
>  	int inq = tcp_inq(sk);
> -	bool mmap_locked;
>  	int ret;
>
>  	zc->copybuf_len = 0;
> @@ -2237,7 +2227,7 @@ static int tcp_zerocopy_receive(struct s
>  		return 0;
>  	}
>
> -	vma = find_tcp_vma(current->mm, address, &mmap_locked);
> +	vma = find_tcp_vma(current->mm, address);
>  	if (!vma)
>  		return -EINVAL;
>
> @@ -2319,10 +2309,7 @@ static int tcp_zerocopy_receive(struct s
>  						   zc, total_bytes_to_map);
>  	}
>  out:
> -	if (mmap_locked)
> -		mmap_read_unlock(current->mm);
> -	else
> -		vma_end_read(vma);
> +	vma_end_read(vma);
>  	/* Try to copy straggler data. */
>  	if (!ret)
>  		copylen = tcp_zc_handle_leftover(zc, sk, skb, &seq, copybuf_len, tss);
> _

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
  2026-05-08 17:26   ` [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers Lorenzo Stoakes
@ 2026-05-08 20:15     ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 20:15 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, Andrew Morton, Liam R. Howlett, linux-mm,
	Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka

On Fri, May 08, 2026 at 06:26:57PM +0100, Lorenzo Stoakes wrote:
> On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
> >
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > == Background ==
> >
> > There are basically two parallel ways to look up a VMA: the
> > traditional way, which is protected by mmap_lock, and the RCU-based
> > per-VMA lock way which is based on RCU and refcounts.
> >
> > == Problems ==
> >
> > The mmap_lock one is more straightforward to use but it has a big
> > disadvantage in that it can not be mixed with page faults since those
> > can take mmap_lock for read.
> >
> > The RCU one can be mixed with faults, but it is not available in all
> > configs, so all RCU users need to be able to fall back to the
> > traditional way.
> >
> > == Solution ==
> >
> > Add a variant of the RCU-based lookup that waits for writers. This is
> > basically the same as the existing RCU-based lookup, but it also takes
> > mmap_lock for read and waits for writers to finish before returning
> > the VMA. This has two big advantages:
> >
> >  1. Callers do not need to have a fallback path for when they
> >     collide with writers.
> >  2. It can be used in contexts where page faults can happen because
> >     it can take the mmap_lock for read but never *holds* it.
> >
> > == Discussion ==
> >
> > I am not married to the naming here at all. Naming suggestions would
> > be much appreciated.
> >
> > This basically uses mmap_lock to wait for writers, nothing else. The
> > VMA is obviously stable under mmap_read_lock() and the code _can_
> > likely take advantage of that and possibly even remove the goto. For
> > instance, it could (probably) bump the VMA refcount and exclude future
> > writers. That would eliminate the goto.
> >
> > But the approach as-is is probably the smallest line count and
> > arguably the simplest approach. It is a good place to start a
> > conversation if nothing else.
> >
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Lorenzo Stoakes <ljs@kernel.org>
> > Cc: Vlastimil Babka <vbabka@kernel.org>
> > Cc: Shakeel Butt <shakeel.butt@linux.dev>
> > Cc: linux-mm@kvack.org
> > ---
> >
> >  b/include/linux/mmap_lock.h |    2 ++
> >  b/mm/mmap_lock.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 45 insertions(+)
> >
> > diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> > --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-04-29 11:18:50.633628887 -0700
> > +++ b/include/linux/mmap_lock.h	2026-04-29 11:18:50.707631737 -0700
> > @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
> >
> >  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >  					  unsigned long address);
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					  unsigned long address);
> >
> >  /*
> >   * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> > --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-04-29 11:18:50.704631622 -0700
> > +++ b/mm/mmap_lock.c	2026-04-29 11:18:50.707631737 -0700
> > @@ -340,6 +340,49 @@ inval:
> >  	return NULL;
> >  }
> >
> > +/*
> > + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> > + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> > + * 'address'.
> > + *
> > + * The fast path does not take mmap lock.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					       unsigned long address)
> > +{
> > +	struct vm_area_struct *vma;
> > +
> > +retry:
> > +	vma = lock_vma_under_rcu(mm, address);
> > +	/* Fast path: return stable VMA covering 'address': */
> > +	if (vma)
> > +		return vma;
> > +
> > +	/*
> > +	 * Slow path: the VMA covering 'address' is being modified,
> > +	 * or there is no VMA covering 'address'. Rule out the
> > +	 * possibility that the VMA is being modified:
> > +	 */
> > +	mmap_read_lock(mm);
> > +	vma = vma_lookup(mm, address);
> > +	mmap_read_unlock(mm);
> > +
> > +	/* There was for sure no VMA covering 'address': */
> > +	if (!vma)
> > +		return NULL;
> > +
> > +	/*
> > +	 * VMA was likely being modified during RCU lookup. Try again.
> > +	 * mmap_read_lock() waited for the writer to complete and the
> > +	 * writer is now done.
> > +	 *
> > +	 * There is no guarantee that any single retry will succeed,
> > +	 * and it is possible but highly unlikely this will loop
> > +	 * forever.
> > +	 */
> > +	goto retry;
> > +}
>
> Hmm yeah this is not ideal :)
>
> You don't have to do any of this we already have logic to help out here -
> vma_start_read_locked().
>
> That uses the fact the mmap read lock is held to pin the VMA lock, because VMA
> write locks require an mmap write lock, and the mmap read lock prevents them.
>
> That way you can eliminate the retry.
>
> So instead:
>
> /*
>  * Tries to lock under RCU, failing that it acquires the VMA lock with the mmap
>  * read lock held.
>  */
> struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> 					       unsigned long address)
> {
> 	struct vm_area_struct *vma;
>
> 	might_sleep();
>
> 	vma = lock_vma_under_rcu(mm, address);
> 	if (vma)
> 		return vma;
>
> 	/* Slow path: preclude VMA writers by getting mmap read lock. */
> 	guard(rwsem_read)(&mm->mmap_lock);

Actually we'd possibly want the killable version of this? So a fatal signal can
interrupt and not indefinitely block others.

> 	vma = vma_lookup(mm, address);
> 	/* VMA isn't there. */
> 	if (!vma)
> 		return NULL;
>
> 	return vma_start_read_locked(vma);

Sorry this should be:

 	if (!vma_start_read_locked(vma))
		return NULL;
	return vma;

But as per below, I'm not sure we will actually see vma_start_read_locked() fail;
it doesn't touch the write-side sequence number so overflow shouldn't be a problem.

> }
>
> (Untested, not even build tested code)
>
> Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
> also does some trace stuff, but the guard makes it WAY nicer so (when it goes
> out of scope the mmap read lock is dropped).
>
> Maybe we could add a custom mmap lock guard to cover that too?
>
> Suren - I actually wonder if vma_start_read_locked() actually needs to return a
> boolean? The failure cases for __refcount_inc_not_zero_limited_acquire() are -
> detached or excluding readers on write/detach.
>
> But in both of those cases, vma_lookup() would surely not find the VMA, and
> since we're precluding writers (so no write lock, no detach possible) that means
> we should never hit it?
>
> Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
>
>
> > +
> >  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
> >  							    struct vma_iterator *vmi,
> >  							    unsigned long from_addr)
> > _
>
> Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path
  2026-05-05 16:39     ` Dave Hansen
@ 2026-05-08 20:39       ` Lorenzo Stoakes
  0 siblings, 0 replies; 13+ messages in thread
From: Lorenzo Stoakes @ 2026-05-08 20:39 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Edgecombe, Rick P, linux-kernel@vger.kernel.org,
	dave.hansen@linux.intel.com, Liam.Howlett@oracle.com,
	linux-mm@kvack.org, surenb@google.com, vbabka@kernel.org,
	shakeel.butt@linux.dev, akpm@linux-foundation.org

On Tue, May 05, 2026 at 09:39:09AM -0700, Dave Hansen wrote:
> On 5/4/26 16:15, Edgecombe, Rick P wrote:
> >
> > I guess the problem is the lock ordering. Not sure if there are any slow path
> > avoidance details that could make this splat a false positive. But how about
> > this simpler munmap() case:
> >
> > Shadow stack signal                          munmap()
> > -------------------                          --------
> > vma_start_read() (VM_SHADOW_STACK check)
> >                                              mmap_write_lock()
> > mmap_read_lock() (user fault) <- deadlock
> >                                              vma_start_write() <-deadlock
>
> It's a little more complicated than that in practice, but I think you're
> right.
>
> I'm not sure when this would happen in practice because the fault is
> actually on the VMA that's being held for read. So I think another
> writer would have had to sneak in there and zap the VMA.

Honestly I think any workaround is just going to be more complicated than the
existing code, which sort of defeats the purpose of the series.

There's not really a way to speculate with a VMA seqnum because you'd have to be
able to observe its vma->vm_lock_seq, and to do that you'd have to find it again
immediately afterwards - and then you'd have looked it up twice, got the lock
twice, only to confirm it's the same damn thing :)

The issue is that a page fault on the same thread is always going to risk an
mmap read lock being taken (possibly due to I/O waiting and fault retry, for
one). And faults/zaps are inherently racy, and neither acquires the write lock,
so the read lock doesn't preclude them.

And you can't really disable page faults because you're potentially relying on
them to populate what you're touching...

Also there's some tricky stuff done during initial stack initialisation (when
the stack is set up) that can cause a headache as well - see relocate_vma_down()
for something to make life more painful.

So I think the existing code is simpler.

It doesn't mean it isn't still useful to move towards having VMA locks
everywhere though :) unless Suren or others can find a flaw with that...

>
> The funny thing is that the fault handler is really just trying to find
> the VMA. The thing causing the fault *has* the VMA. So it's as simple as
> just passing the VMA down into the fault handler, right? How hard could
> it be? ;)
>
> There are still games to play, but they all involve dropping locks and
> retrying, like:
>
> retry:
> 	vma = lock_vma_under_rcu()
> 	// muck with VMA
> 	pagefault_disable() // avoid deadlock
> 	ret = copy_from_user()
> 	pagefault_enable()
> 	vma_end_read();
>
> 	if (!ret)
> 		return SUCCESS;
>
> 	mmap_read_lock()
> 	vma = vma_lookup()
> 	mmap_read_unlock() // avoid deadlock before touching userspace
> 	// check for valid VMA to avoid looping when there is no VMA
> 	if (!vma)
> 		return -ERRNO;
>
> 	// uh oh, slow path, something faulted
> 	get_user_pages()??
> 	//or
> 	copy_from_user() without the VMA??
>
> 	goto retry;
>
>
> This also needs some very careful thought, but something like this
> should work, where we avoid fault handling (and lock taking) in the
> actual #PF and do it in a context where the VMA lock is held:
>
> 	vma = lock_vma_under_rcu();
> 	pagefault_disable() // avoid deadlock
>
> 	while (1) {
> 		ret = copy_from_user()
> 		if (!ret)
> 			break;
> 		handle_mm_fault(vma, address, FAULT_FLAG_VMA_LOCK...);
> 	};
>
> 	pagefault_enable()
> 	vma_end_read();
>
> That's effectively just short-circuiting the #PF code which does the:
>
>         vma = lock_vma_under_rcu(mm, address);
> 	...
>         fault = handle_mm_fault(vma, address, ... FAULT_FLAG_VMA_LOCK)
>
> sequence _itself_.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2026-05-08 20:39 UTC | newest]

Thread overview: 13+ messages
     [not found] <20260429181954.F50224AE@davehans-spike.ostc.intel.com>
     [not found] ` <20260429182005.00BF70D8@davehans-spike.ostc.intel.com>
2026-05-04 23:15   ` [PATCH 6/6] x86/mm: Avoid mmap lock for shadow stack pop fast path Edgecombe, Rick P
2026-05-05 16:39     ` Dave Hansen
2026-05-08 20:39       ` Lorenzo Stoakes
     [not found] ` <20260429181955.0C443845@davehans-spike.ostc.intel.com>
2026-05-08 10:12   ` [PATCH 1/6] mm: Make per-VMA locks available universally David Hildenbrand (Arm)
2026-05-08 10:58     ` David Hildenbrand (Arm)
2026-05-08 16:55       ` Lorenzo Stoakes
2026-05-08 16:52 ` [PATCH 0/6] mm: Make per-VMA locks available in all builds Lorenzo Stoakes
2026-05-08 17:01   ` Dave Hansen
     [not found] ` <20260429181957.7511C256@davehans-spike.ostc.intel.com>
2026-05-08 17:06   ` [PATCH 2/6] binder: Make shrinker rely solely on per-VMA lock Lorenzo Stoakes
     [not found] ` <20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>
2026-05-08 17:26   ` [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers Lorenzo Stoakes
2026-05-08 20:15     ` Lorenzo Stoakes
     [not found] ` <20260429182000.93887DFB@davehans-spike.ostc.intel.com>
2026-05-08 17:29   ` [PATCH 4/6] binder: Remove mmap_lock fallback Lorenzo Stoakes
     [not found] ` <20260429182002.BB61C7BC@davehans-spike.ostc.intel.com>
2026-05-08 17:32   ` [PATCH 5/6] tcp: Remove mmap_lock fallback path Lorenzo Stoakes
