Date: Fri, 8 May 2026 21:15:54 +0100
From: Lorenzo Stoakes
To: Dave Hansen
Cc: linux-kernel@vger.kernel.org, Andrew Morton, "Liam R. Howlett",
 linux-mm@kvack.org, Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
Subject: Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
References: <20260429181954.F50224AE@davehans-spike.ostc.intel.com>
 <20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>

On Fri, May 08, 2026 at 06:26:57PM +0100, Lorenzo Stoakes wrote:
> On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
> >
> > From: Dave Hansen
> >
> > == Background ==
> >
> > There are basically two parallel ways to look up a VMA: the
> > traditional way, which is protected by mmap_lock, and the RCU-based
> > per-VMA lock way which is based on RCU and refcounts.
> >
> > == Problems ==
> >
> > The mmap_lock one is more straightforward to use but it has a big
> > disadvantage in that it can not be mixed with page faults since those
> > can take mmap_lock for read.
> >
> > The RCU one can be mixed with faults, but it is not available in all
> > configs, so all RCU users need to be able to fall back to the
> > traditional way.
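
As an aside for anyone following along - the fallback dance being described
here looks roughly like the below. This is just a sketch, not lifted from any
actual call site, with error handling and the caller's context elided:

	struct vm_area_struct *vma;

	/* Fast path: per-VMA read lock, safe to mix with page faults. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* ... use vma ... */
		vma_end_read(vma);
		return;
	}

	/* Slow path: writer collision or no VMA - fall back to mmap_lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	/* ... use vma, but must not fault with mmap_lock held ... */
	mmap_read_unlock(mm);

Every RCU user ends up carrying some variant of this duplication, which is
what the below is trying to address.
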
> >
> > == Solution ==
> >
> > Add a variant of the RCU-based lookup that waits for writers. This is
> > basically the same as the existing RCU-based lookup, but it also takes
> > mmap_lock for read and waits for writers to finish before returning
> > the VMA. This has two big advantages:
> >
> >	1. Callers do not need to have a fallback path for when they
> >	   collide with writers.
> >	2. It can be used in contexts where page faults can happen because
> >	   it can take the mmap_lock for read but never *holds* it.
> >
> > == Discussion ==
> >
> > I am not married to the naming here at all. Naming suggestions would
> > be much appreciated.
> >
> > This basically uses mmap_lock to wait for writers, nothing else. The
> > VMA is obviously stable under mmap_read_lock() and the code _can_
> > likely take advantage of that and possibly even remove the goto. For
> > instance, it could (probably) bump the VMA refcount and exclude future
> > writers. That would eliminate the goto.
> >
> > But the approach as-is is probably the smallest line count and
> > arguably the simplest approach. It is a good place to start a
> > conversation if nothing else.
> >
> > Signed-off-by: Dave Hansen
> > Cc: Suren Baghdasaryan
> > Cc: Andrew Morton
> > Cc: "Liam R. Howlett"
> > Cc: Lorenzo Stoakes
> > Cc: Vlastimil Babka
> > Cc: Shakeel Butt
> > Cc: linux-mm@kvack.org
> > ---
> >
> >  b/include/linux/mmap_lock.h |    2 ++
> >  b/mm/mmap_lock.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 45 insertions(+)
> >
> > diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> > --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-04-29 11:18:50.633628887 -0700
> > +++ b/include/linux/mmap_lock.h	2026-04-29 11:18:50.707631737 -0700
> > @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
> >
> >  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >  					  unsigned long address);
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					       unsigned long address);
> >
> >  /*
> >   * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> > --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-04-29 11:18:50.704631622 -0700
> > +++ b/mm/mmap_lock.c	2026-04-29 11:18:50.707631737 -0700
> > @@ -340,6 +340,49 @@ inval:
> >  	return NULL;
> >  }
> >
> > +/*
> > + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> > + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> > + * 'address'.
> > + *
> > + * The fast path does not take mmap lock.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					       unsigned long address)
> > +{
> > +	struct vm_area_struct *vma;
> > +
> > +retry:
> > +	vma = lock_vma_under_rcu(mm, address);
> > +	/* Fast path: return stable VMA covering 'address': */
> > +	if (vma)
> > +		return vma;
> > +
> > +	/*
> > +	 * Slow path: the VMA covering 'address' is being modified,
> > +	 * or there is no VMA covering 'address'. Rule out the
> > +	 * possibility that the VMA is being modified:
> > +	 */
> > +	mmap_read_lock(mm);
> > +	vma = vma_lookup(mm, address);
> > +	mmap_read_unlock(mm);
> > +
> > +	/* There was for sure no VMA covering 'address': */
> > +	if (!vma)
> > +		return NULL;
> > +
> > +	/*
> > +	 * VMA was likely being modified during RCU lookup. Try again.
> > +	 * mmap_read_lock() waited for the writer to complete and the
> > +	 * writer is now done.
> > +	 *
> > +	 * There is no guarantee that any single retry will succeed,
> > +	 * and it is possible but highly unlikely this will loop
> > +	 * forever.
> > +	 */
> > +	goto retry;
> > +}
>
> Hmm yeah, this is not ideal :)
>
> You don't have to do any of this - we already have logic to help out here:
> vma_start_read_locked().
>
> That uses the fact the mmap read lock is held to pin the VMA lock, because VMA
> write locks require an mmap write lock, and the mmap read lock prevents them.
>
> That way you can eliminate the retry.
>
> So instead:
>
> /*
>  * Tries to lock under RCU; failing that, it acquires the VMA lock with the
>  * mmap read lock held.
>  */
> struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> 					       unsigned long address)
> {
> 	struct vm_area_struct *vma;
>
> 	might_sleep();
>
> 	vma = lock_vma_under_rcu(mm, address);
> 	if (vma)
> 		return vma;
>
> 	/* Slow path: preclude VMA writers by getting mmap read lock. */
> 	guard(rwsem_read)(&mm->mmap_lock);

Actually we'd possibly want the killable version of this? So a fatal signal
can interrupt and not indefinitely block others.
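
Something like the below, i.e. (equally untested - and as far as I know there
is no killable variant of the rwsem guard, so the unlock would then have to be
done by hand on each return path):

	/* Slow path: preclude VMA writers by getting mmap read lock. */
	if (mmap_read_lock_killable(mm))
		return NULL;

Though then callers couldn't tell 'no VMA covers the address' apart from
'fatal signal pending' - not sure whether that matters for the intended users.
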
> 	vma = vma_lookup(mm, address);
> 	/* VMA isn't there. */
> 	if (!vma)
> 		return NULL;
>
> 	return vma_start_read_locked(vma);

Sorry, this should be:

	if (!vma_start_read_locked(vma))
		return NULL;

	return vma;

But as per below, I'm not sure we will actually see vma_start_read_locked()
fail - it doesn't touch the write side sequence number, so overflow shouldn't
be a problem.

> }
>
> (Untested, not even build tested code)
>
> Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
> also does some trace stuff, but the guard makes it WAY nicer (when it goes
> out of scope the mmap read lock is dropped).
>
> Maybe we could add a custom mmap lock guard to cover that too?
>
> Suren - I actually wonder if vma_start_read_locked() actually needs to return
> a boolean? The failure cases for __refcount_inc_not_zero_limited_acquire()
> are: detached, or excluding readers on write/detach.
>
> But in both of those cases, vma_lookup() would surely not find the VMA, and
> since we're precluding writers (so no write lock, no detach possible) that
> means we should never hit it?
>
> Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
>
> > +
> >  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
> > 						     struct vma_iterator *vmi,
> > 						     unsigned long from_addr)
> > _

Cheers, Lorenzo
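
P.S. on the custom mmap lock guard - I was thinking of something along the
lines of (completely untested, and the name is just a placeholder):

	DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
		     mmap_read_lock(_T), mmap_read_unlock(_T))

Since that goes through mmap_read_lock() itself, it'd pick up the trace stuff
for free, and the slow path above would then just be:

	guard(mmap_read_lock)(mm);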