Date: Fri, 8 May 2026 21:15:54 +0100
From: Lorenzo Stoakes
To: Dave Hansen
Cc: linux-kernel@vger.kernel.org, Andrew Morton, "Liam R. Howlett",
 linux-mm@kvack.org, Shakeel Butt, Suren Baghdasaryan, Vlastimil Babka
Subject: Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
References: <20260429181954.F50224AE@davehans-spike.ostc.intel.com>
 <20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>

On Fri, May 08, 2026 at 06:26:57PM +0100, Lorenzo Stoakes wrote:
> On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
> >
> > From: Dave Hansen
> >
> > == Background ==
> >
> > There are basically two parallel ways to look up a VMA: the
> > traditional way, which is protected by mmap_lock, and the RCU-based
> > per-VMA lock way which is based on RCU and refcounts.
> >
> > == Problems ==
> >
> > The mmap_lock one is more straightforward to use but it has a big
> > disadvantage in that it can not be mixed with page faults since those
> > can take mmap_lock for read.
> >
> > The RCU one can be mixed with faults, but it is not available in all
> > configs, so all RCU users need to be able to fall back to the
> > traditional way.
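
As an aside for anyone following along - the fallback dance being described
here looks roughly like the below. This is just a sketch, not lifted from any
actual call site, with error handling and the caller's context elided:

	struct vm_area_struct *vma;

	/* Fast path: per-VMA read lock, safe to mix with page faults. */
	vma = lock_vma_under_rcu(mm, address);
	if (vma) {
		/* ... use vma ... */
		vma_end_read(vma);
		return;
	}

	/* Slow path: writer collision or no VMA - fall back to mmap_lock. */
	mmap_read_lock(mm);
	vma = vma_lookup(mm, address);
	/* ... use vma, but must not fault with mmap_lock held ... */
	mmap_read_unlock(mm);

Every RCU user ends up carrying some variant of this duplication, which is
what the below is trying to address.
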
> >
> > == Solution ==
> >
> > Add a variant of the RCU-based lookup that waits for writers. This is
> > basically the same as the existing RCU-based lookup, but it also takes
> > mmap_lock for read and waits for writers to finish before returning
> > the VMA. This has two big advantages:
> >
> >	1. Callers do not need to have a fallback path for when they
> >	   collide with writers.
> >	2. It can be used in contexts where page faults can happen because
> >	   it can take the mmap_lock for read but never *holds* it.
> >
> > == Discussion ==
> >
> > I am not married to the naming here at all. Naming suggestions would
> > be much appreciated.
> >
> > This basically uses mmap_lock to wait for writers, nothing else. The
> > VMA is obviously stable under mmap_read_lock() and the code _can_
> > likely take advantage of that and possibly even remove the goto. For
> > instance, it could (probably) bump the VMA refcount and exclude future
> > writers. That would eliminate the goto.
> >
> > But the approach as-is is probably the smallest line count and
> > arguably the simplest approach. It is a good place to start a
> > conversation if nothing else.
> >
> > Signed-off-by: Dave Hansen
> > Cc: Suren Baghdasaryan
> > Cc: Andrew Morton
> > Cc: "Liam R. Howlett"
> > Cc: Lorenzo Stoakes
> > Cc: Vlastimil Babka
> > Cc: Shakeel Butt
> > Cc: linux-mm@kvack.org
> > ---
> >
> >  b/include/linux/mmap_lock.h |    2 ++
> >  b/mm/mmap_lock.c            |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 45 insertions(+)
> >
> > diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> > --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-04-29 11:18:50.633628887 -0700
> > +++ b/include/linux/mmap_lock.h	2026-04-29 11:18:50.707631737 -0700
> > @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
> >
> >  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
> >  					  unsigned long address);
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					       unsigned long address);
> >
> >  /*
> >   * Locks next vma pointed by the iterator. Confirms the locked vma has not
> > diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> > --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-04-29 11:18:50.704631622 -0700
> > +++ b/mm/mmap_lock.c	2026-04-29 11:18:50.707631737 -0700
> > @@ -340,6 +340,49 @@ inval:
> >  	return NULL;
> >  }
> >
> > +/*
> > + * Find the VMA covering 'address' and lock it for reading. Waits for writers to
> > + * finish if the VMA is being modified. Returns NULL if there is no VMA covering
> > + * 'address'.
> > + *
> > + * The fast path does not take mmap lock.
> > + */
> > +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> > +					       unsigned long address)
> > +{
> > +	struct vm_area_struct *vma;
> > +
> > +retry:
> > +	vma = lock_vma_under_rcu(mm, address);
> > +	/* Fast path: return stable VMA covering 'address': */
> > +	if (vma)
> > +		return vma;
> > +
> > +	/*
> > +	 * Slow path: the VMA covering 'address' is being modified,
> > +	 * or there is no VMA covering 'address'. Rule out the
> > +	 * possibility that the VMA is being modified:
> > +	 */
> > +	mmap_read_lock(mm);
> > +	vma = vma_lookup(mm, address);
> > +	mmap_read_unlock(mm);
> > +
> > +	/* There was for sure no VMA covering 'address': */
> > +	if (!vma)
> > +		return NULL;
> > +
> > +	/*
> > +	 * VMA was likely being modified during RCU lookup. Try again.
> > +	 * mmap_read_lock() waited for the writer to complete and the
> > +	 * writer is now done.
> > +	 *
> > +	 * There is no guarantee that any single retry will succeed,
> > +	 * and it is possible but highly unlikely this will loop
> > +	 * forever.
> > +	 */
> > +	goto retry;
> > +}
>
> Hmm yeah, this is not ideal :)
>
> You don't have to do any of this - we already have logic to help out here:
> vma_start_read_locked().
>
> That uses the fact the mmap read lock is held to pin the VMA lock, because VMA
> write locks require an mmap write lock, and the mmap read lock prevents them.
>
> That way you can eliminate the retry.
>
> So instead:
>
> /*
>  * Tries to lock under RCU; failing that, it acquires the VMA lock with the
>  * mmap read lock held.
>  */
> struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
> 					       unsigned long address)
> {
> 	struct vm_area_struct *vma;
>
> 	might_sleep();
>
> 	vma = lock_vma_under_rcu(mm, address);
> 	if (vma)
> 		return vma;
>
> 	/* Slow path: preclude VMA writers by getting mmap read lock. */
> 	guard(rwsem_read)(&mm->mmap_lock);

Actually we'd possibly want the killable version of this? So a fatal signal
can interrupt and not indefinitely block others.
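
Something like the below, i.e. (equally untested - and as far as I know there
is no killable variant of the rwsem guard, so the unlock would then have to be
done by hand on each return path):

	/* Slow path: preclude VMA writers by getting mmap read lock. */
	if (mmap_read_lock_killable(mm))
		return NULL;

Though then callers couldn't tell 'no VMA covers the address' apart from
'fatal signal pending' - not sure whether that matters for the intended users.
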
> 	vma = vma_lookup(mm, address);
> 	/* VMA isn't there. */
> 	if (!vma)
> 		return NULL;
>
> 	return vma_start_read_locked(vma);

Sorry, this should be:

	if (!vma_start_read_locked(vma))
		return NULL;

	return vma;

But as per below, I'm not sure we will actually see vma_start_read_locked()
fail - it doesn't touch the write side sequence number, so overflow shouldn't
be a problem.

> }
>
> (Untested, not even build tested code)
>
> Not sure if we can use the linux/cleanup.h guard here, because mmap_read_lock()
> also does some trace stuff, but the guard makes it WAY nicer (when it goes
> out of scope the mmap read lock is dropped).
>
> Maybe we could add a custom mmap lock guard to cover that too?
>
> Suren - I actually wonder if vma_start_read_locked() actually needs to return
> a boolean? The failure cases for __refcount_inc_not_zero_limited_acquire()
> are: detached, or excluding readers on write/detach.
>
> But in both of those cases, vma_lookup() would surely not find the VMA, and
> since we're precluding writers (so no write lock, no detach possible) that
> means we should never hit it?
>
> Wonder if we should make it a void and add a VM_WARN_ON_ONCE()?
>
> > +
> >  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
> > 						     struct vma_iterator *vmi,
> > 						     unsigned long from_addr)
> > _

Cheers, Lorenzo
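
P.S. on the custom mmap lock guard - I was thinking of something along the
lines of (completely untested, and the name is just a placeholder):

	DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
		     mmap_read_lock(_T), mmap_read_unlock(_T))

Since that goes through mmap_read_lock() itself, it'd pick up the trace stuff
for free, and the slow path above would then just be:

	guard(mmap_read_lock)(mm);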