Date: Fri, 8 May 2026 18:26:51 +0100
From: Lorenzo Stoakes
To: Dave Hansen
Cc: linux-kernel@vger.kernel.org, Andrew Morton, "Liam R. Howlett",
	linux-mm@kvack.org, Shakeel Butt, Suren Baghdasaryan,
	Vlastimil Babka
Subject: Re: [PATCH 3/6] mm: Add RCU-based VMA lookup that waits for writers
References: <20260429181954.F50224AE@davehans-spike.ostc.intel.com>
	<20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>
In-Reply-To: <20260429181959.BC9DABC5@davehans-spike.ostc.intel.com>

On Wed, Apr 29, 2026 at 11:19:59AM -0700, Dave Hansen wrote:
>
> From: Dave Hansen
>
> == Background ==
>
> There are basically two parallel ways to look up a VMA: the
> traditional way, which is protected by mmap_lock, and the RCU-based
> per-VMA lock way, which uses RCU and refcounts.
>
> == Problems ==
>
> The mmap_lock one is more straightforward to use, but it has a big
> disadvantage: it cannot be mixed with page faults, since those can
> take mmap_lock for read.
>
> The RCU one can be mixed with faults, but it is not available in all
> configs, so all RCU users need to be able to fall back to the
> traditional way.
>
> == Solution ==
>
> Add a variant of the RCU-based lookup that waits for writers. This is
> basically the same as the existing RCU-based lookup, but it also takes
> mmap_lock for read and waits for writers to finish before returning
> the VMA. This has two big advantages:
>
> 1. Callers do not need to have a fallback path for when they
>    collide with writers.
> 2. It can be used in contexts where page faults can happen because
>    it can take the mmap_lock for read but never *holds* it.
>
> == Discussion ==
>
> I am not married to the naming here at all. Naming suggestions would
> be much appreciated.
>
> This basically uses mmap_lock to wait for writers, nothing else. The
> VMA is obviously stable under mmap_read_lock() and the code _can_
> likely take advantage of that and possibly even remove the goto. For
> instance, it could (probably) bump the VMA refcount and exclude future
> writers. That would eliminate the goto.
>
> But the approach as-is is probably the smallest line count and
> arguably the simplest approach. It is a good place to start a
> conversation if nothing else.
>
> Signed-off-by: Dave Hansen
> Cc: Suren Baghdasaryan
> Cc: Andrew Morton
> Cc: "Liam R. Howlett"
> Cc: Lorenzo Stoakes
> Cc: Vlastimil Babka
> Cc: Shakeel Butt
> Cc: linux-mm@kvack.org
> ---
>
>  b/include/linux/mmap_lock.h |  2 ++
>  b/mm/mmap_lock.c            | 43 +++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 45 insertions(+)
>
> diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h
> --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait	2026-04-29 11:18:50.633628887 -0700
> +++ b/include/linux/mmap_lock.h	2026-04-29 11:18:50.707631737 -0700
> @@ -470,6 +470,8 @@ static inline void vma_mark_detached(str
>
>  struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
>  					  unsigned long address);
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> +					       unsigned long address);
>
>  /*
>   * Locks next vma pointed by the iterator. Confirms the locked vma has not
> diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c
> --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait	2026-04-29 11:18:50.704631622 -0700
> +++ b/mm/mmap_lock.c	2026-04-29 11:18:50.707631737 -0700
> @@ -340,6 +340,49 @@ inval:
>  	return NULL;
>  }
>
> +/*
> + * Find the VMA covering 'address' and lock it for reading. Waits for writers
> + * to finish if the VMA is being modified. Returns NULL if there is no VMA
> + * covering 'address'.
> + *
> + * The fast path does not take the mmap lock.
> + */
> +struct vm_area_struct *lock_vma_under_rcu_wait(struct mm_struct *mm,
> +					       unsigned long address)
> +{
> +	struct vm_area_struct *vma;
> +
> +retry:
> +	vma = lock_vma_under_rcu(mm, address);
> +	/* Fast path: return stable VMA covering 'address': */
> +	if (vma)
> +		return vma;
> +
> +	/*
> +	 * Slow path: either the VMA covering 'address' is being
> +	 * modified or there is no VMA covering 'address'. Rule out
> +	 * the possibility that the VMA is being modified:
> +	 */
> +	mmap_read_lock(mm);
> +	vma = vma_lookup(mm, address);
> +	mmap_read_unlock(mm);
> +
> +	/* There was for sure no VMA covering 'address': */
> +	if (!vma)
> +		return NULL;
> +
> +	/*
> +	 * The VMA was likely being modified during the RCU lookup. Try
> +	 * again. mmap_read_lock() waited for the writer to complete and
> +	 * the writer is now done.
> +	 *
> +	 * There is no guarantee that any single retry will succeed,
> +	 * and it is possible but highly unlikely this will loop
> +	 * forever.
> +	 */
> +	goto retry;
> +}

Hmm, yeah, this is not ideal :) You don't have to do any of this - we
already have logic to help out here: vma_start_read_locked(). That uses
the fact that the mmap read lock is held to pin the VMA lock, because
VMA write locks require the mmap write lock, and the mmap read lock
prevents them. That way you can eliminate the retry.

So instead:

/*
 * Tries to lock under RCU; failing that, it acquires the VMA lock with
 * the mmap read lock held.
 */
struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm,
					       unsigned long address)
{
	struct vm_area_struct *vma;

	might_sleep();

	vma = lock_vma_under_rcu(mm, address);
	if (vma)
		return vma;

	/* Slow path: preclude VMA writers by getting mmap read lock. */
	guard(rwsem_read)(&mm->mmap_lock);

	vma = vma_lookup(mm, address);
	/* VMA isn't there. */
	if (!vma)
		return NULL;

	/* Read lock is held, so the VMA cannot be write-locked or detached. */
	if (!vma_start_read_locked(vma))
		return NULL;

	return vma;
}

(Untested, not even build tested code - there's a sketch of how a
caller might use this in the P.S. below.)

Not sure if we can use the linux/cleanup.h guard here, because
mmap_read_lock() also does some trace stuff, but the guard makes it WAY
nicer (when it goes out of scope the mmap read lock is dropped). Maybe
we could add a custom mmap lock guard to cover that too? See the P.P.S.
for a sketch of what I mean.

Suren - I wonder if vma_start_read_locked() actually needs to return a
boolean? The failure cases for __refcount_inc_not_zero_limited_acquire()
are: detached, or excluding readers on write/detach. But in both of
those cases vma_lookup() would surely not find the VMA, and since we're
precluding writers (so no write lock, no detach possible), we should
never hit them. Wonder if we should make it void and add a
VM_WARN_ON_ONCE()? See the P.P.P.S. for what I mean.

> +
>  static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm,
>  							     struct vma_iterator *vmi,
>  							     unsigned long from_addr)
> _

Cheers,
Lorenzo
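
P.S. Roughly how I'd expect a caller to use the vma_start_read_unlocked()
suggestion above. This is hypothetical caller code - the function name
and error handling are made up purely for illustration:

/* Hypothetical caller: look up 'addr' without ever *holding* mmap_lock. */
static int frob_vma_at(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;

	vma = vma_start_read_unlocked(mm, addr);
	/* NULL now really does mean "no VMA covers addr". */
	if (!vma)
		return -EFAULT;

	/* ... use the VMA; faults in this mm can proceed concurrently ... */

	/* Drop the per-VMA read lock taken by the lookup. */
	vma_end_read(vma);
	return 0;
}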
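
P.P.S. On the custom mmap lock guard: a minimal sketch of what I mean,
assuming the stock DEFINE_GUARD() helper from linux/cleanup.h. Going
through the mmap_read_lock()/mmap_read_unlock() wrappers keeps whatever
tracing they do, unlike raw guard(rwsem_read):

/* Hypothetical guard definition, e.g. in include/linux/mmap_lock.h: */
DEFINE_GUARD(mmap_read_lock, struct mm_struct *,
	     mmap_read_lock(_T), mmap_read_unlock(_T))

Then the slow path above would become:

	/* The mmap read lock is dropped when this goes out of scope: */
	guard(mmap_read_lock)(mm);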
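
P.P.P.S. On the vma_start_read_locked() return value, something along
these lines is what I had in mind - __vma_start_read_locked() is a
hypothetical name for today's bool-returning implementation, and this
only works if the "can't fail" reasoning above actually holds:

/*
 * With the mmap read lock held there is no write lock and no detach,
 * so the refcount acquisition cannot fail - warn rather than making
 * every caller handle an impossible error.
 */
static void vma_start_read_locked(struct vm_area_struct *vma)
{
	/* Hypothetical rename of the current bool-returning helper: */
	VM_WARN_ON_ONCE(!__vma_start_read_locked(vma));
}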