From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BD5581D61B7; Fri, 12 Jun 2026 18:00:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781287232; cv=none; b=U0R0hfvpki4pwoJgr0g5AExXbSTiT6+HaU+wapdNkzPCnl1yhe1kX0j4GLwmLZ0KibcepkmEME44aVI0fwUAm85u5AIDeakyi2mNGHuk4qX4KHbLRFta6t57VikRbC86P9cLoh4naqoFKlirHAFaM5TW2G88HO4OKxImJk+weBo= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781287232; c=relaxed/simple; bh=ISE/uWZwFotOgSsgIOI+zqkcLeeaEUNMCG7GjfD8ODI=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=NOb+RkXpc245WMGBH+aR/FpKK5PiE2MX2JZRcnlymYFFCFmz9kz4lo+rG9UCgFGQoGnvK2Apgxe7DoRt5MLhLXlZFSYZWAFVKMiqmRgRQbF3gnA9UK06vASvqdtOrsQHwzyuPyvaHeBdOjV95nnU1Ia3hItP7+ESfK/TuvnpTbs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=i7U8bB7R; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="i7U8bB7R" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0C8D71F00A3A; Fri, 12 Jun 2026 18:00:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781287230; bh=0noTcRNyUy73IbfdopzjHumsVBr2v5V38UCY07mEA6M=; h=Date:Subject:To:Cc:References:From:In-Reply-To; b=i7U8bB7REAw8r3svcGM5BIjC9+ZJ0BEJiSBO8FPOH/RJCKtQZPQGy8IeHZKcsXZSe pYye9f4LkCVCombP2E/jL9LSyZNbSUM4OO3BZ9dDka3KXfn3uJ9GVEFvHesX6AP9bT +iGFN+FxBOBCZUrHfb7jZh5esiFywJRo/WVimkm7021//8D3aQ3rdlK41ZHEOP4iK9 GT479ioDW9bjYtEHIstlQ++uV0zVmSRaCTAK3mx2n1o7MKhEzM2jJq4pQMAsvnt+zZ UlBSk+9vfx5lfLJSdFBh5fhkMRSJbTk2HTwL5F19M4RQXRmYTyAo9Q7LbZGIgMmuOK 7OO+drbPJcJwA== Message-ID: <5b27d22d-a3fd-4267-b63b-5de9e6586d5e@kernel.org> Date: Fri, 12 Jun 2026 20:00:25 +0200 Precedence: bulk X-Mailing-List: netdev@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 3/5] mm: Add RCU-based VMA lookup helper that waits for writers Content-Language: en-US To: Dave Hansen , linux-kernel@vger.kernel.org Cc: Alice Ryhl , Andrew Morton , =?UTF-8?B?QXJ2ZSBIasO4bm5ldsOl?= =?UTF-8?Q?g?= , Carlos Llamas , Christian Brauner , David Ahern , "David S. Miller" , Greg Kroah-Hartman , "Liam R. Howlett" , linux-mm@kvack.org, Lorenzo Stoakes , netdev@vger.kernel.org, Shakeel Butt , Suren Baghdasaryan , Todd Kjos References: <20260610230409.A44D29FA@davehans-spike.ostc.intel.com> <20260610230415.C0521C88@davehans-spike.ostc.intel.com> From: "Vlastimil Babka (SUSE)" Autocrypt: addr=vbabka@kernel.org; keydata= xsFNBFZdmxYBEADsw/SiUSjB0dM+vSh95UkgcHjzEVBlby/Fg+g42O7LAEkCYXi/vvq31JTB KxRWDHX0R2tgpFDXHnzZcQywawu8eSq0LxzxFNYMvtB7sV1pxYwej2qx9B75qW2plBs+7+YB 87tMFA+u+L4Z5xAzIimfLD5EKC56kJ1CsXlM8S/LHcmdD9Ctkn3trYDNnat0eoAcfPIP2OZ+ 9oe9IF/R28zmh0ifLXyJQQz5ofdj4bPf8ecEW0rhcqHfTD8k4yK0xxt3xW+6Exqp9n9bydiy tcSAw/TahjW6yrA+6JhSBv1v2tIm+itQc073zjSX8OFL51qQVzRFr7H2UQG33lw2QrvHRXqD Ot7ViKam7v0Ho9wEWiQOOZlHItOOXFphWb2yq3nzrKe45oWoSgkxKb97MVsQ+q2SYjJRBBH4 8qKhphADYxkIP6yut/eaj9ImvRUZZRi0DTc8xfnvHGTjKbJzC2xpFcY0DQbZzuwsIZ8OPJCc LM4S7mT25NE5kUTG/TKQCk922vRdGVMoLA7dIQrgXnRXtyT61sg8PG4wcfOnuWf8577aXP1x 6mzw3/jh3F+oSBHb/GcLC7mvWreJifUL2gEdssGfXhGWBo6zLS3qhgtwjay0Jl+kza1lo+Cv BB2T79D4WGdDuVa4eOrQ02TxqGN7G0Biz5ZLRSFzQSQwLn8fbwARAQABzSNWbGFzdGltaWwg QmFia2EgPHZiYWJrYUBrZXJuZWwub3JnPsLBsAQTAQoAWhYhBKlA1DSZLC6OmRA9UCJPp+fM gqZkBQJqFFy6GxSAAAAAAAQADm1hbnUyLDIuNSsxLjEyLDIsMgIbAwUJGtCBUAULCQgHAwUV CgkICwUWAgMBAAIeBQIXgAAKCRAiT6fnzIKmZJIUEADFx/tREzUImHrEwVHeSvDFmA7tJysI UVrlvrM09E7GIuzphzv7jYmo8n3ANpCczLEVr4G0syYQdTigaZgv3+FQDIIzhKih1IHhu1Ei XHlywNWKnQxxQEUNi5Mwx43wQz5XVw9F1A7gtKBKNtfogO511hAbrzagrYajyQacEJ/+sfhZ 9Da8ltHIXD8pcYaHUfQgEusCgmEd9+KrUwrTbckFKmYq5chuE6yJ4J0EmWknL096jIE6CnzF FRslQ3B1UKDjxVsm1ZHfir5NeWszLkTvGFsddFaWTgh8UycESG6VQzKXjjewXu2pG7YQYRpj QKm1W5X2TkwWkXRBZTmfmbhxIUMh3+zf5wQ463rSmDN/8v81tdqBtAW6rH/kzg1GvkaTHXn0 507yEHFzBksk2viAuIxxr7km8+/KARYLIdGtx30EG8cKzAUZOK6WqxtNCsXUJNrVE8CWrCaD icoNu7Fs1c5hmPHdSTnU48ce67449DdnO4neLSNhRiGlMHJgfJUmgrxu/hcYeOZ3haWmEQ2w uW1Mh01OHi8QZHCEyAbABrPs9GUgccc/4eYXX9hIgxfSkYzn8f+8NuIFPWl/0uTvjgqU29FQ SbzOLxHq9439Ox40G5mS5eZXRGxITYR+6TXvRGI6P/264jvflnr/pDGUttaikU+0W+1uxgKH cmYbEc7ATQRbGTU1AQgAn0H6UrFiWcovkh6EXVcl+SeqyO6JHOPm+e9Wu0Vw+VIUvXZVUVVQ La1PQDUi6j00ChlcR66g9/V0sPIcSutacPKfdKYOBvzd4rlhL8rfrdEsQw5ApZxrA8kYZVMh FmBRKAa6wos25moTlMKpCWzTH84+WO5+ziCTsTUZASAToz3RdunTD+vQcHj0GqNTPAHK63sf bAB2I0BslZkXkY1RLb/YhuA6E7JyEd2pilZOrIuBGl/5q2qSakgnAVFWFBR/DO27JuAksYnq +aH8vI0xGvwn75KqSk4UzAkDzWSmO4ZHuahKtQgZNsMYV+PGayRBX9b9zbldzopoLBdqHc4n jQARAQABwsF8BBgBCgAmAhsMFiEEqUDUNJksLo6ZED1QIk+n58yCpmQFAmfIHFQFCRYU6J8A CgkQIk+n58yCpmS2PA//bqN1LfcotmArgElsa+0EGZSQlYgK48pm8WAeTXTngudP9IJ4SuKY HR5RNjHcBeqN+Me0zxRqYzRb8nGanHEkDyf4Im8DQM8d6vbyU+FcPmG4skud4kgS1zMHnlVd SXfSIwKC/hKgdHG8aBV7545Lz9X6Iohea+94wneD0aw/hqF+QWewGZhWJriWAZtvEkzNjQOi 4U9F/trLten/x7bpphDSnDMKJtITbtzATT1Dq7o7VpIUK1nCTQALMuMjKCdi8OdU/+V+R3O4 0PXWvX8qrvqYapVbZ+9KqT74FsuB0Ya9uXwgBF2Q6cRuETZk5vqaqKxzqoQZCO8AOz/58j6O 2RHNy/mZEN+7tJ5Tsq42zVJ4jxsT8b9YplavCMsnBgDeRWhcbYhCyttoL7nYISyWg4kQYZ/P wIV3OuNv2f8iKYsxNsRuClOAF82+gvqOy1/1pprFjy8uo2pkoOrb63aOP3vO5VHnRKgra6dq NcaZ+c6J4H+nEJGi2SkHAUJz5oBzuThvPudLvPA/SK8sKoM01IRxSihev/S/5WLazXB1PGem OCbvzC1IjWJJraxiDJ5IygokapUa2RP7+WBR22skQ3SSl6G107QgWKSyTOGWEaRmV53vxQLV jXuCmzSSasTL60zq5yGrT4/DYQVSNEUiUbG4pYekxJujNeEDkUlky0Y= In-Reply-To: <20260610230415.C0521C88@davehans-spike.ostc.intel.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On 6/11/26 01:04, Dave Hansen wrote: > From: Dave Hansen > > == Background == > > There are basically two parallel ways to look up a VMA: the > traditional way, which is protected by mmap_lock, and the RCU-based > per-VMA lock way which is based on RCU and refcounts. > > == Problem == > > The mmap_lock one is more straightforward to use but it has a big > disadvantage in that it can not be mixed with page faults since those > can take mmap_lock for read, which can deadlock when mixed with page > faults. For example: ... mixed with nested page faults and parallel writers, perhaps? > > mmap_read_lock(mm); > // Another thread does mmap_write_lock(). > // New mmap_lock readers are blocked. > vma = vma_lookup(mm, address); > // This deadlocks on mmap_read_lock() if it faults: > copy_from_user(address); > mmap_read_unlock(mm); > > The RCU one can be mixed with faults, but it is not available in all I'd stick to the per-VMA lock term than "RCU one" > configs, so all RCU users need to be able to fall back to the > traditional way. This is now an obsolete statement as patch 1 makes them available? But the problem is that they can fail and using mmap read lock as a fallback in a simple way has the above issue? > == Solution == > > Add a variant of the RCU-based lookup that waits for writers. This is > basically the same as the existing RCU-based lookup, but it also takes > mmap_lock for read and waits for writers to finish before returning > the VMA. This has some advantages: I would stress the part that the mmap lock is taken *only temporarily* to wait for the writers and ensure we obtain a per-vma read lock, and then dropped again? As that's the main trick IIUC. > 1. Callers do not need to have a fallback path for when they > collide with writers. > 2. It can be used in contexts where page faults can happen because > it can take the mmap_lock for read but never *holds* it. > 3. Its fast path does not require taking mmap_lock for read. > > Basically, when applied correctly, this approach results in faster > *and* simpler code. > > Signed-off-by: Dave Hansen > Cc: Suren Baghdasaryan > Cc: Andrew Morton > Cc: "Liam R. Howlett" > Cc: Lorenzo Stoakes > Cc: Vlastimil Babka > Cc: Shakeel Butt > Cc: linux-mm@kvack.org > Cc: Greg Kroah-Hartman > Cc: Arve Hjønnevåg > Cc: Todd Kjos > Cc: Christian Brauner > Cc: Carlos Llamas > Cc: Alice Ryhl > Cc: "David S. Miller" > Cc: David Ahern > Cc: netdev@vger.kernel.org > > -- > > Changes from v1: > * Add a comment explaining that this can not be mixed with other > per-VMA lock or mmap_lock users. It is prone to deadlocks if so. > * Add a FIXME about making the mmap_read_lock() killable I don't see it anywhere? > * Add more chaneglog bits about the possibility for an infinite goto > loop. > * Adopt vma_start_read_unlocked() implementation from Lorenzo > --- > > b/include/linux/mmap_lock.h | 3 +++ > b/mm/mmap_lock.c | 27 +++++++++++++++++++++++++++ > 2 files changed, 30 insertions(+) > > diff -puN include/linux/mmap_lock.h~lock-vma-under-rcu-wait include/linux/mmap_lock.h > --- a/include/linux/mmap_lock.h~lock-vma-under-rcu-wait 2026-06-10 15:57:55.828431712 -0700 > +++ b/include/linux/mmap_lock.h 2026-06-10 15:57:55.834431925 -0700 > @@ -257,6 +257,9 @@ static inline bool vma_start_read_locked > return vma_start_read_locked_nested(vma, 0); > } > > +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm, > + unsigned long address); > + > static inline void vma_end_read(struct vm_area_struct *vma) > { > vma_refcount_put(vma); > diff -puN mm/mmap_lock.c~lock-vma-under-rcu-wait mm/mmap_lock.c > --- a/mm/mmap_lock.c~lock-vma-under-rcu-wait 2026-06-10 15:57:55.831431819 -0700 > +++ b/mm/mmap_lock.c 2026-06-10 16:02:50.723860779 -0700 > @@ -338,6 +338,33 @@ inval: > return NULL; > } > > +/* > + * Find the VMA covering 'address' and lock it for reading. Waits for writers to > + * finish if the VMA is being modified. Returns NULL if there is no VMA covering > + * 'address'. > + * > + * Use only in code paths where no mmap_lock and no VMA lock is held. I think we have various asserts that could be used and are stronger than a comment ;) > + * > + * The fast path does not take mmap_lock. > + */ > +struct vm_area_struct *vma_start_read_unlocked(struct mm_struct *mm, > + unsigned long address) > +{ > + struct vm_area_struct *vma; > + > + /* Fast path: return stable VMA covering 'address': */ > + vma = lock_vma_under_rcu(mm, address); > + if (vma) > + return vma; > + > + /* Slow path: preclude VMA writers by getting mmap read lock. */ Again I would say "temporarily". > + guard(rwsem_read)(&mm->mmap_lock); Aside from the missing vma_lookup() I'm not sure we should also trust the result of the lookup blindly? Should we also verify we found a vma? Some callers might not fail the lookup because they will only lookup something that's sure to be present, but some might fail? > + if (!vma_start_read_locked(vma)) > + return NULL; You can count me on the side that would rather see explicit operations than the guard. Exactly because it's a subtle usage of the mmap sem, and yeah also the tracing that Suren pointed out. Seems to me uffd_lock_vma() mostly does all this right (but also does more stuff that we don't want to do). I'm just not sure right know when vma_start_read_locked() failures can happen in practice here (can it be only recfount overflow or also refcount being zero? hopefully not zero if we found the vma under mmap lock for read? comment in lock_next_vma_under_mmap_lock() seems to hint at that) and what to do about them. We seem to have some unhelpfully stale comments around. - uffd_lock_vma() doesn't document that it can return -EAGAIN. (and should the caller then retry or what?) - vma_start_read_locked() has a comment saying how it cannot fail, but it in fact can. > + > + return vma; > +} > + > static struct vm_area_struct *lock_next_vma_under_mmap_lock(struct mm_struct *mm, > struct vma_iterator *vmi, > unsigned long from_addr) > _