From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>,
Matthew Wilcox <willy@infradead.org>, Guo Ren <guoren@kernel.org>,
Thomas Bogendoerfer <tsbogend@alpha.franken.de>,
Heiko Carstens <hca@linux.ibm.com>,
Vasily Gorbik <gor@linux.ibm.com>,
Alexander Gordeev <agordeev@linux.ibm.com>,
Christian Borntraeger <borntraeger@linux.ibm.com>,
Sven Schnelle <svens@linux.ibm.com>,
"David S . Miller" <davem@davemloft.net>,
Andreas Larsson <andreas@gaisler.com>,
Arnd Bergmann <arnd@arndb.de>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Dan Williams <dan.j.williams@intel.com>,
Vishal Verma <vishal.l.verma@intel.com>,
Dave Jiang <dave.jiang@intel.com>,
Nicolas Pitre <nico@fluxnic.net>,
Muchun Song <muchun.song@linux.dev>,
Oscar Salvador <osalvador@suse.de>,
Konstantin Komarov <almaz.alexandrovich@paragon-software.com>,
Baoquan He <bhe@redhat.com>, Vivek Goyal <vgoyal@redhat.com>,
Dave Young <dyoung@redhat.com>, Tony Luck <tony.luck@intel.com>,
Reinette Chatre <reinette.chatre@intel.com>,
Dave Martin <Dave.Martin@arm.com>,
James Morse <james.morse@arm.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
"Liam R . Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Uladzislau Rezki <urezki@gmail.com>,
Dmitry Vyukov <dvyukov@google.com>,
Andrey Konovalov <andreyknvl@gmail.com>,
Jann Horn <jannh@google.com>, Pedro Falcato <pfalcato@suse.de>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org, linux-csky@vger.kernel.org,
linux-mips@vger.kernel.org, linux-s390@vger.kernel.org,
sparclinux@vger.kernel.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-mm@kvack.org,
ntfs3@lists.linux.dev, kexec@lists.infradead.org,
kasan-dev@googlegroups.com, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
Date: Tue, 9 Sep 2025 10:13:45 +0100
Message-ID: <c04357f9-795e-4a5d-b762-f140e3d413d8@lucifer.local>
In-Reply-To: <ad69e837-b5c7-4e2d-a268-c63c9b4095cf@redhat.com>

On Mon, Sep 08, 2025 at 05:27:37PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > We have introduced the f_op->mmap_prepare hook to allow for setting up a
> > VMA far earlier in the process of mapping memory, reducing problematic
> > error handling paths, but this does not provide what all
> > drivers/filesystems need.
> >
> > In order to supply this, and to be able to move forward with removing
> > f_op->mmap altogether, introduce f_op->mmap_complete.
> >
> > This hook is called once the VMA is fully mapped and everything is done,
> > however with the mmap write lock and VMA write locks held.
> >
> > The hook is then provided with a fully initialised VMA which it can do what
> > it needs with, though the mmap and VMA write locks must remain held
> > throughout.
> >
> > It is not intended that the VMA be modified at this point, attempts to do
> > so will end in tears.
> >
> > This allows for operations such as pre-population typically via a remap, or
> > really anything that requires access to the VMA once initialised.
> >
> > In addition, a caller may need to take a lock in mmap_prepare, when it is
> > possible to modify the VMA, and release it on mmap_complete. In order to
> > handle errors which may arise between the two operations, f_op->mmap_abort
> > is provided.
> >
> > This hook should be used to drop any lock and clean up anything before the
> > VMA mapping operation is aborted. After this point the VMA will not be
> > added to any mapping and will not exist.
> >
> > We also add a new mmap_context field to the vm_area_desc type which can be
> > used to pass information pertinent to any locks which are held or any state
> > which is required for mmap_complete, abort to operate correctly.
> >
> > We also update the compatibility layer for nested filesystems which
> > currently still only specify an f_op->mmap() handler so that it correctly
> > invokes f_op->mmap_complete as necessary (note that no error can occur
> > between mmap_prepare and mmap_complete so mmap_abort will never be called
> > in this case).
> >
> > Also update the VMA tests to account for the changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> > include/linux/fs.h | 4 ++
> > include/linux/mm_types.h | 5 ++
> > mm/util.c | 18 +++++--
> > mm/vma.c | 82 ++++++++++++++++++++++++++++++--
> > tools/testing/vma/vma_internal.h | 31 ++++++++++--
> > 5 files changed, 129 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 594bd4d0521e..bb432924993a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2195,6 +2195,10 @@ struct file_operations {
> > int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> > unsigned int poll_flags);
> > int (*mmap_prepare)(struct vm_area_desc *);
> > + int (*mmap_complete)(struct file *, struct vm_area_struct *,
> > + const void *context);
> > + void (*mmap_abort)(const struct file *, const void *vm_private_data,
> > + const void *context);
>
> Do we have a description somewhere what these things do, when they are
> called, and what a driver may be allowed to do with a VMA?

Yeah, there's a doc patch that follows this (07/16).

>
> In particular, the mmap_complete() looks like another candidate for letting
> a driver just go crazy on the vma? :)

Well there's only so much we can do. In an ideal world we'd treat VMAs as
entirely internal data structures and pass some sort of opaque thing around, but
we have to keep things real here :)

So the main purpose of these changes is not so much to be as ambitious as
_that_, but to only provide the VMA _when it's safe to do so_.

Before, we were providing a pointer to an incompletely-initialised VMA that was
not yet in the maple tree, with which the driver could do _anything_, and then
afterwards we'd have:

a. a bunch of stuff left to do with a VMA that might be in some broken state due
   to drivers.

b. (the really bad case) error paths to handle because the driver returned
   an error, but did who-knows-what with the VMA and page tables.

So we address this by:

1. mmap_prepare being done _super early_ and _not_ providing a VMA. We
   essentially ask the driver 'hey what do you want these fields that you are
   allowed to change in the VMA to be?'

2. mmap_complete being done _super_ late, essentially just before we release the
   VMA/mmap locks. If an error arises, we can just unmap it, easy. And then
   there's a lot less damage the driver can do.

I think it's probably the most sensible means of doing something about the
legacy we have where we've been rather too 'free and easy' with allowing drivers
to do whatever.

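To make the ordering concrete, here is a rough user-space sketch of the
prepare/complete/abort flow described above. It is a toy model only, not the
kernel code from this series: every toy_* and drv_* name is made up for
illustration, the "lock" is just a counter, and the assumption that abort is
only invoked for failures arising between prepare and complete is taken from
the commit message rather than from the implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real types. */
struct toy_desc {               /* models the fields a driver may set up front */
	unsigned long flags;
	void *private_data;
	void *mmap_context;     /* state handed from prepare to complete/abort */
};

struct toy_vma {                /* models a fully initialised VMA */
	unsigned long flags;
	void *private_data;
	bool mapped;
};

struct toy_fops {
	int  (*mmap_prepare)(struct toy_desc *desc);
	int  (*mmap_complete)(struct toy_vma *vma, const void *context);
	void (*mmap_abort)(const void *private_data, const void *context);
};

/* Models the core flow; fail_midway simulates an error between the hooks. */
static int toy_mmap(const struct toy_fops *fops, struct toy_vma *vma,
		    bool fail_midway)
{
	struct toy_desc desc = {0};
	int err;

	/* Step 1: super early, no VMA exists yet, driver only fills in desc. */
	err = fops->mmap_prepare(&desc);
	if (err)
		return err;

	if (fail_midway) {
		/* The VMA will never exist: let the driver drop its state. */
		if (fops->mmap_abort)
			fops->mmap_abort(desc.private_data, desc.mmap_context);
		return -ENOMEM;
	}

	/* The core, not the driver, builds the VMA from the description. */
	vma->flags = desc.flags;
	vma->private_data = desc.private_data;
	vma->mapped = true;

	/* Step 2: super late, the driver sees a fully initialised VMA. */
	err = fops->mmap_complete(vma, desc.mmap_context);
	if (err)
		vma->mapped = false;    /* on error we can simply unmap */
	return err;
}

/* Hypothetical driver: takes a "lock" in prepare, drops it afterwards. */
static int toy_lock_depth;

static int drv_prepare(struct toy_desc *desc)
{
	toy_lock_depth++;
	desc->flags = 0x1;
	desc->mmap_context = &toy_lock_depth;
	return 0;
}

static int drv_complete(struct toy_vma *vma, const void *context)
{
	(void)vma;
	assert(context == &toy_lock_depth);
	toy_lock_depth--;
	return 0;
}

static void drv_abort(const void *private_data, const void *context)
{
	(void)private_data;
	(void)context;
	toy_lock_depth--;
}

static const struct toy_fops drv_fops = {
	.mmap_prepare  = drv_prepare,
	.mmap_complete = drv_complete,
	.mmap_abort    = drv_abort,
};
```

Calling toy_mmap(&drv_fops, &vma, false) models the success path, and passing
true models a failure between the two hooks. In both cases the toy lock ends
up released, and the driver never gets to touch a half-built VMA.
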
>
> --
> Cheers
>
> David / dhildenb
>
Cheers, Lorenzo