linux-csky.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/16] expand mmap_prepare functionality, port more users
@ 2025-09-08 11:10 Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
                   ` (17 more replies)
  0 siblings, 18 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
callback"), the f_op->mmap hook has been deprecated in favour of
f_op->mmap_prepare.

This was introduced in order to make it possible for us to eventually
eliminate the f_op->mmap hook, which is highly problematic as it gives
drivers and filesystems raw access to a VMA that is not yet correctly
initialised.

This hook also introduces complexity for the memory mapping operation, as
we must correctly unwind what we do should an error arise.

Overall, this interface being so open has caused significant problems for
us, including security issues, so it is important that we simply eliminate
it as a source of problems.

Therefore this series continues that work, extending the functionality
further to permit more drivers and filesystems to use mmap_prepare.

After updating some areas that can simply use mmap_prepare as-is, and
performing some housekeeping, we then introduce two new hooks:

f_op->mmap_complete - this is invoked at the point of the VMA having been
correctly inserted, though with the VMA write lock still held. mmap_prepare
must also be specified.

This expands the use of mmap_prepare to those callers which need to
prepopulate mappings, as well as any which genuinely require access to the
VMA.

It's simple - we let the caller access the VMA, but only once it's
established. At that point, unwinding is simple - we just unmap the VMA.

The VMA is also fully initialised by this stage, so no issues can arise
from a not-fully-initialised VMA.

The other newly added hook is:

f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
mmap_complete. This is called should an error arise between mmap_prepare
and mmap_complete (not as a result of mmap_prepare but rather some other
part of the mapping logic).

This is required in case mmap_prepare wishes to establish state or locks
which need to be cleaned up on completion. If we did not provide this, such
usage could not be permitted, as the cleanup would otherwise not occur
should the mapping fail between the two calls.

We then add split remap_pfn_range*() functions which allow a PFN remap (a
typical mapping prepopulation operation) to be split between prepare and
complete steps, as well as io_remap_pfn_range_prepare, complete for a
similar purpose.

From there we update various mm-adjacent logic to use this functionality as
a first set of changes, as well as the resctrl and cramfs filesystems to
round off the non-stacked filesystem instances.


REVIEWER NOTE:
~~~~~~~~~~~~~~

I considered putting the complete and abort callbacks in vm_ops, however
this won't work because we would then be unable to adjust helpers like
generic_file_mmap_prepare() (which provides vm_ops) to supply the correct
complete and abort callbacks.

Conceptually it also makes more sense to have these in f_op as they are
one-off operations performed at mmap time to establish the VMA, rather than
a property of the VMA itself.

Lorenzo Stoakes (16):
  mm/shmem: update shmem to use mmap_prepare
  device/dax: update devdax to use mmap_prepare
  mm: add vma_desc_size(), vma_desc_pages() helpers
  relay: update relay to use mmap_prepare
  mm/vma: rename mmap internal functions to avoid confusion
  mm: introduce the f_op->mmap_complete, mmap_abort hooks
  doc: update porting, vfs documentation for mmap_[complete, abort]
  mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  mm: introduce io_remap_pfn_range_prepare, complete
  mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  mm: update mem char driver to use mmap_prepare, mmap_complete
  mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  mm: update cramfs to use mmap_prepare, mmap_complete
  fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
  fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
  kcov: update kcov to use mmap_prepare, mmap_complete

 Documentation/filesystems/porting.rst |   9 ++
 Documentation/filesystems/vfs.rst     |  35 +++++++
 arch/csky/include/asm/pgtable.h       |   5 +
 arch/mips/alchemy/common/setup.c      |  28 +++++-
 arch/mips/include/asm/pgtable.h       |  10 ++
 arch/s390/kernel/crash_dump.c         |   6 +-
 arch/sparc/include/asm/pgtable_32.h   |  29 +++++-
 arch/sparc/include/asm/pgtable_64.h   |  29 +++++-
 drivers/char/mem.c                    |  80 ++++++++-------
 drivers/dax/device.c                  |  32 +++---
 fs/cramfs/inode.c                     | 134 ++++++++++++++++++--------
 fs/hugetlbfs/inode.c                  |  86 +++++++++--------
 fs/ntfs3/file.c                       |   2 +-
 fs/proc/inode.c                       |  13 ++-
 fs/proc/vmcore.c                      |  53 +++++++---
 fs/resctrl/pseudo_lock.c              |  56 ++++++++---
 include/linux/fs.h                    |   4 +
 include/linux/mm.h                    |  53 +++++++++-
 include/linux/mm_types.h              |   5 +
 include/linux/proc_fs.h               |   5 +
 include/linux/shmem_fs.h              |   3 +-
 include/linux/vmalloc.h               |  10 +-
 kernel/kcov.c                         |  40 +++++---
 kernel/relay.c                        |  32 +++---
 mm/memory.c                           | 128 +++++++++++++++---------
 mm/secretmem.c                        |   2 +-
 mm/shmem.c                            |  49 +++++++---
 mm/util.c                             |  18 +++-
 mm/vma.c                              |  96 +++++++++++++++---
 mm/vmalloc.c                          |  16 ++-
 tools/testing/vma/vma_internal.h      |  31 +++++-
 31 files changed, 810 insertions(+), 289 deletions(-)

--
2.51.0

^ permalink raw reply	[flat|nested] 84+ messages in thread

* [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 14:59   ` David Hildenbrand
  2025-09-09  3:19   ` Baolin Wang
  2025-09-08 11:10 ` [PATCH 02/16] device/dax: update devdax " Lorenzo Stoakes
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

The shmem mmap hook simply assigns vm_ops, so it is easily updated - do so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/shmem.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 29e1eb690125..cfc33b99a23a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2950,16 +2950,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
 	return retval;
 }
 
-static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int shmem_mmap_prepare(struct vm_area_desc *desc)
 {
+	struct file *file = desc->file;
 	struct inode *inode = file_inode(file);
 
 	file_accessed(file);
 	/* This is anonymous shared memory if it is unlinked at the time of mmap */
 	if (inode->i_nlink)
-		vma->vm_ops = &shmem_vm_ops;
+		desc->vm_ops = &shmem_vm_ops;
 	else
-		vma->vm_ops = &shmem_anon_vm_ops;
+		desc->vm_ops = &shmem_anon_vm_ops;
 	return 0;
 }
 
@@ -5229,7 +5230,7 @@ static const struct address_space_operations shmem_aops = {
 };
 
 static const struct file_operations shmem_file_operations = {
-	.mmap		= shmem_mmap,
+	.mmap_prepare	= shmem_mmap_prepare,
 	.open		= shmem_file_open,
 	.get_unmapped_area = shmem_get_unmapped_area,
 #ifdef CONFIG_TMPFS
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 02/16] device/dax: update devdax to use mmap_prepare
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 15:03   ` David Hildenbrand
  2025-09-08 11:10 ` [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

The devdax driver does nothing special in its f_op->mmap hook, so
straightforwardly update it to use the mmap_prepare hook instead.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 drivers/dax/device.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 2bb40a6060af..c2181439f925 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -13,8 +13,9 @@
 #include "dax-private.h"
 #include "bus.h"
 
-static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
-		const char *func)
+static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
+		       unsigned long start, unsigned long end, struct file *file,
+		       const char *func)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long mask;
@@ -23,7 +24,7 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		return -ENXIO;
 
 	/* prevent private mappings from being established */
-	if ((vma->vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
+	if ((vm_flags & VM_MAYSHARE) != VM_MAYSHARE) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, attempted private mapping\n",
 				current->comm, func);
@@ -31,15 +32,15 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	}
 
 	mask = dev_dax->align - 1;
-	if (vma->vm_start & mask || vma->vm_end & mask) {
+	if (start & mask || end & mask) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
-				current->comm, func, vma->vm_start, vma->vm_end,
+				current->comm, func, start, end,
 				mask);
 		return -EINVAL;
 	}
 
-	if (!vma_is_dax(vma)) {
+	if (!file_is_dax(file)) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, vma is not DAX capable\n",
 				current->comm, func);
@@ -49,6 +50,13 @@ static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 	return 0;
 }
 
+static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
+		     const char *func)
+{
+	return __check_vma(dev_dax, vma->vm_flags, vma->vm_start, vma->vm_end,
+			   vma->vm_file, func);
+}
+
 /* see "strong" declaration in tools/testing/nvdimm/dax-dev.c */
 __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
@@ -285,8 +293,9 @@ static const struct vm_operations_struct dax_vm_ops = {
 	.pagesize = dev_dax_pagesize,
 };
 
-static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
+static int dax_mmap_prepare(struct vm_area_desc *desc)
 {
+	struct file *filp = desc->file;
 	struct dev_dax *dev_dax = filp->private_data;
 	int rc, id;
 
@@ -297,13 +306,14 @@ static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * fault time.
 	 */
 	id = dax_read_lock();
-	rc = check_vma(dev_dax, vma, __func__);
+	rc = __check_vma(dev_dax, desc->vm_flags, desc->start, desc->end, filp,
+			 __func__);
 	dax_read_unlock(id);
 	if (rc)
 		return rc;
 
-	vma->vm_ops = &dax_vm_ops;
-	vm_flags_set(vma, VM_HUGEPAGE);
+	desc->vm_ops = &dax_vm_ops;
+	desc->vm_flags |= VM_HUGEPAGE;
 	return 0;
 }
 
@@ -377,7 +387,7 @@ static const struct file_operations dax_fops = {
 	.open = dax_open,
 	.release = dax_release,
 	.get_unmapped_area = dax_get_unmapped_area,
-	.mmap = dax_mmap,
+	.mmap_prepare = dax_mmap_prepare,
 	.fop_flags = FOP_MMAP_SYNC,
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 02/16] device/dax: update devdax " Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 12:51   ` Jason Gunthorpe
  2025-09-08 15:10   ` David Hildenbrand
  2025-09-08 11:10 ` [PATCH 04/16] relay: update relay to use mmap_prepare Lorenzo Stoakes
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

It's useful to be able to determine the size of the range described by a
VMA descriptor used by f_op->mmap_prepare, expressed both in bytes and in
pages, so add helpers for both and update code that can make use of them to
do so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/ntfs3/file.c    |  2 +-
 include/linux/mm.h | 10 ++++++++++
 mm/secretmem.c     |  2 +-
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
index c1ece707b195..86eb88f62714 100644
--- a/fs/ntfs3/file.c
+++ b/fs/ntfs3/file.c
@@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)
 
 	if (rw) {
 		u64 to = min_t(loff_t, i_size_read(inode),
-			       from + desc->end - desc->start);
+			       from + vma_desc_size(desc));
 
 		if (is_sparsed(ni)) {
 			/* Allocate clusters for rw map. */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a6bfa46937a8..9d4508b20be3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3560,6 +3560,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
 
+static inline unsigned long vma_desc_size(struct vm_area_desc *desc)
+{
+	return desc->end - desc->start;
+}
+
+static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
+{
+	return vma_desc_size(desc) >> PAGE_SHIFT;
+}
+
 /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 				unsigned long vm_start, unsigned long vm_end)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 60137305bc20..62066ddb1e9c 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)
 
 static int secretmem_mmap_prepare(struct vm_area_desc *desc)
 {
-	const unsigned long len = desc->end - desc->start;
+	const unsigned long len = vma_desc_size(desc);
 
 	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
 		return -EINVAL;
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 04/16] relay: update relay to use mmap_prepare
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (2 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 15:15   ` David Hildenbrand
  2025-09-08 11:10 ` [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion Lorenzo Stoakes
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

It is relatively trivial to update this code to use the f_op->mmap_prepare
hook in favour of the deprecated f_op->mmap hook, so do so.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/relay.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/kernel/relay.c b/kernel/relay.c
index 8d915fe98198..8866054104fe 100644
--- a/kernel/relay.c
+++ b/kernel/relay.c
@@ -72,17 +72,17 @@ static void relay_free_page_array(struct page **array)
 }
 
 /**
- *	relay_mmap_buf: - mmap channel buffer to process address space
- *	@buf: relay channel buffer
- *	@vma: vm_area_struct describing memory to be mapped
+ *	relay_mmap_prepare_buf: - mmap channel buffer to process address space
+ *	@desc: describing what to map
  *
  *	Returns 0 if ok, negative on error
  *
  *	Caller should already have grabbed mmap_lock.
  */
-static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
+static int relay_mmap_prepare_buf(struct rchan_buf *buf,
+				  struct vm_area_desc *desc)
 {
-	unsigned long length = vma->vm_end - vma->vm_start;
+	unsigned long length = vma_desc_size(desc);
 
 	if (!buf)
 		return -EBADF;
@@ -90,9 +90,9 @@ static int relay_mmap_buf(struct rchan_buf *buf, struct vm_area_struct *vma)
 	if (length != (unsigned long)buf->chan->alloc_size)
 		return -EINVAL;
 
-	vma->vm_ops = &relay_file_mmap_ops;
-	vm_flags_set(vma, VM_DONTEXPAND);
-	vma->vm_private_data = buf;
+	desc->vm_ops = &relay_file_mmap_ops;
+	desc->vm_flags |= VM_DONTEXPAND;
+	desc->private_data = buf;
 
 	return 0;
 }
@@ -749,16 +749,16 @@ static int relay_file_open(struct inode *inode, struct file *filp)
 }
 
 /**
- *	relay_file_mmap - mmap file op for relay files
- *	@filp: the file
- *	@vma: the vma describing what to map
+ *	relay_file_mmap_prepare - mmap file op for relay files
+ *	@desc: describing what to map
  *
- *	Calls upon relay_mmap_buf() to map the file into user space.
+ *	Calls upon relay_mmap_prepare_buf() to map the file into user space.
  */
-static int relay_file_mmap(struct file *filp, struct vm_area_struct *vma)
+static int relay_file_mmap_prepare(struct vm_area_desc *desc)
 {
-	struct rchan_buf *buf = filp->private_data;
-	return relay_mmap_buf(buf, vma);
+	struct rchan_buf *buf = desc->file->private_data;
+
+	return relay_mmap_prepare_buf(buf, desc);
 }
 
 /**
@@ -1006,7 +1006,7 @@ static ssize_t relay_file_read(struct file *filp,
 const struct file_operations relay_file_operations = {
 	.open		= relay_file_open,
 	.poll		= relay_file_poll,
-	.mmap		= relay_file_mmap,
+	.mmap_prepare	= relay_file_mmap_prepare,
 	.read		= relay_file_read,
 	.release	= relay_file_release,
 };
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (3 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 04/16] relay: update relay to use mmap_prepare Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 15:19   ` David Hildenbrand
  2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Now we have the f_op->mmap_prepare() hook, having a static function called
__mmap_prepare() that has nothing to do with it is confusing, so rename the
function.

Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
provide an f_op->mmap_complete() callback.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 mm/vma.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vma.c b/mm/vma.c
index abe0da33c844..0efa4288570e 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2329,7 +2329,7 @@ static void update_ksm_flags(struct mmap_state *map)
 }
 
 /*
- * __mmap_prepare() - Prepare to gather any overlapping VMAs that need to be
+ * __mmap_prelude() - Prepare to gather any overlapping VMAs that need to be
  * unmapped once the map operation is completed, check limits, account mapping
  * and clean up any pre-existing VMAs.
  *
@@ -2338,7 +2338,7 @@ static void update_ksm_flags(struct mmap_state *map)
  *
  * Returns: 0 on success, error code otherwise.
  */
-static int __mmap_prepare(struct mmap_state *map, struct list_head *uf)
+static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
 {
 	int error;
 	struct vma_iterator *vmi = map->vmi;
@@ -2515,13 +2515,13 @@ static int __mmap_new_vma(struct mmap_state *map, struct vm_area_struct **vmap)
 }
 
 /*
- * __mmap_complete() - Unmap any VMAs we overlap, account memory mapping
+ * __mmap_epilogue() - Unmap any VMAs we overlap, account memory mapping
  *                     statistics, handle locking and finalise the VMA.
  *
  * @map: Mapping state.
  * @vma: Merged or newly allocated VMA for the mmap()'d region.
  */
-static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+static void __mmap_epilogue(struct mmap_state *map, struct vm_area_struct *vma)
 {
 	struct mm_struct *mm = map->mm;
 	vm_flags_t vm_flags = vma->vm_flags;
@@ -2649,7 +2649,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 
 	map.check_ksm_early = can_set_ksm_flags_early(&map);
 
-	error = __mmap_prepare(&map, uf);
+	error = __mmap_prelude(&map, uf);
 	if (!error && have_mmap_prepare)
 		error = call_mmap_prepare(&map);
 	if (error)
@@ -2675,11 +2675,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	if (have_mmap_prepare)
 		set_vma_user_defined_fields(vma, &map);
 
-	__mmap_complete(&map, vma);
+	__mmap_epilogue(&map, vma);
 
 	return addr;
 
-	/* Accounting was done by __mmap_prepare(). */
+	/* Accounting was done by __mmap_prelude(). */
 unacct_error:
 	if (map.charged)
 		vm_unacct_memory(map.charged);
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (4 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 12:55   ` Jason Gunthorpe
                     ` (2 more replies)
  2025-09-08 11:10 ` [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort] Lorenzo Stoakes
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We have introduced the f_op->mmap_prepare hook to allow for setting up a
VMA far earlier in the process of mapping memory, reducing problematic
error handling paths, but this alone does not provide everything that
drivers/filesystems need.

In order to supply this, and to be able to move forward with removing
f_op->mmap altogether, introduce f_op->mmap_complete.

This hook is called once the VMA is fully mapped and everything is done,
albeit with the mmap write lock and VMA write lock still held.

The hook is then provided with a fully initialised VMA with which it can do
what it needs, though the mmap and VMA write locks must remain held
throughout.

It is not intended that the VMA be modified at this point - attempts to do
so will end in tears.

This allows for operations such as pre-population typically via a remap, or
really anything that requires access to the VMA once initialised.

In addition, a caller may need to take a lock in mmap_prepare, when it is
possible to modify the VMA, and release it on mmap_complete. In order to
handle errors which may arise between the two operations, f_op->mmap_abort
is provided.

This hook should be used to drop any lock and clean up anything before the
VMA mapping operation is aborted. After this point the VMA will not be
added to any mapping and will not exist.

We also add a new mmap_context field to the vm_area_desc type which can be
used to pass information pertinent to any locks which are held or any state
which is required for mmap_complete, abort to operate correctly.

We also update the compatibility layer for nested filesystems which
currently still only specify an f_op->mmap() handler so that it correctly
invokes f_op->mmap_complete as necessary (note that no error can occur
between mmap_prepare and mmap_complete so mmap_abort will never be called
in this case).

Also update the VMA tests to account for the changes.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/fs.h               |  4 ++
 include/linux/mm_types.h         |  5 ++
 mm/util.c                        | 18 +++++--
 mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
 tools/testing/vma/vma_internal.h | 31 ++++++++++--
 5 files changed, 129 insertions(+), 11 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 594bd4d0521e..bb432924993a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2195,6 +2195,10 @@ struct file_operations {
 	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 				unsigned int poll_flags);
 	int (*mmap_prepare)(struct vm_area_desc *);
+	int (*mmap_complete)(struct file *, struct vm_area_struct *,
+			     const void *context);
+	void (*mmap_abort)(const struct file *, const void *vm_private_data,
+			   const void *context);
 } __randomize_layout;
 
 /* Supports async buffered reads */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cf759fe08bb3..052db1f31fb3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -793,6 +793,11 @@ struct vm_area_desc {
 	/* Write-only fields. */
 	const struct vm_operations_struct *vm_ops;
 	void *private_data;
+	/*
+	 * A user-defined field, value will be passed to mmap_complete,
+	 * mmap_abort.
+	 */
+	void *mmap_context;
 };
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 248f877f629b..f5bcac140cb9 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1161,17 +1161,26 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 	err = f_op->mmap_prepare(&desc);
 	if (err)
 		return err;
+
 	set_vma_from_desc(vma, &desc);
 
-	return 0;
+	/*
+	 * No error can occur between mmap_prepare() and mmap_complete so no
+	 * need to invoke mmap_abort().
+	 */
+
+	if (f_op->mmap_complete)
+		err = f_op->mmap_complete(file, vma, desc.mmap_context);
+
+	return err;
 }
 EXPORT_SYMBOL(__compat_vma_mmap_prepare);
 
 /**
  * compat_vma_mmap_prepare() - Apply the file's .mmap_prepare() hook to an
- * existing VMA.
+ * existing VMA and invoke .mmap_complete() if provided.
  * @file: The file which possesss an f_op->mmap_prepare() hook.
- * @vma: The VMA to apply the .mmap_prepare() hook to.
+ * @vma: The VMA to apply the hooks to.
  *
  * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
  * stacked filesystems invoke a nested mmap hook of an underlying file.
@@ -1188,6 +1197,9 @@ EXPORT_SYMBOL(__compat_vma_mmap_prepare);
  * establishes a struct vm_area_desc descriptor, passes to the underlying
  * .mmap_prepare() hook and applies any changes performed by it.
  *
+ * If the relevant hooks are provided, it also invokes .mmap_complete() upon
+ * successful completion.
+ *
  * Once the conversion of filesystems is complete this function will no longer
  * be required and will be removed.
  *
diff --git a/mm/vma.c b/mm/vma.c
index 0efa4288570e..a0b568fe9e8d 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -22,6 +22,7 @@ struct mmap_state {
 	/* User-defined fields, perhaps updated by .mmap_prepare(). */
 	const struct vm_operations_struct *vm_ops;
 	void *vm_private_data;
+	void *mmap_context;
 
 	unsigned long charged;
 
@@ -2343,6 +2344,23 @@ static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
 	int error;
 	struct vma_iterator *vmi = map->vmi;
 	struct vma_munmap_struct *vms = &map->vms;
+	struct file *file = map->file;
+
+	if (file) {
+		/* f_op->mmap_complete requires f_op->mmap_prepare. */
+		if (file->f_op->mmap_complete && !file->f_op->mmap_prepare)
+			return -EINVAL;
+
+		/*
+		 * It's not valid to provide an f_op->mmap_abort hook without also
+		 * providing the f_op->mmap_prepare and f_op->mmap_complete hooks it is
+		 * used with.
+		 */
+		if (file->f_op->mmap_abort &&
+		     (!file->f_op->mmap_prepare ||
+		      !file->f_op->mmap_complete))
+			return -EINVAL;
+	}
 
 	/* Find the first overlapping VMA and initialise unmap state. */
 	vms->vma = vma_find(vmi, map->end);
@@ -2595,6 +2613,7 @@ static int call_mmap_prepare(struct mmap_state *map)
 	/* User-defined fields. */
 	map->vm_ops = desc.vm_ops;
 	map->vm_private_data = desc.private_data;
+	map->mmap_context = desc.mmap_context;
 
 	return 0;
 }
@@ -2636,16 +2655,61 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 	return false;
 }
 
+/*
+ * Invoke the f_op->mmap_complete hook, providing it with a fully initialised
+ * VMA to operate upon.
+ *
+ * The mmap and VMA write locks must be held when the hook is invoked, and
+ * must still be held once it returns.
+ */
+static int call_mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
+{
+	struct file *file = map->file;
+	void *context = map->mmap_context;
+	int error;
+	size_t len;
+
+	if (!file || !file->f_op->mmap_complete)
+		return 0;
+
+	error = file->f_op->mmap_complete(file, vma, context);
+	/* The hook must NOT drop the write locks. */
+	vma_assert_write_locked(vma);
+	mmap_assert_write_locked(current->mm);
+	if (!error)
+		return 0;
+
+	/*
+	 * If an error occurs, unmap the VMA altogether and return an error. We
+	 * only clear the newly allocated VMA, since this function is only
+	 * invoked if we do NOT merge, so we only clean up the VMA we created.
+	 */
+	len = vma_pages(vma) << PAGE_SHIFT;
+	do_munmap(current->mm, vma->vm_start, len, NULL);
+	return error;
+}
+
+static void call_mmap_abort(struct mmap_state *map)
+{
+	struct file *file = map->file;
+	void *vm_private_data = map->vm_private_data;
+
+	VM_WARN_ON_ONCE(!file || !file->f_op);
+	file->f_op->mmap_abort(file, vm_private_data, map->mmap_context);
+}
+
 static unsigned long __mmap_region(struct file *file, unsigned long addr,
 		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
 		struct list_head *uf)
 {
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	int error;
 	bool have_mmap_prepare = file && file->f_op->mmap_prepare;
+	bool have_mmap_abort = file && file->f_op->mmap_abort;
+	struct mm_struct *mm = current->mm;
 	VMA_ITERATOR(vmi, mm, addr);
 	MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
+	struct vm_area_struct *vma = NULL;
+	bool allocated_new = false;
+	int error;
 
 	map.check_ksm_early = can_set_ksm_flags_early(&map);
 
@@ -2668,8 +2732,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	/* ...but if we can't, allocate a new VMA. */
 	if (!vma) {
 		error = __mmap_new_vma(&map, &vma);
-		if (error)
+		if (error) {
+			if (have_mmap_abort)
+				call_mmap_abort(&map);
 			goto unacct_error;
+		}
+		allocated_new = true;
 	}
 
 	if (have_mmap_prepare)
@@ -2677,6 +2745,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 
 	__mmap_epilogue(&map, vma);
 
+	if (allocated_new) {
+		error = call_mmap_complete(&map, vma);
+		if (error)
+			return error;
+	}
+
 	return addr;
 
 	/* Accounting was done by __mmap_prelude(). */
diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
index 07167446dcf4..566cef1c0e0b 100644
--- a/tools/testing/vma/vma_internal.h
+++ b/tools/testing/vma/vma_internal.h
@@ -297,11 +297,20 @@ struct vm_area_desc {
 	/* Write-only fields. */
 	const struct vm_operations_struct *vm_ops;
 	void *private_data;
+	/*
+	 * A user-defined field whose value will be passed to mmap_complete()
+	 * and mmap_abort().
+	 */
+	void *mmap_context;
 };
 
 struct file_operations {
 	int (*mmap)(struct file *, struct vm_area_struct *);
 	int (*mmap_prepare)(struct vm_area_desc *);
+	void (*mmap_abort)(const struct file *, const void *vm_private_data,
+			   const void *context);
+	int (*mmap_complete)(struct file *, struct vm_area_struct *,
+			     const void *context);
 };
 
 struct file {
@@ -1471,7 +1480,7 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 {
 	struct vm_area_desc desc = {
 		.mm = vma->vm_mm,
-		.file = vma->vm_file,
+		.file = file,
 		.start = vma->vm_start,
 		.end = vma->vm_end,
 
@@ -1485,13 +1494,21 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
 	err = f_op->mmap_prepare(&desc);
 	if (err)
 		return err;
+
 	set_vma_from_desc(vma, &desc);
 
-	return 0;
+	/*
+	 * No error can occur between mmap_prepare() and mmap_complete so no
+	 * need to invoke mmap_abort().
+	 */
+
+	if (f_op->mmap_complete)
+		err = f_op->mmap_complete(file, vma, desc.mmap_context);
+
+	return err;
 }
 
-static inline int compat_vma_mmap_prepare(struct file *file,
-		struct vm_area_struct *vma)
+static inline int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
 {
 	return __compat_vma_mmap_prepare(file->f_op, file, vma);
 }
@@ -1548,4 +1565,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
 	return vm_flags;
 }
 
+static inline int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf)
+{
+	return 0;
+}
+
 #endif	/* __MM_VMA_INTERNAL_H */
-- 
2.51.0



* [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort]
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (5 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 23:17   ` Randy Dunlap
  2025-09-08 11:10 ` [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We have introduced the mmap_complete() and mmap_abort() callbacks, which
work in conjunction with mmap_prepare(), so describe what they are used for.

We update both the VFS documentation and the porting guide.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 Documentation/filesystems/porting.rst |  9 +++++++
 Documentation/filesystems/vfs.rst     | 35 +++++++++++++++++++++++++++
 2 files changed, 44 insertions(+)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 85f590254f07..abc1b8c95d24 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1285,3 +1285,12 @@ rather than a VMA, as the VMA at this stage is not yet valid.
 The vm_area_desc provides the minimum required information for a filesystem
 to initialise state upon memory mapping of a file-backed region, and output
 parameters for the file system to set this state.
+
+In nearly all cases, this is all that is required for a filesystem. However,
+should there be a need to operate on the newly inserted VMA, an mmap_complete()
+hook can be specified to do so.
+
+Additionally, if mmap_prepare() and mmap_complete() are specified, mmap_abort()
+may also be provided, which is invoked if the mapping fails between the
+mmap_prepare() and mmap_complete() calls. It is only valid to specify
+mmap_abort() if both other hooks are provided.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 486a91633474..172d36a13e13 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1114,6 +1114,10 @@ This describes how the VFS can manipulate an open file.  As of kernel
 		int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
 					unsigned int poll_flags);
 		int (*mmap_prepare)(struct vm_area_desc *);
+		int (*mmap_complete)(struct file *, struct vm_area_struct *,
+				     const void *context);
+		void (*mmap_abort)(const struct file *, const void *vm_private_data,
+				   const void *context);
 	};
 
 Again, all methods are called without any locks being held, unless
@@ -1236,6 +1240,37 @@ otherwise noted.
 	file-backed memory mapping, most notably establishing relevant
 	private state and VMA callbacks.
 
+``mmap_complete``
+	If mmap_prepare is provided, this hook will be invoked after the
+	mapping is fully established, with the mmap and VMA write locks held.
+
+	It is useful for prepopulating VMAs before they may be accessed by
+	users.
+
+	The hook MUST NOT release either the VMA or mmap write locks. This is
+	asserted by the mmap logic.
+
+	If an error is returned by the hook, the VMA is unmapped and the
+	mmap() operation fails with that error.
+
+	It is not valid to specify this hook if mmap_prepare is not also
+	specified; doing so will result in an error upon mapping.
+
+``mmap_abort``
+	If mmap_prepare() and mmap_complete() are provided, then mmap_abort
+	may also be provided, which will be invoked if the mapping operation
+	fails between the two calls.
+
+	This is important, because mmap_prepare may succeed, but some other part
+	of the mapping operation may fail before mmap_complete can be called.
+
+	This allows a caller to acquire locks in mmap_prepare with certainty
+	that the locks will be released by either mmap_abort or mmap_complete no
+	matter what happens.
+
+	It is not valid to specify this unless mmap_prepare and mmap_complete
+	are both specified; doing so will result in an error upon mapping.
+
 Note that the file operations are implemented by the specific
 filesystem in which the inode resides.  When opening a device node
 (character or block special) most filesystems will call special
-- 
2.51.0



* [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (6 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort] Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 13:00   ` Jason Gunthorpe
  2025-09-08 11:10 ` [PATCH 09/16] mm: introduce io_remap_pfn_range_prepare, complete Lorenzo Stoakes
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We need the ability to split PFN remap between updating the VMA and
performing the actual remap, in order to do away with the legacy f_op->mmap
hook.

To do so, update the PFN remap code to provide shared logic, and also make
remap_pfn_range_notrack() static, as its one user, io_mapping_map_user(),
was removed in commit 9a4f90e24661 ("mm: remove mm/io-mapping.c").

Then, introduce remap_pfn_range_prepare(), which accepts a VMA descriptor
and PFN parameters, and remap_pfn_range_complete(), which accepts the same
parameters as remap_pfn_range().

remap_pfn_range_prepare() will set vma->vm_pgoff for CoW mappings if
necessary, so it must be supplied with a correct PFN to do so. If the
caller must hold locks to be able to do this, those locks should be held
across the operation, and mmap_abort() should be provided to release them
should an error arise.

While we're here, also clean up the duplicated #ifdef
__HAVE_PFNMAP_TRACKING check and fold it into a single #ifdef/#else block.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/mm.h |  25 +++++++--
 mm/memory.c        | 128 ++++++++++++++++++++++++++++-----------------
 2 files changed, 102 insertions(+), 51 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d4508b20be3..0f59bf14cac3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -489,6 +489,21 @@ extern unsigned int kobjsize(const void *objp);
  */
 #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)
 
+/*
+ * Physically remapped pages are special. Tell the
+ * rest of the world about it:
+ *   VM_IO tells people not to look at these pages
+ *	(accesses can have side effects).
+ *   VM_PFNMAP tells the core MM that the base pages are just
+ *	raw PFN mappings, and do not have a "struct page" associated
+ *	with them.
+ *   VM_DONTEXPAND
+ *      Disable vma merging and expanding with mremap().
+ *   VM_DONTDUMP
+ *      Omit vma from core dump, even when VM_IO turned off.
+ */
+#define VM_REMAP_FLAGS (VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP)
+
 /* This mask prevents VMA from being scanned with khugepaged */
 #define VM_NO_KHUGEPAGED (VM_SPECIAL | VM_HUGETLB)
 
@@ -3611,10 +3626,12 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 
 struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
 		unsigned long addr);
-int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
-			unsigned long pfn, unsigned long size, pgprot_t);
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot);
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t pgprot);
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t pgprot);
+
 int vm_insert_page(struct vm_area_struct *, unsigned long addr, struct page *);
 int vm_insert_pages(struct vm_area_struct *vma, unsigned long addr,
 			struct page **pages, unsigned long *num);
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..f6234c54047f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2900,8 +2900,27 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	return 0;
 }
 
+static int get_remap_pgoff(vm_flags_t vm_flags, unsigned long addr,
+		unsigned long end, unsigned long vm_start, unsigned long vm_end,
+		unsigned long pfn, pgoff_t *vm_pgoff_p)
+{
+	/*
+	 * There's a horrible special case to handle copy-on-write
+	 * behaviour that some programs depend on. We mark the "original"
+	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
+	 * See vm_normal_page() for details.
+	 */
+	if (is_cow_mapping(vm_flags)) {
+		if (addr != vm_start || end != vm_end)
+			return -EINVAL;
+		*vm_pgoff_p = pfn;
+	}
+
+	return 0;
+}
+
 static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -2912,32 +2931,17 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
 	if (WARN_ON_ONCE(!PAGE_ALIGNED(addr)))
 		return -EINVAL;
 
-	/*
-	 * Physically remapped pages are special. Tell the
-	 * rest of the world about it:
-	 *   VM_IO tells people not to look at these pages
-	 *	(accesses can have side effects).
-	 *   VM_PFNMAP tells the core MM that the base pages are just
-	 *	raw PFN mappings, and do not have a "struct page" associated
-	 *	with them.
-	 *   VM_DONTEXPAND
-	 *      Disable vma merging and expanding with mremap().
-	 *   VM_DONTDUMP
-	 *      Omit vma from core dump, even when VM_IO turned off.
-	 *
-	 * There's a horrible special case to handle copy-on-write
-	 * behaviour that some programs depend on. We mark the "original"
-	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
-	 * See vm_normal_page() for details.
-	 */
-	if (is_cow_mapping(vma->vm_flags)) {
-		if (addr != vma->vm_start || end != vma->vm_end)
-			return -EINVAL;
-		vma->vm_pgoff = pfn;
+	if (set_vma) {
+		err = get_remap_pgoff(vma->vm_flags, addr, end,
+				      vma->vm_start, vma->vm_end,
+				      pfn, &vma->vm_pgoff);
+		if (err)
+			return err;
+		vm_flags_set(vma, VM_REMAP_FLAGS);
+	} else {
+		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);
 	}
 
-	vm_flags_set(vma, VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
-
 	BUG_ON(addr >= end);
 	pfn -= addr >> PAGE_SHIFT;
 	pgd = pgd_offset(mm, addr);
@@ -2957,11 +2961,10 @@ static int remap_pfn_range_internal(struct vm_area_struct *vma, unsigned long ad
  * Variant of remap_pfn_range that does not call track_pfn_remap.  The caller
  * must have pre-validated the caching bits of the pgprot_t.
  */
-int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
-	int error = remap_pfn_range_internal(vma, addr, pfn, size, prot);
-
+	int error = remap_pfn_range_internal(vma, addr, pfn, size, prot, set_vma);
 	if (!error)
 		return 0;
 
@@ -2974,6 +2977,18 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	return error;
 }
 
+void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+{
+	/*
+	 * We set addr=VMA start, end=VMA end here, so this won't fail, but we
+	 * check it again on complete and will fail there if specified addr is
+	 * invalid.
+	 */
+	get_remap_pgoff(desc->vm_flags, desc->start, desc->end,
+			desc->start, desc->end, pfn, &desc->pgoff);
+	desc->vm_flags |= VM_REMAP_FLAGS;
+}
+
 #ifdef __HAVE_PFNMAP_TRACKING
 static inline struct pfnmap_track_ctx *pfnmap_track_ctx_alloc(unsigned long pfn,
 		unsigned long size, pgprot_t *prot)
@@ -3002,23 +3017,9 @@ void pfnmap_track_ctx_release(struct kref *ref)
 	pfnmap_untrack(ctx->pfn, ctx->size);
 	kfree(ctx);
 }
-#endif /* __HAVE_PFNMAP_TRACKING */
 
-/**
- * remap_pfn_range - remap kernel memory to userspace
- * @vma: user vma to map to
- * @addr: target page aligned user address to start at
- * @pfn: page frame number of kernel physical memory address
- * @size: size of mapping area
- * @prot: page protection flags for this mapping
- *
- * Note: this is only safe if the mm semaphore is held when called.
- *
- * Return: %0 on success, negative error code otherwise.
- */
-#ifdef __HAVE_PFNMAP_TRACKING
-int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
-		    unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_pfn_range_track(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot, bool set_vma)
 {
 	struct pfnmap_track_ctx *ctx = NULL;
 	int err;
@@ -3044,7 +3045,7 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		return -EINVAL;
 	}
 
-	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+	err = remap_pfn_range_notrack(vma, addr, pfn, size, prot, set_vma);
 	if (ctx) {
 		if (err)
 			kref_put(&ctx->kref, pfnmap_track_ctx_release);
@@ -3054,11 +3055,44 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 	return err;
 }
 
+/**
+ * remap_pfn_range - remap kernel memory to userspace
+ * @vma: user vma to map to
+ * @addr: target page aligned user address to start at
+ * @pfn: page frame number of kernel physical memory address
+ * @size: size of mapping area
+ * @prot: page protection flags for this mapping
+ *
+ * Note: this is only safe if the mm semaphore is held when called.
+ *
+ * Return: %0 on success, negative error code otherwise.
+ */
+int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
+		    unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_track(vma, addr, pfn, size, prot,
+				     /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+		unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	/* With set_vma = false, the VMA will not be modified. */
+	return remap_pfn_range_track(vma, addr, pfn, size, prot,
+				     /* set_vma = */false);
+}
 #else
 int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 		    unsigned long pfn, unsigned long size, pgprot_t prot)
 {
-	return remap_pfn_range_notrack(vma, addr, pfn, size, prot);
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot, /* set_vma = */true);
+}
+
+int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
+			     unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range_notrack(vma, addr, pfn, size, prot,
+				       /* set_vma = */false);
 }
 #endif
 EXPORT_SYMBOL(remap_pfn_range);
-- 
2.51.0



* [PATCH 09/16] mm: introduce io_remap_pfn_range_prepare, complete
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (7 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We introduce the io_remap*() equivalents of remap_pfn_range_prepare() and
remap_pfn_range_complete() to allow for I/O remapping utilising the
f_op->mmap_prepare and f_op->mmap_complete hooks.

We have to make some architecture-specific changes for those architectures
which define customised handlers.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/csky/include/asm/pgtable.h     |  5 +++++
 arch/mips/alchemy/common/setup.c    | 28 +++++++++++++++++++++++++---
 arch/mips/include/asm/pgtable.h     | 10 ++++++++++
 arch/sparc/include/asm/pgtable_32.h | 29 +++++++++++++++++++++++++----
 arch/sparc/include/asm/pgtable_64.h | 29 +++++++++++++++++++++++++----
 include/linux/mm.h                  | 18 ++++++++++++++++++
 6 files changed, 108 insertions(+), 11 deletions(-)

diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 5a394be09c35..c83505839a06 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -266,4 +266,9 @@ void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
 #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
 	remap_pfn_range(vma, vaddr, pfn, size, prot)
 
+/* default io_remap_pfn_range_prepare can be used. */
+
+#define io_remap_pfn_range_complete(vma, addr, pfn, size, prot) \
+	remap_pfn_range_complete(vma, addr, pfn, size, prot)
+
 #endif /* __ASM_CSKY_PGTABLE_H */
diff --git a/arch/mips/alchemy/common/setup.c b/arch/mips/alchemy/common/setup.c
index a7a6d31a7a41..a4ab02776994 100644
--- a/arch/mips/alchemy/common/setup.c
+++ b/arch/mips/alchemy/common/setup.c
@@ -94,12 +94,34 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t phys_addr, phys_addr_t size)
 	return phys_addr;
 }
 
-int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+static unsigned long calc_pfn(unsigned long pfn, unsigned long size)
 {
 	phys_addr_t phys_addr = fixup_bigphys_addr(pfn << PAGE_SHIFT, size);
 
-	return remap_pfn_range(vma, vaddr, phys_addr >> PAGE_SHIFT, size, prot);
+	return phys_addr >> PAGE_SHIFT;
+}
+
+int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
+		unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, vaddr, calc_pfn(pfn, size), size, prot);
 }
 EXPORT_SYMBOL(io_remap_pfn_range);
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+			       unsigned long size)
+{
+	remap_pfn_range_prepare(desc, calc_pfn(pfn, size));
+}
+EXPORT_SYMBOL(io_remap_pfn_range_prepare);
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_pfn(pfn, size),
+			size, prot);
+}
+EXPORT_SYMBOL(io_remap_pfn_range_complete);
+
 #endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index ae73ecf4c41a..6a8964f55a31 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -607,6 +607,16 @@ phys_addr_t fixup_bigphys_addr(phys_addr_t addr, phys_addr_t size);
 int io_remap_pfn_range(struct vm_area_struct *vma, unsigned long vaddr,
 		unsigned long pfn, unsigned long size, pgprot_t prot);
 #define io_remap_pfn_range io_remap_pfn_range
+
+void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+		unsigned long size);
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot);
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 #else
 #define fixup_bigphys_addr(addr, size)	(addr)
 #endif /* CONFIG_MIPS_FIXUP_BIGPHYS_ADDR */
diff --git a/arch/sparc/include/asm/pgtable_32.h b/arch/sparc/include/asm/pgtable_32.h
index 7c199c003ffe..cfd764afc107 100644
--- a/arch/sparc/include/asm/pgtable_32.h
+++ b/arch/sparc/include/asm/pgtable_32.h
@@ -398,9 +398,7 @@ __get_iospace (unsigned long addr)
 int remap_pfn_range(struct vm_area_struct *, unsigned long, unsigned long,
 		    unsigned long, pgprot_t);
 
-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
-				     unsigned long from, unsigned long pfn,
-				     unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
 {
 	unsigned long long offset, space, phys_base;
 
@@ -408,10 +406,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 	space = GET_IOSPACE(pfn);
 	phys_base = offset | (space << 32ULL);
 
-	return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+	return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+				     unsigned long from, unsigned long pfn,
+				     unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
 }
 #define io_remap_pfn_range io_remap_pfn_range
 
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+		unsigned long size)
+{
+	remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+			size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
 #define ptep_set_access_flags(__vma, __address, __ptep, __entry, __dirty) \
 ({									  \
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 669cd02469a1..b8000ce4b59f 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -1084,9 +1084,7 @@ static inline int arch_unmap_one(struct mm_struct *mm,
 	return 0;
 }
 
-static inline int io_remap_pfn_range(struct vm_area_struct *vma,
-				     unsigned long from, unsigned long pfn,
-				     unsigned long size, pgprot_t prot)
+static inline unsigned long calc_io_remap_pfn(unsigned long pfn)
 {
 	unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;
 	int space = GET_IOSPACE(pfn);
@@ -1094,10 +1092,33 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 
 	phys_base = offset | (((unsigned long) space) << 32UL);
 
-	return remap_pfn_range(vma, from, phys_base >> PAGE_SHIFT, size, prot);
+	return phys_base >> PAGE_SHIFT;
+}
+
+static inline int io_remap_pfn_range(struct vm_area_struct *vma,
+				     unsigned long from, unsigned long pfn,
+				     unsigned long size, pgprot_t prot)
+{
+	return remap_pfn_range(vma, from, calc_io_remap_pfn(pfn), size, prot);
 }
 #define io_remap_pfn_range io_remap_pfn_range
 
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+	unsigned long size)
+{
+	remap_pfn_range_prepare(desc, calc_io_remap_pfn(pfn));
+}
+#define io_remap_pfn_range_prepare io_remap_pfn_range_prepare
+
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, calc_io_remap_pfn(pfn),
+					size, prot);
+}
+#define io_remap_pfn_range_complete io_remap_pfn_range_complete
+
 static inline unsigned long __untagged_addr(unsigned long start)
 {
 	if (adi_capable()) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0f59bf14cac3..d96840262498 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3673,6 +3673,24 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,
 }
 #endif
 
+#ifndef io_remap_pfn_range_prepare
+static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn,
+	unsigned long size)
+{
+	remap_pfn_range_prepare(desc, pfn);
+}
+#endif
+
+#ifndef io_remap_pfn_range_complete
+static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
+		unsigned long addr, unsigned long pfn, unsigned long size,
+		pgprot_t prot)
+{
+	return remap_pfn_range_complete(vma, addr, pfn, size,
+			pgprot_decrypted(prot));
+}
+#endif
+
 static inline vm_fault_t vmf_error(int err)
 {
 	if (err == -ENOMEM)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (8 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 09/16] mm: introduce io_remap_pfn_range_prepare, complete Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 13:11   ` Jason Gunthorpe
  2025-09-08 11:10 ` [PATCH 11/16] mm: update mem char driver " Lorenzo Stoakes
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We can now update hugetlb to make use of the new .mmap_prepare() hook,
deferring the reservation of pages until the VMA is fully established and
handling this in the f_op->mmap_complete() hook.

We hold the VMA write lock throughout so we can't race with faults. rmap
can discover the VMA, but this should not cause a problem.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/hugetlbfs/inode.c | 86 ++++++++++++++++++++++++--------------------
 1 file changed, 47 insertions(+), 39 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 3cfdf4091001..46d1ddc654c2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -96,39 +96,14 @@ static const struct fs_parameter_spec hugetlb_fs_parameters[] = {
 #define PGOFF_LOFFT_MAX \
 	(((1UL << (PAGE_SHIFT + 1)) - 1) <<  (BITS_PER_LONG - (PAGE_SHIFT + 1)))
 
-static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+static int hugetlb_file_mmap_complete(struct file *file, struct vm_area_struct *vma,
+				      const void *context)
 {
 	struct inode *inode = file_inode(file);
-	loff_t len, vma_len;
-	int ret;
 	struct hstate *h = hstate_file(file);
-	vm_flags_t vm_flags;
-
-	/*
-	 * vma address alignment (but not the pgoff alignment) has
-	 * already been checked by prepare_hugepage_range.  If you add
-	 * any error returns here, do so after setting VM_HUGETLB, so
-	 * is_vm_hugetlb_page tests below unmap_region go the right
-	 * way when do_mmap unwinds (may be important on powerpc
-	 * and ia64).
-	 */
-	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
-	vma->vm_ops = &hugetlb_vm_ops;
-
-	/*
-	 * page based offset in vm_pgoff could be sufficiently large to
-	 * overflow a loff_t when converted to byte offset.  This can
-	 * only happen on architectures where sizeof(loff_t) ==
-	 * sizeof(unsigned long).  So, only check in those instances.
-	 */
-	if (sizeof(unsigned long) == sizeof(loff_t)) {
-		if (vma->vm_pgoff & PGOFF_LOFFT_MAX)
-			return -EINVAL;
-	}
-
-	/* must be huge page aligned */
-	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
-		return -EINVAL;
+	vm_flags_t vm_flags = vma->vm_flags;
+	loff_t len, vma_len;
+	int ret = 0;
 
 	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
 	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
@@ -139,9 +114,6 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	inode_lock(inode);
 	file_accessed(file);
 
-	ret = -ENOMEM;
-
-	vm_flags = vma->vm_flags;
 	/*
 	 * for SHM_HUGETLB, the pages are reserved in the shmget() call so skip
 	 * reserving here. Note: only for SHM hugetlbfs file, the inode
@@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 		vm_flags |= VM_NORESERVE;
 
 	if (hugetlb_reserve_pages(inode,
-				vma->vm_pgoff >> huge_page_order(h),
-				len >> huge_page_shift(h), vma,
-				vm_flags) < 0)
+			vma->vm_pgoff >> huge_page_order(h),
+			len >> huge_page_shift(h), vma,
+			vm_flags) < 0) {
+		ret = -ENOMEM;
 		goto out;
+	}
 
-	ret = 0;
 	if (vma->vm_flags & VM_WRITE && inode->i_size < len)
 		i_size_write(inode, len);
+
 out:
 	inode_unlock(inode);
-
 	return ret;
 }
 
+static int hugetlbfs_file_mmap_prepare(struct vm_area_desc *desc)
+{
+	struct file *file = desc->file;
+	struct hstate *h = hstate_file(file);
+
+	/*
+	 * vma address alignment (but not the pgoff alignment) has
+	 * already been checked by prepare_hugepage_range.  If you add
+	 * any error returns here, do so after setting VM_HUGETLB, so
+	 * is_vm_hugetlb_page tests below unmap_region go the right
+	 * way when do_mmap unwinds (may be important on powerpc
+	 * and ia64).
+	 */
+	desc->vm_flags |= VM_HUGETLB | VM_DONTEXPAND;
+	desc->vm_ops = &hugetlb_vm_ops;
+
+	/*
+	 * page based offset in vm_pgoff could be sufficiently large to
+	 * overflow a loff_t when converted to byte offset.  This can
+	 * only happen on architectures where sizeof(loff_t) ==
+	 * sizeof(unsigned long).  So, only check in those instances.
+	 */
+	if (sizeof(unsigned long) == sizeof(loff_t)) {
+		if (desc->pgoff & PGOFF_LOFFT_MAX)
+			return -EINVAL;
+	}
+
+	/* must be huge page aligned */
+	if (desc->pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
+		return -EINVAL;
+
+	return 0;
+}
+
 /*
  * Called under mmap_write_lock(mm).
  */
@@ -1219,7 +1226,8 @@ static void init_once(void *foo)
 
 static const struct file_operations hugetlbfs_file_operations = {
 	.read_iter		= hugetlbfs_read_iter,
-	.mmap			= hugetlbfs_file_mmap,
+	.mmap_prepare		= hugetlbfs_file_mmap_prepare,
+	.mmap_complete		= hugetlb_file_mmap_complete,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 	.llseek			= default_llseek,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 11/16] mm: update mem char driver to use mmap_prepare, mmap_complete
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (9 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort Lorenzo Stoakes
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Update the mem char driver (backing /dev/mem and /dev/zero) to use
f_op->mmap_prepare, f_op->mmap_complete hooks rather than the deprecated
f_op->mmap hook.

The /dev/zero implementation has a unique and rather concerning
characteristic in that it marks MAP_PRIVATE mmap() mappings anonymous when
they are, in fact, not.

The new f_op->mmap_prepare() hook can support this, but rather than
introducing a helper function to perform this hack (and risk introducing
other users), simply set desc->vm_ops to NULL here and add a comment
describing what's going on.

We also introduce shmem_zero_setup_desc() to allow for the shared mapping
case via an f_op->mmap_prepare() hook, and generalise the code between this
and shmem_zero_setup().

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 drivers/char/mem.c       | 80 +++++++++++++++++++++++-----------------
 include/linux/shmem_fs.h |  3 +-
 mm/shmem.c               | 40 ++++++++++++++++----
 3 files changed, 81 insertions(+), 42 deletions(-)

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 34b815901b20..b57ed104d302 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -304,13 +304,13 @@ static unsigned zero_mmap_capabilities(struct file *file)
 }
 
 /* can't do an in-place private mapping if there's no MMU */
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
 {
-	return is_nommu_shared_mapping(vma->vm_flags);
+	return is_nommu_shared_mapping(desc->vm_flags);
 }
 #else
 
-static inline int private_mapping_ok(struct vm_area_struct *vma)
+static inline int private_mapping_ok(struct vm_area_desc *desc)
 {
 	return 1;
 }
@@ -322,46 +322,54 @@ static const struct vm_operations_struct mmap_mem_ops = {
 #endif
 };
 
-static int mmap_mem(struct file *file, struct vm_area_struct *vma)
+static int mmap_mem_complete(struct file *file, struct vm_area_struct *vma,
+			     const void *context)
 {
 	size_t size = vma->vm_end - vma->vm_start;
-	phys_addr_t offset = (phys_addr_t)vma->vm_pgoff << PAGE_SHIFT;
+
+	if (remap_pfn_range_complete(vma,
+			    vma->vm_start,
+			    vma->vm_pgoff,
+			    size,
+			    vma->vm_page_prot))
+		return -EAGAIN;
+
+	return 0;
+}
+
+static int mmap_mem_prepare(struct vm_area_desc *desc)
+{
+	size_t size = vma_desc_size(desc);
+	phys_addr_t offset = (phys_addr_t)desc->pgoff << PAGE_SHIFT;
 
 	/* Does it even fit in phys_addr_t? */
-	if (offset >> PAGE_SHIFT != vma->vm_pgoff)
+	if (offset >> PAGE_SHIFT != desc->pgoff)
 		return -EINVAL;
 
 	/* It's illegal to wrap around the end of the physical address space. */
 	if (offset + (phys_addr_t)size - 1 < offset)
 		return -EINVAL;
 
-	if (!valid_mmap_phys_addr_range(vma->vm_pgoff, size))
+	if (!valid_mmap_phys_addr_range(desc->pgoff, size))
 		return -EINVAL;
 
-	if (!private_mapping_ok(vma))
+	if (!private_mapping_ok(desc))
 		return -ENOSYS;
 
-	if (!range_is_allowed(vma->vm_pgoff, size))
+	if (!range_is_allowed(desc->pgoff, size))
 		return -EPERM;
 
-	if (!phys_mem_access_prot_allowed(file, vma->vm_pgoff, size,
-						&vma->vm_page_prot))
+	if (!phys_mem_access_prot_allowed(desc->file, desc->pgoff, size,
+						&desc->page_prot))
 		return -EINVAL;
 
-	vma->vm_page_prot = phys_mem_access_prot(file, vma->vm_pgoff,
-						 size,
-						 vma->vm_page_prot);
-
-	vma->vm_ops = &mmap_mem_ops;
+	desc->page_prot = phys_mem_access_prot(desc->file, desc->pgoff,
+					       size,
+					       desc->page_prot);
+	desc->vm_ops = &mmap_mem_ops;
 
 	/* Remap-pfn-range will mark the range VM_IO */
-	if (remap_pfn_range(vma,
-			    vma->vm_start,
-			    vma->vm_pgoff,
-			    size,
-			    vma->vm_page_prot)) {
-		return -EAGAIN;
-	}
+	remap_pfn_range_prepare(desc, desc->pgoff);
 	return 0;
 }
 
@@ -501,14 +509,18 @@ static ssize_t read_zero(struct file *file, char __user *buf,
 	return cleared;
 }
 
-static int mmap_zero(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_zero(struct vm_area_desc *desc)
 {
 #ifndef CONFIG_MMU
 	return -ENOSYS;
 #endif
-	if (vma->vm_flags & VM_SHARED)
-		return shmem_zero_setup(vma);
-	vma_set_anonymous(vma);
+	if (desc->vm_flags & VM_SHARED)
+		return shmem_zero_setup_desc(desc);
+	/*
+	 * This is a highly unique situation where we mark a MAP_PRIVATE mapping
+	 * of /dev/zero anonymous, despite it not being.
+	 */
+	desc->vm_ops = NULL;
 	return 0;
 }
 
@@ -526,10 +538,11 @@ static unsigned long get_unmapped_area_zero(struct file *file,
 {
 	if (flags & MAP_SHARED) {
 		/*
-		 * mmap_zero() will call shmem_zero_setup() to create a file,
-		 * so use shmem's get_unmapped_area in case it can be huge;
-		 * and pass NULL for file as in mmap.c's get_unmapped_area(),
-		 * so as not to confuse shmem with our handle on "/dev/zero".
+		 * mmap_prepare_zero() will call shmem_zero_setup() to create a
+		 * file, so use shmem's get_unmapped_area in case it can be
+		 * huge; and pass NULL for file as in mmap.c's
+		 * get_unmapped_area(), so as not to confuse shmem with our
+		 * handle on "/dev/zero".
 		 */
 		return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
 	}
@@ -632,7 +645,8 @@ static const struct file_operations __maybe_unused mem_fops = {
 	.llseek		= memory_lseek,
 	.read		= read_mem,
 	.write		= write_mem,
-	.mmap		= mmap_mem,
+	.mmap_prepare	= mmap_mem_prepare,
+	.mmap_complete	= mmap_mem_complete,
 	.open		= open_mem,
 #ifndef CONFIG_MMU
 	.get_unmapped_area = get_unmapped_area_mem,
@@ -668,7 +682,7 @@ static const struct file_operations zero_fops = {
 	.write_iter	= write_iter_zero,
 	.splice_read	= copy_splice_read,
 	.splice_write	= splice_write_zero,
-	.mmap		= mmap_zero,
+	.mmap_prepare	= mmap_prepare_zero,
 	.get_unmapped_area = get_unmapped_area_zero,
 #ifndef CONFIG_MMU
 	.mmap_capabilities = zero_mmap_capabilities,
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0e47465ef0fd..5b368f9549d6 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -94,7 +94,8 @@ extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
 					    unsigned long flags);
 extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt,
 		const char *name, loff_t size, unsigned long flags);
-extern int shmem_zero_setup(struct vm_area_struct *);
+int shmem_zero_setup(struct vm_area_struct *vma);
+int shmem_zero_setup_desc(struct vm_area_desc *desc);
 extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 extern int shmem_lock(struct file *file, int lock, struct ucounts *ucounts);
diff --git a/mm/shmem.c b/mm/shmem.c
index cfc33b99a23a..7f402e438af0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -5905,14 +5905,9 @@ struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name,
 }
 EXPORT_SYMBOL_GPL(shmem_file_setup_with_mnt);
 
-/**
- * shmem_zero_setup - setup a shared anonymous mapping
- * @vma: the vma to be mmapped is prepared by do_mmap
- */
-int shmem_zero_setup(struct vm_area_struct *vma)
+static struct file *__shmem_zero_setup(unsigned long start, unsigned long end, vm_flags_t vm_flags)
 {
-	struct file *file;
-	loff_t size = vma->vm_end - vma->vm_start;
+	loff_t size = end - start;
 
 	/*
 	 * Cloning a new file under mmap_lock leads to a lock ordering conflict
@@ -5920,7 +5915,17 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	 * accessible to the user through its mapping, use S_PRIVATE flag to
 	 * bypass file security, in the same way as shmem_kernel_file_setup().
 	 */
-	file = shmem_kernel_file_setup("dev/zero", size, vma->vm_flags);
+	return shmem_kernel_file_setup("dev/zero", size, vm_flags);
+}
+
+/**
+ * shmem_zero_setup - setup a shared anonymous mapping
+ * @vma: the vma to be mmapped is prepared by do_mmap
+ */
+int shmem_zero_setup(struct vm_area_struct *vma)
+{
+	struct file *file = __shmem_zero_setup(vma->vm_start, vma->vm_end, vma->vm_flags);
+
 	if (IS_ERR(file))
 		return PTR_ERR(file);
 
@@ -5932,6 +5937,25 @@ int shmem_zero_setup(struct vm_area_struct *vma)
 	return 0;
 }
 
+/**
+ * shmem_zero_setup_desc - same as shmem_zero_setup(), but operates on a VMA
+ * descriptor for convenience.
+ * @desc: Describes VMA
+ * Returns: 0 on success, or error
+ */
+int shmem_zero_setup_desc(struct vm_area_desc *desc)
+{
+	struct file *file = __shmem_zero_setup(desc->start, desc->end, desc->vm_flags);
+
+	if (IS_ERR(file))
+		return PTR_ERR(file);
+
+	desc->vm_file = file;
+	desc->vm_ops = &shmem_anon_vm_ops;
+
+	return 0;
+}
+
 /**
  * shmem_read_folio_gfp - read into page cache, using specified page allocation flags.
  * @mapping:	the folio's address_space
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (10 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 11/16] mm: update mem char driver " Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 13:24   ` Jason Gunthorpe
  2025-09-08 11:10 ` [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

resctrl uses remap_pfn_range(), but holds a mutex across the
operation. Therefore, acquire the mutex in mmap_prepare(), and release it in
mmap_complete(), or in mmap_abort() should the operation fail.

Otherwise, we simply make use of the remap_pfn_range_[prepare/complete]()
remap PFN range variants in an ordinary way.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/resctrl/pseudo_lock.c | 56 +++++++++++++++++++++++++++++++---------
 1 file changed, 44 insertions(+), 12 deletions(-)

diff --git a/fs/resctrl/pseudo_lock.c b/fs/resctrl/pseudo_lock.c
index 87bbc2605de1..6d18ffde6a94 100644
--- a/fs/resctrl/pseudo_lock.c
+++ b/fs/resctrl/pseudo_lock.c
@@ -995,7 +995,8 @@ static const struct vm_operations_struct pseudo_mmap_ops = {
 	.mremap = pseudo_lock_dev_mremap,
 };
 
-static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
+static int pseudo_lock_dev_mmap_complete(struct file *filp, struct vm_area_struct *vma,
+					 const void *context)
 {
 	unsigned long vsize = vma->vm_end - vma->vm_start;
 	unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
@@ -1004,6 +1005,40 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	unsigned long physical;
 	unsigned long psize;
 
+	rdtgrp = filp->private_data;
+	plr = rdtgrp->plr;
+
+	physical = __pa(plr->kmem) >> PAGE_SHIFT;
+	psize = plr->size - off;
+
+	memset(plr->kmem + off, 0, vsize);
+
+	if (remap_pfn_range_complete(vma, vma->vm_start, physical + vma->vm_pgoff,
+			    vsize, vma->vm_page_prot)) {
+		mutex_unlock(&rdtgroup_mutex);
+		return -EAGAIN;
+	}
+
+	mutex_unlock(&rdtgroup_mutex);
+	return 0;
+}
+
+static void pseudo_lock_dev_mmap_abort(const struct file *filp,
+				       const void *vm_private_data,
+				       const void *context)
+{
+	mutex_unlock(&rdtgroup_mutex);
+}
+
+static int pseudo_lock_dev_mmap_prepare(struct vm_area_desc *desc)
+{
+	unsigned long vsize = vma_desc_size(desc);
+	unsigned long off = desc->pgoff << PAGE_SHIFT;
+	struct file *filp = desc->file;
+	struct pseudo_lock_region *plr;
+	struct rdtgroup *rdtgrp;
+	unsigned long psize;
+
 	mutex_lock(&rdtgroup_mutex);
 
 	rdtgrp = filp->private_data;
@@ -1031,7 +1066,6 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 		return -EINVAL;
 	}
 
-	physical = __pa(plr->kmem) >> PAGE_SHIFT;
 	psize = plr->size - off;
 
 	if (off > plr->size) {
@@ -1043,7 +1077,7 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 	 * Ensure changes are carried directly to the memory being mapped,
 	 * do not allow copy-on-write mapping.
 	 */
-	if (!(vma->vm_flags & VM_SHARED)) {
+	if (!(desc->vm_flags & VM_SHARED)) {
 		mutex_unlock(&rdtgroup_mutex);
 		return -EINVAL;
 	}
@@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
 		return -ENOSPC;
 	}
 
-	memset(plr->kmem + off, 0, vsize);
+	/* No CoW allowed so don't need to specify pfn. */
+	remap_pfn_range_prepare(desc, 0);
+	desc->vm_ops = &pseudo_mmap_ops;
 
-	if (remap_pfn_range(vma, vma->vm_start, physical + vma->vm_pgoff,
-			    vsize, vma->vm_page_prot)) {
-		mutex_unlock(&rdtgroup_mutex);
-		return -EAGAIN;
-	}
-	vma->vm_ops = &pseudo_mmap_ops;
-	mutex_unlock(&rdtgroup_mutex);
+	/* The mutex will be released in mmap_complete() or mmap_abort(). */
 	return 0;
 }
 
@@ -1071,7 +1101,9 @@ static const struct file_operations pseudo_lock_dev_fops = {
 	.write =	NULL,
 	.open =		pseudo_lock_dev_open,
 	.release =	pseudo_lock_dev_release,
-	.mmap =		pseudo_lock_dev_mmap,
+	.mmap_prepare =	pseudo_lock_dev_mmap_prepare,
+	.mmap_complete = pseudo_lock_dev_mmap_complete,
+	.mmap_abort =	pseudo_lock_dev_mmap_abort,
 };
 
 int rdt_pseudo_lock_init(void)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (11 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 13:27   ` Jason Gunthorpe
  2025-09-08 11:10 ` [PATCH 14/16] fs/proc: add proc_mmap_[prepare, complete] hooks for procfs Lorenzo Stoakes
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

We thread state through the mmap_context, allowing for both PFN map and
mixed map pre-population.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/cramfs/inode.c | 134 +++++++++++++++++++++++++++++++---------------
 1 file changed, 92 insertions(+), 42 deletions(-)

diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index b002e9b734f9..11a11213304d 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -59,6 +59,12 @@ static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
 
+/* How should the mapping be completed? */
+enum cramfs_mmap_state {
+	NO_PREPOPULATE,
+	PREPOPULATE_PFNMAP,
+	PREPOPULATE_MIXEDMAP,
+};
 
 /* These macros may change in future, to provide better st_ino semantics. */
 #define OFFSET(x)	((x)->i_ino)
@@ -342,34 +348,89 @@ static bool cramfs_last_page_is_shared(struct inode *inode)
 	return memchr_inv(tail_data, 0, PAGE_SIZE - partial) ? true : false;
 }
 
-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_complete(struct file *file, struct vm_area_struct *vma,
+					const void *context)
 {
 	struct inode *inode = file_inode(file);
 	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
-	unsigned int pages, max_pages, offset;
 	unsigned long address, pgoff = vma->vm_pgoff;
-	char *bailout_reason;
-	int ret;
+	unsigned int pages, offset;
+	enum cramfs_mmap_state mmap_state = (enum cramfs_mmap_state)(unsigned long)context;
+	int ret = 0;
 
-	ret = generic_file_readonly_mmap(file, vma);
-	if (ret)
-		return ret;
+	if (mmap_state == NO_PREPOPULATE)
+		return 0;
+
+	offset = cramfs_get_block_range(inode, pgoff, &pages);
+	address = sbi->linear_phys_addr + offset;
 
 	/*
 	 * Now try to pre-populate ptes for this vma with a direct
 	 * mapping avoiding memory allocation when possible.
 	 */
 
+	if (mmap_state == PREPOPULATE_PFNMAP) {
+		/*
+		 * The entire vma is mappable. remap_pfn_range() will
+		 * make it distinguishable from a non-direct mapping
+		 * in /proc/<pid>/maps by substituting the file offset
+		 * with the actual physical address.
+		 */
+		ret = remap_pfn_range_complete(vma, vma->vm_start, address >> PAGE_SHIFT,
+				pages * PAGE_SIZE, vma->vm_page_prot);
+	} else {
+		/*
+		 * Let's create a mixed map if we can't map it all.
+		 * The normal paging machinery will take care of the
+		 * unpopulated ptes via cramfs_read_folio().
+		 */
+		int i;
+
+		for (i = 0; i < pages && !ret; i++) {
+			vm_fault_t vmf;
+			unsigned long off = i * PAGE_SIZE;
+
+			vmf = vmf_insert_mixed(vma, vma->vm_start + off,
+					address + off);
+			if (vmf & VM_FAULT_ERROR)
+				ret = vm_fault_to_errno(vmf, 0);
+		}
+	}
+
+	if (!ret)
+		pr_debug("mapped %pD[%lu] at 0x%08lx (%u/%lu pages) "
+			 "to vma 0x%08lx, page_prot 0x%llx\n", file,
+			 pgoff, address, pages, vma_pages(vma), vma->vm_start,
+			 (unsigned long long)pgprot_val(vma->vm_page_prot));
+	return ret;
+}
+
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
+{
+	struct file *file = desc->file;
+	struct inode *inode = file_inode(file);
+	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+	unsigned int pages, max_pages, offset, mapped_pages;
+	unsigned long address, pgoff = desc->pgoff;
+	enum cramfs_mmap_state mmap_state;
+	char *bailout_reason;
+	int ret;
+
+	ret = generic_file_readonly_mmap_prepare(desc);
+	if (ret)
+		return ret;
+
 	/* Could COW work here? */
 	bailout_reason = "vma is writable";
-	if (vma->vm_flags & VM_WRITE)
+	if (desc->vm_flags & VM_WRITE)
 		goto bailout;
 
 	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	bailout_reason = "beyond file limit";
 	if (pgoff >= max_pages)
 		goto bailout;
-	pages = min(vma_pages(vma), max_pages - pgoff);
+	mapped_pages = vma_desc_pages(desc);
+	pages = min(mapped_pages, max_pages - pgoff);
 
 	offset = cramfs_get_block_range(inode, pgoff, &pages);
 	bailout_reason = "unsuitable block layout";
@@ -391,41 +452,23 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 		goto bailout;
 	}
 
-	if (pages == vma_pages(vma)) {
-		/*
-		 * The entire vma is mappable. remap_pfn_range() will
-		 * make it distinguishable from a non-direct mapping
-		 * in /proc/<pid>/maps by substituting the file offset
-		 * with the actual physical address.
-		 */
-		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
-				      pages * PAGE_SIZE, vma->vm_page_prot);
+	if (mapped_pages == pages)
+		mmap_state = PREPOPULATE_PFNMAP;
+	else
+		mmap_state = PREPOPULATE_MIXEDMAP;
+	desc->mmap_context = (void *)(unsigned long)mmap_state;
+
+	if (mmap_state == PREPOPULATE_PFNMAP) {
+		/* No CoW allowed, so no need to provide PFN. */
+		remap_pfn_range_prepare(desc, 0);
 	} else {
-		/*
-		 * Let's create a mixed map if we can't map it all.
-		 * The normal paging machinery will take care of the
-		 * unpopulated ptes via cramfs_read_folio().
-		 */
-		int i;
-		vm_flags_set(vma, VM_MIXEDMAP);
-		for (i = 0; i < pages && !ret; i++) {
-			vm_fault_t vmf;
-			unsigned long off = i * PAGE_SIZE;
-			vmf = vmf_insert_mixed(vma, vma->vm_start + off,
-					address + off);
-			if (vmf & VM_FAULT_ERROR)
-				ret = vm_fault_to_errno(vmf, 0);
-		}
+		desc->vm_flags |= VM_MIXEDMAP;
 	}
 
-	if (!ret)
-		pr_debug("mapped %pD[%lu] at 0x%08lx (%u/%lu pages) "
-			 "to vma 0x%08lx, page_prot 0x%llx\n", file,
-			 pgoff, address, pages, vma_pages(vma), vma->vm_start,
-			 (unsigned long long)pgprot_val(vma->vm_page_prot));
-	return ret;
+	return 0;
 
 bailout:
+	desc->mmap_context = (void *)NO_PREPOPULATE;
 	pr_debug("%pD[%lu]: direct mmap impossible: %s\n",
 		 file, pgoff, bailout_reason);
 	/* Didn't manage any direct map, but normal paging is still possible */
@@ -434,9 +477,15 @@ static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
 
 #else /* CONFIG_MMU */
 
-static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+static int cramfs_physmem_mmap_prepare(struct vm_area_desc *desc)
 {
-	return is_nommu_shared_mapping(vma->vm_flags) ? 0 : -ENOSYS;
+	return is_nommu_shared_mapping(desc->vm_flags) ? 0 : -ENOSYS;
+}
+
+static int cramfs_physmem_mmap_complete(struct file *file,
+		struct vm_area_struct *vma, const void *context)
+{
+	return 0;
 }
 
 static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
@@ -474,7 +523,8 @@ static const struct file_operations cramfs_physmem_fops = {
 	.llseek			= generic_file_llseek,
 	.read_iter		= generic_file_read_iter,
 	.splice_read		= filemap_splice_read,
-	.mmap			= cramfs_physmem_mmap,
+	.mmap_prepare		= cramfs_physmem_mmap_prepare,
+	.mmap_complete		= cramfs_physmem_mmap_complete,
 #ifndef CONFIG_MMU
 	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
 	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 14/16] fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (12 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 15/16] fs/proc: update vmcore to use .proc_mmap_[prepare, complete] Lorenzo Stoakes
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

By adding these hooks we enable procfs implementations to use the
.mmap_prepare and .mmap_complete hooks rather than the deprecated .mmap
hook.

We treat this as if it were any other nested mmap hook and utilise the
.mmap_prepare compatibility layer if necessary.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 fs/proc/inode.c         | 13 ++++++++++---
 include/linux/proc_fs.h |  5 +++++
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 129490151be1..d031267e2e4a 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -414,9 +414,16 @@ static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned
 
 static int pde_mmap(struct proc_dir_entry *pde, struct file *file, struct vm_area_struct *vma)
 {
-	__auto_type mmap = pde->proc_ops->proc_mmap;
-	if (mmap)
-		return mmap(file, vma);
+	const struct file_operations f_op = {
+		.mmap = pde->proc_ops->proc_mmap,
+		.mmap_prepare = pde->proc_ops->proc_mmap_prepare,
+		.mmap_complete = pde->proc_ops->proc_mmap_complete,
+	};
+
+	if (f_op.mmap)
+		return f_op.mmap(file, vma);
+	else if (f_op.mmap_prepare)
+		return __compat_vma_mmap_prepare(&f_op, file, vma);
 	return -EIO;
 }
 
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index f139377f4b31..3573192f813d 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -47,6 +47,11 @@ struct proc_ops {
 	long	(*proc_compat_ioctl)(struct file *, unsigned int, unsigned long);
 #endif
 	int	(*proc_mmap)(struct file *, struct vm_area_struct *);
+	int	(*proc_mmap_prepare)(struct vm_area_desc *);
+	int	(*proc_mmap_complete)(struct file *, struct vm_area_struct *,
+				      const void *context);
+	void	(*proc_mmap_abort)(const struct file *, const void *vm_private_data,
+				   const void *context);
 	unsigned long (*proc_get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 } __randomize_layout;
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 15/16] fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (13 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 14/16] fs/proc: add proc_mmap_[prepare, complete] hooks for procfs Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 11:10 ` [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete Lorenzo Stoakes
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Now that we are able to use the mmap_prepare and mmap_complete callbacks
for procfs implementations, update the vmcore implementation accordingly.

As part of this change, we must also update remap_vmalloc_range_partial()
to optionally not update VMA flags. Other than the remap_vmalloc_range()
wrapper, vmcore is the only user of this function, so we can simply go
ahead and add a parameter.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 arch/s390/kernel/crash_dump.c |  6 ++--
 fs/proc/vmcore.c              | 53 +++++++++++++++++++++++++----------
 include/linux/vmalloc.h       | 10 +++----
 mm/vmalloc.c                  | 16 +++++++++--
 4 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index d4839de8ce9d..44d7902f7e41 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -186,7 +186,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,
 
 	if (pfn < oldmem_data.size >> PAGE_SHIFT) {
 		size_old = min(size, oldmem_data.size - (pfn << PAGE_SHIFT));
-		rc = remap_pfn_range(vma, from,
+		rc = remap_pfn_range_complete(vma, from,
 				     pfn + (oldmem_data.start >> PAGE_SHIFT),
 				     size_old, prot);
 		if (rc || size == size_old)
@@ -195,7 +195,7 @@ static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,
 		from += size_old;
 		pfn += size_old >> PAGE_SHIFT;
 	}
-	return remap_pfn_range(vma, from, pfn, size, prot);
+	return remap_pfn_range_complete(vma, from, pfn, size, prot);
 }
 
 /*
@@ -220,7 +220,7 @@ static int remap_oldmem_pfn_range_zfcpdump(struct vm_area_struct *vma,
 		from += size_hsa;
 		pfn += size_hsa >> PAGE_SHIFT;
 	}
-	return remap_pfn_range(vma, from, pfn, size, prot);
+	return remap_pfn_range_complete(vma, from, pfn, size, prot);
 }
 
 /*
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index f188bd900eb2..5e4e19c38d5e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -254,7 +254,7 @@ int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
 				  unsigned long size, pgprot_t prot)
 {
 	prot = pgprot_encrypted(prot);
-	return remap_pfn_range(vma, from, pfn, size, prot);
+	return remap_pfn_range_complete(vma, from, pfn, size, prot);
 }
 
 /*
@@ -308,7 +308,7 @@ static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
 			tsz = min(offset + (u64)dump->size - start, (u64)size);
 			buf = dump->buf + start - offset;
 			if (remap_vmalloc_range_partial(vma, dst, buf, 0,
-							tsz))
+							tsz, /* set_vma= */ false))
 				return -EFAULT;
 
 			size -= tsz;
@@ -588,24 +588,40 @@ static int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
 	return ret;
 }
 
-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
 {
-	size_t size = vma->vm_end - vma->vm_start;
-	u64 start, end, len, tsz;
-	struct vmcore_range *m;
+	size_t size = vma_desc_size(desc);
+	u64 start, end;
 
-	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+	start = (u64)desc->pgoff << PAGE_SHIFT;
 	end = start + size;
 
 	if (size > vmcore_size || end > vmcore_size)
 		return -EINVAL;
 
-	if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+	if (desc->vm_flags & (VM_WRITE | VM_EXEC))
 		return -EPERM;
 
-	vm_flags_mod(vma, VM_MIXEDMAP, VM_MAYWRITE | VM_MAYEXEC);
-	vma->vm_ops = &vmcore_mmap_ops;
+	desc->vm_flags |= VM_MIXEDMAP | VM_REMAP_FLAGS;
+	desc->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	desc->vm_ops = &vmcore_mmap_ops;
+
+	/*
+	 * No need for remap_pfn_range_prepare() as we ensure non-CoW by
+	 * clearing VM_MAYWRITE.
+	 */
+
+	return 0;
+}
+
+static int mmap_complete_vmcore(struct file *file, struct vm_area_struct *vma,
+	const void *context)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	u64 start, len, tsz;
+	struct vmcore_range *m;
 
+	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
 	len = 0;
 
 	if (start < elfcorebuf_sz) {
@@ -613,8 +629,8 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
 		tsz = min(elfcorebuf_sz - (size_t)start, size);
 		pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
-		if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
-				    vma->vm_page_prot))
+		if (remap_pfn_range_complete(vma, vma->vm_start, pfn, tsz,
+					     vma->vm_page_prot))
 			return -EAGAIN;
 		size -= tsz;
 		start += tsz;
@@ -664,7 +680,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
 		kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
-						kaddr, 0, tsz))
+				kaddr, 0, tsz, /* set_vma= */ false))
 			goto fail;
 
 		size -= tsz;
@@ -701,7 +717,13 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	return -EAGAIN;
 }
 #else
-static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+static int mmap_prepare_vmcore(struct vm_area_desc *desc)
+{
+	return -ENOSYS;
+}
+
+static int mmap_complete_vmcore(struct file *file, struct vm_area_struct *vma,
+		const void *context)
 {
 	return -ENOSYS;
 }
@@ -712,7 +734,8 @@ static const struct proc_ops vmcore_proc_ops = {
 	.proc_release	= release_vmcore,
 	.proc_read_iter	= read_vmcore,
 	.proc_lseek	= default_llseek,
-	.proc_mmap	= mmap_vmcore,
+	.proc_mmap_prepare = mmap_prepare_vmcore,
+	.proc_mmap_complete = mmap_complete_vmcore,
 };
 
 static u64 get_vmcore_size(size_t elfsz, size_t elfnotesegsz,
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index eb54b7b3202f..588810e571aa 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -215,12 +215,12 @@ extern void *vmap(struct page **pages, unsigned int count,
 void *vmap_pfn(unsigned long *pfns, unsigned int count, pgprot_t prot);
 extern void vunmap(const void *addr);
 
-extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
-				       unsigned long uaddr, void *kaddr,
-				       unsigned long pgoff, unsigned long size);
+int remap_vmalloc_range_partial(struct vm_area_struct *vma,
+		unsigned long uaddr, void *kaddr, unsigned long pgoff,
+		unsigned long size, bool set_vma);
 
-extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
-							unsigned long pgoff);
+int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
+		unsigned long pgoff);
 
 int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
 		     struct page **pages, unsigned int page_shift);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 4249e1e01947..877b557b2482 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -4528,6 +4528,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
  * @kaddr:		virtual address of vmalloc kernel memory
  * @pgoff:		offset from @kaddr to start at
  * @size:		size of map area
+ * @set_vma:		If true, update VMA flags
  *
  * Returns:	0 for success, -Exxx on failure
  *
@@ -4540,7 +4541,7 @@ long vread_iter(struct iov_iter *iter, const char *addr, size_t count)
  */
 int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
 				void *kaddr, unsigned long pgoff,
-				unsigned long size)
+				unsigned long size, bool set_vma)
 {
 	struct vm_struct *area;
 	unsigned long off;
@@ -4566,6 +4567,10 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
 		return -EINVAL;
 	kaddr += off;
 
+	/* If we shouldn't modify VMA flags, vm_insert_page() mustn't. */
+	if (!set_vma && !(vma->vm_flags & VM_MIXEDMAP))
+		return -EINVAL;
+
 	do {
 		struct page *page = vmalloc_to_page(kaddr);
 		int ret;
@@ -4579,7 +4584,11 @@ int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
 		size -= PAGE_SIZE;
 	} while (size > 0);
 
-	vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+	if (set_vma)
+		vm_flags_set(vma, VM_DONTEXPAND | VM_DONTDUMP);
+	else
+		VM_WARN_ON_ONCE((vma->vm_flags & (VM_DONTEXPAND | VM_DONTDUMP)) !=
+				(VM_DONTEXPAND | VM_DONTDUMP));
 
 	return 0;
 }
@@ -4603,7 +4612,8 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
 {
 	return remap_vmalloc_range_partial(vma, vma->vm_start,
 					   addr, pgoff,
-					   vma->vm_end - vma->vm_start);
+					   vma->vm_end - vma->vm_start,
+					   /* set_vma= */ true);
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (14 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 15/16] fs/proc: update vmcore to use .proc_mmap_[prepare, complete] Lorenzo Stoakes
@ 2025-09-08 11:10 ` Lorenzo Stoakes
  2025-09-08 13:30   ` Jason Gunthorpe
  2025-09-08 13:27 ` [PATCH 00/16] expand mmap_prepare functionality, port more users Jan Kara
  2025-09-09  8:31 ` Alexander Gordeev
  17 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 11:10 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Now that we have the capacity to set up the VMA in f_op->mmap_prepare and
then later, once the VMA is established, insert a mixed mapping in
f_op->mmap_complete, do so for kcov.

We utilise the desc->mmap_context field to pass context between
mmap_prepare and mmap_complete, conveniently providing the size over which
the mapping is performed.

Also note that we intentionally set VM_MIXEDMAP ahead of time so that,
upon mmap_complete being invoked, vm_insert_page() does not adjust the VMA
flags.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 kernel/kcov.c | 40 ++++++++++++++++++++++++++++------------
 1 file changed, 28 insertions(+), 12 deletions(-)

diff --git a/kernel/kcov.c b/kernel/kcov.c
index 1d85597057e1..53c8bcae54d0 100644
--- a/kernel/kcov.c
+++ b/kernel/kcov.c
@@ -484,23 +484,40 @@ void kcov_task_exit(struct task_struct *t)
 	kcov_put(kcov);
 }
 
-static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
+static int kcov_mmap_prepare(struct vm_area_desc *desc)
 {
-	int res = 0;
-	struct kcov *kcov = vma->vm_file->private_data;
-	unsigned long size, off;
-	struct page *page;
+	struct kcov *kcov = desc->file->private_data;
+	unsigned long size;
 	unsigned long flags;
+	int res = 0;
 
 	spin_lock_irqsave(&kcov->lock, flags);
 	size = kcov->size * sizeof(unsigned long);
-	if (kcov->area == NULL || vma->vm_pgoff != 0 ||
-	    vma->vm_end - vma->vm_start != size) {
+	if (kcov->area == NULL || desc->pgoff != 0 ||
+	    vma_desc_size(desc) != size) {
 		res = -EINVAL;
 		goto exit;
 	}
 	spin_unlock_irqrestore(&kcov->lock, flags);
-	vm_flags_set(vma, VM_DONTEXPAND);
+
+	desc->vm_flags |= VM_DONTEXPAND | VM_MIXEDMAP;
+	desc->mmap_context = (void *)size;
+
+	return 0;
+exit:
+	spin_unlock_irqrestore(&kcov->lock, flags);
+	return res;
+}
+
+static int kcov_mmap_complete(struct file *file, struct vm_area_struct *vma,
+			       const void *context)
+{
+	struct kcov *kcov = file->private_data;
+	unsigned long size = (unsigned long)context;
+	struct page *page;
+	unsigned long off;
+	int res;
+
 	for (off = 0; off < size; off += PAGE_SIZE) {
 		page = vmalloc_to_page(kcov->area + off);
 		res = vm_insert_page(vma, vma->vm_start + off, page);
@@ -509,10 +526,8 @@ static int kcov_mmap(struct file *filep, struct vm_area_struct *vma)
 			return res;
 		}
 	}
+
 	return 0;
-exit:
-	spin_unlock_irqrestore(&kcov->lock, flags);
-	return res;
 }
 
 static int kcov_open(struct inode *inode, struct file *filep)
@@ -761,7 +776,8 @@ static const struct file_operations kcov_fops = {
 	.open		= kcov_open,
 	.unlocked_ioctl	= kcov_ioctl,
 	.compat_ioctl	= kcov_ioctl,
-	.mmap		= kcov_mmap,
+	.mmap_prepare	= kcov_mmap_prepare,
+	.mmap_complete	= kcov_mmap_complete,
 	.release        = kcov_close,
 };
 
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 11:10 ` [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
@ 2025-09-08 12:51   ` Jason Gunthorpe
  2025-09-08 13:12     ` Lorenzo Stoakes
  2025-09-08 15:10   ` David Hildenbrand
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 12:51 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
>  static int secretmem_mmap_prepare(struct vm_area_desc *desc)
>  {
> -	const unsigned long len = desc->end - desc->start;
> +	const unsigned long len = vma_desc_size(desc);
>  
>  	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
>  		return -EINVAL;

I wonder if we should have some helper for this shared check too, it
is a bit tricky with the two flags. Forced-shared checks are pretty
common.

vma_desc_must_be_shared(desc) ?

Also 'must not be exec' is common too.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
@ 2025-09-08 12:55   ` Jason Gunthorpe
  2025-09-08 13:19     ` Lorenzo Stoakes
  2025-09-08 15:27   ` David Hildenbrand
  2025-09-09 16:44   ` Suren Baghdasaryan
  2 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 12:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:37PM +0100, Lorenzo Stoakes wrote:
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
> 
> In order to supply this, and to be able to move forward with removing
> f_op->mmap altogether, introduce f_op->mmap_complete.
> 
> This hook is called once the VMA is fully mapped and everything is done,
> however with the mmap write lock and VMA write locks held.
> 
> The hook is then provided with a fully initialised VMA which it can do what
> it needs with, though the mmap and VMA write locks must remain held
> throughout.
> 
> It is not intended that the VMA be modified at this point, attempts to do
> so will end in tears.

The commit message should call out if this has fixed the race
condition with unmap mapping range and prepopulation in mmap()..

> @@ -793,6 +793,11 @@ struct vm_area_desc {
>  	/* Write-only fields. */
>  	const struct vm_operations_struct *vm_ops;
>  	void *private_data;
> +	/*
> +	 * A user-defined field, value will be passed to mmap_complete,
> +	 * mmap_abort.
> +	 */
> +	void *mmap_context;

Seem strange, private_data and mmap_context? Something actually needs
both?

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 11:10 ` [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
@ 2025-09-08 13:00   ` Jason Gunthorpe
  2025-09-08 13:27     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:39PM +0100, Lorenzo Stoakes wrote:
> remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> it must be supplied with a correct PFN to do so. If the caller must hold
> locks to be able to do this, those locks should be held across the
> operation, and mmap_abort() should be provided to revoke the lock should an
> error arise.

It seems very strange to me that callers have to provide locks.

Today once mmap is called the vma priv should be allocated and access
to the PFN is allowed - access doesn't stop until the priv is
destroyed.

So whatever refcounting the driver must do to protect PFN must already
be in place and driven by the vma priv.

When split I'd expect the same thing the prepare should obtain the vma
priv and that locks the pfn. On complete the already affiliated PFN is
mapped to PTEs.

Why would any driver need a lock held to complete?

Arguably we should store the remap pfn in the desc and just make
complete a fully generic helper that fills the PTEs from the prepared
desc.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  2025-09-08 11:10 ` [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 13:11   ` Jason Gunthorpe
  2025-09-08 13:37     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:11 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  		vm_flags |= VM_NORESERVE;
>  
>  	if (hugetlb_reserve_pages(inode,
> -				vma->vm_pgoff >> huge_page_order(h),
> -				len >> huge_page_shift(h), vma,
> -				vm_flags) < 0)
> +			vma->vm_pgoff >> huge_page_order(h),
> +			len >> huge_page_shift(h), vma,
> +			vm_flags) < 0) {

It was split like this because vma is passed here right?

But hugetlb_reserve_pages() doesn't do much with the vma:

	hugetlb_vma_lock_alloc(vma);
[..]
	vma->vm_private_data = vma_lock;

Manipulates the private which should already exist in prepare:

Check non-share a few times:

	if (!vma || vma->vm_flags & VM_MAYSHARE) {
	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
	if (!vma || vma->vm_flags & VM_MAYSHARE) {

And does this resv_map stuff:

		set_vma_resv_map(vma, resv_map);
		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
[..]
	set_vma_private_data(vma, (unsigned long)map);

Which is also just manipulating the private data.

So it looks to me like it should be refactored so that
hugetlb_reserve_pages() returns the priv pointer to set in the VMA
instead of accepting vma as an argument. Maybe just pass in the desc
instead?

Then no need to introduce complete. I think it is probably better to
try to avoid using complete except for filling PTEs..

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 12:51   ` Jason Gunthorpe
@ 2025-09-08 13:12     ` Lorenzo Stoakes
  2025-09-08 13:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> >  static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> >  {
> > -	const unsigned long len = desc->end - desc->start;
> > +	const unsigned long len = vma_desc_size(desc);
> >
> >  	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> >  		return -EINVAL;
>
> I wonder if we should have some helper for this shared check too, it
> is a bit tricky with the two flags. Forced-shared checks are pretty
> common.

Sure can add.

>
> vma_desc_must_be_shared(desc) ?

Maybe _could_be_shared()?

>
> Also 'must not be exec' is common too.

Right, will have a look! :)

>
> Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 12:55   ` Jason Gunthorpe
@ 2025-09-08 13:19     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 09:55:26AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:37PM +0100, Lorenzo Stoakes wrote:
> > We have introduced the f_op->mmap_prepare hook to allow for setting up a
> > VMA far earlier in the process of mapping memory, reducing problematic
> > error handling paths, but this does not provide what all
> > drivers/filesystems need.
> >
> > In order to supply this, and to be able to move forward with removing
> > f_op->mmap altogether, introduce f_op->mmap_complete.
> >
> > This hook is called once the VMA is fully mapped and everything is done,
> > however with the mmap write lock and VMA write locks held.
> >
> > The hook is then provided with a fully initialised VMA which it can do what
> > it needs with, though the mmap and VMA write locks must remain held
> > throughout.
> >
> > It is not intended that the VMA be modified at this point, attempts to do
> > so will end in tears.
>
> The commit message should call out if this has fixed the race
> condition with unmap mapping range and prepopulation in mmap()..

To be clear, this isn't the intent of the series; the intent is to make it
possible for mmap_prepare to replace mmap. This is just a bonus :)

Looking at the discussion in [0] it seems the issue was that .mmap() is
called before the vma is actually correctly inserted into the maple tree.

This is no longer the case: we call .mmap_complete() once the VMA is fully
established, but before releasing the VMA/mmap write lock.

This should, presumably, resolve the race as stated?

I can add some blurb about this yes.


[0]: https://lore.kernel.org/linux-mm/20250801162930.GB184255@nvidia.com/


>
> > @@ -793,6 +793,11 @@ struct vm_area_desc {
> >  	/* Write-only fields. */
> >  	const struct vm_operations_struct *vm_ops;
> >  	void *private_data;
> > +	/*
> > +	 * A user-defined field, value will be passed to mmap_complete,
> > +	 * mmap_abort.
> > +	 */
> > +	void *mmap_context;
>
> Seem strange, private_data and mmap_context? Something actually needs
> both?

We are now doing something _new_ - we're splitting an operation that was
never split before.

Before a hook implementor could rely on there being state throughout the
_entire_ operation. But now they can't.

And they may already be putting context into private_data, which then gets
put into vma->vm_private_data for a VMA added to the maple tree and made
accessible.

So it is appropriate and convenient to allow for the transfer of state
between the two, and I already implement logic that does this.

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  2025-09-08 11:10 ` [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort Lorenzo Stoakes
@ 2025-09-08 13:24   ` Jason Gunthorpe
  2025-09-08 13:40     ` Lorenzo Stoakes
  2025-09-08 14:27     ` Lorenzo Stoakes
  0 siblings, 2 replies; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:24 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> resctl uses remap_pfn_range(), but holds a mutex over the
> operation. Therefore, establish the mutex in mmap_prepare(), release it in
> mmap_complete() and release it in mmap_abort() should the operation fail.

The mutex can't do anything relative to remap_pfn, no reason to hold it.

> @@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
>  		return -ENOSPC;
>  	}
>  
> -	memset(plr->kmem + off, 0, vsize);
> +	/* No CoW allowed so don't need to specify pfn. */
> +	remap_pfn_range_prepare(desc, 0);

This would be a good place to make a more generic helper..

 ret = remap_pfn_no_cow(desc, phys);

And it can consistently check for !shared internally.

Store phys in the desc and use common code to trigger the PTE population
during complete.

Jason


* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 13:00   ` Jason Gunthorpe
@ 2025-09-08 13:27     ` Lorenzo Stoakes
  2025-09-08 13:35       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:00:15AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:39PM +0100, Lorenzo Stoakes wrote:
> > remap_pfn_range_prepare() will set the cow vma->vm_pgoff if necessary, so
> > it must be supplied with a correct PFN to do so. If the caller must hold
> > locks to be able to do this, those locks should be held across the
> > operation, and mmap_abort() should be provided to revoke the lock should an
> > error arise.
>
> It seems very strange to me that callers have to provide locks.
>
> Today once mmap is called the vma priv should be allocated and access
> to the PFN is allowed - access doesn't stop until the priv is
> destroyed.
>
> So whatever refcounting the driver must do to protect PFN must already
> be in place and driven by the vma priv.
>
> When split I'd expect the same thing the prepare should obtain the vma
> priv and that locks the pfn. On complete the already affiliated PFN is
> mapped to PTEs.
>
> Why would any driver need a lock held to complete?


In general, again we're splitting an operation that wasn't previously split.

A hook implementor may need to hold the lock in order to stabilise whatever
is required to be stabilised across the two (of course, with careful
consideration of the fact that we're doing stuff between the two!)

It's not only remap that is a concern here, people do all kinds of weird
and wonderful things in .mmap(), sometimes in combination with remap.

This is what makes this so fun to try to change ;)

An implementor may also update state somehow which would need to be altered
should the operation fail, again something that would not have needed to be
considered previously, as it was all done in one.

>
> Arguably we should store the remap pfn in the desc and just make
> complete a fully generic helper that fills the PTEs from the prepared
> desc.

That's an interesting take actually.

Though I don't think we can _always_ do that, as drivers again do weird and
wonderful things and we need to have maximum flexibility here.

But we could have a generic function that could speed some things up here,
and have that assume desc->mmap_context contains the PFN.

You can see patch 12/16 for an example of mmap_abort in action.

I also wonder if we should add remap_pfn_range_prepare_nocow() - which can
assert !is_cow_mapping(desc->vm_flags) - and then that self-documents the
cases where we don't actually need the PFN on prepare (the PFN is only
needed for the hideous vm_pgoff hack on arches without a special page table
flag).

>
> Jason

Cheers, Lorenzo


* Re: [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete
  2025-09-08 11:10 ` [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 13:27   ` Jason Gunthorpe
  2025-09-08 13:44     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:27 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:44PM +0100, Lorenzo Stoakes wrote:
> We thread the state through the mmap_context, allowing for both PFN map and
> mixed mapped pre-population.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  fs/cramfs/inode.c | 134 +++++++++++++++++++++++++++++++---------------
>  1 file changed, 92 insertions(+), 42 deletions(-)
> 
> diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> index b002e9b734f9..11a11213304d 100644
> --- a/fs/cramfs/inode.c
> +++ b/fs/cramfs/inode.c
> @@ -59,6 +59,12 @@ static const struct address_space_operations cramfs_aops;
>  
>  static DEFINE_MUTEX(read_mutex);
>  
> +/* How should the mapping be completed? */
> +enum cramfs_mmap_state {
> +	NO_PREPOPULATE,
> +	PREPOPULATE_PFNMAP,
> +	PREPOPULATE_MIXEDMAP,
> +};
>  
>  /* These macros may change in future, to provide better st_ino semantics. */
>  #define OFFSET(x)	((x)->i_ino)
> @@ -342,34 +348,89 @@ static bool cramfs_last_page_is_shared(struct inode *inode)
>  	return memchr_inv(tail_data, 0, PAGE_SIZE - partial) ? true : false;
>  }
>  
> -static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
> +static int cramfs_physmem_mmap_complete(struct file *file, struct vm_area_struct *vma,
> +					const void *context)
>  {
>  	struct inode *inode = file_inode(file);
>  	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
> -	unsigned int pages, max_pages, offset;
>  	unsigned long address, pgoff = vma->vm_pgoff;
> -	char *bailout_reason;
> -	int ret;
> +	unsigned int pages, offset;
> +	enum cramfs_mmap_state mmap_state = (enum cramfs_mmap_state)context;
> +	int ret = 0;
>  
> -	ret = generic_file_readonly_mmap(file, vma);
> -	if (ret)
> -		return ret;
> +	if (mmap_state == NO_PREPOPULATE)
> +		return 0;

It would be nicer to have different ops than this; the normal op could
just call the generic helper and then there is only the mixed map op.

Makes me wonder if putting the op in the fops was right, a
mixed/non-mixed vm_ops would do this nicely.

Jason


* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (15 preceding siblings ...)
  2025-09-08 11:10 ` [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 13:27 ` Jan Kara
  2025-09-08 14:48   ` Lorenzo Stoakes
  2025-09-09  8:31 ` Alexander Gordeev
  17 siblings, 1 reply; 84+ messages in thread
From: Jan Kara @ 2025-09-08 13:27 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

Hi Lorenzo!

On Mon 08-09-25 12:10:31, Lorenzo Stoakes wrote:
> Since commit c84bf6dd2b83 ("mm: introduce new .mmap_prepare() file
> callback"), The f_op->mmap hook has been deprecated in favour of
> f_op->mmap_prepare.
> 
> This was introduced in order to make it possible for us to eventually
> eliminate the f_op->mmap hook which is highly problematic as it allows
> drivers and filesystems raw access to a VMA which is not yet correctly
> initialised.
> 
> This hook also introduces complexity for the memory mapping operation, as
> we must correctly unwind what we do should an error arises.
> 
> Overall this interface being so open has caused significant problems for
> us, including security issues, it is important for us to simply eliminate
> this as a source of problems.
> 
> Therefore this series continues what was established by extending the
> functionality further to permit more drivers and filesystems to use
> mmap_prepare.
> 
> After updating some areas that can simply use mmap_prepare as-is, and
> performing some housekeeping, we then introduce two new hooks:
> 
> f_op->mmap_complete - this is invoked at the point of the VMA having been
> correctly inserted, though with the VMA write lock still held. mmap_prepare
> must also be specified.
> 
> This expands the use of mmap_prepare to those callers which need to
> prepopulate mappings, as well as any which does genuinely require access to
> the VMA.
> 
> It's simple - we will let the caller access the VMA, but only once it's
> established. At this point unwinding issues is simple - we just unmap the
> VMA.
> 
> The VMA is also then correctly initialised at this stage so there can be no
> issues arising from a not-fully initialised VMA at this point.
> 
> The other newly added hook is:
> 
> f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> mmap_complete. This is called should an error arise between mmap_prepare
> and mmap_complete (not as a result of mmap_prepare but rather some other
> part of the mapping logic).
> 
> This is required in case mmap_prepare wishes to establish state or locks
> which need to be cleaned up on completion. If we did not provide this, then
> this could not be permitted as this cleanup would otherwise not occur
> should the mapping fail between the two calls.

So seeing these new hooks makes me wonder: shouldn't we rather implement
mmap(2) in a way more similar to how other f_op hooks such as ->read or
->write behave? I.e., a hook called at a rather high level - something like
from vm_mmap_pgoff() or a similar level - which would just call library
functions from MM for the stuff it needs to do. Filesystems would just do
their checks and call the generic mmap function with the vm_ops they want
to use; more complex users could then fill in the VMA before releasing
mmap_lock or do cleanup in case of failure... This would seem like a more
understandable API than several hooks with rules about when what gets
called.

								Honza

> 
> We then add split remap_pfn_range*() functions which allow for PFN remap (a
> typical mapping prepopulation operation) split between a prepare/complete
> step, as well as io_mremap_pfn_range_prepare, complete for a similar
> purpose.
> 
> From there we update various mm-adjacent logic to use this functionality as
> a first set of changes, as well as resctl and cramfs filesystems to round
> off the non-stacked filesystem instances.
> 
> 
> REVIEWER NOTE:
> ~~~~~~~~~~~~~~
> 
> I considered putting the complete, abort callbacks in vm_ops, however this
> won't work because then we would be unable to adjust helpers like
> generic_file_mmap_prepare() (which provides vm_ops) to provide the correct
> complete, abort callbacks.
> 
> Conceptually it also makes more sense to have these in f_op as they are
> one-off operations performed at mmap time to establish the VMA, rather than
> a property of the VMA itself.
> 
> Lorenzo Stoakes (16):
>   mm/shmem: update shmem to use mmap_prepare
>   device/dax: update devdax to use mmap_prepare
>   mm: add vma_desc_size(), vma_desc_pages() helpers
>   relay: update relay to use mmap_prepare
>   mm/vma: rename mmap internal functions to avoid confusion
>   mm: introduce the f_op->mmap_complete, mmap_abort hooks
>   doc: update porting, vfs documentation for mmap_[complete, abort]
>   mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
>   mm: introduce io_remap_pfn_range_prepare, complete
>   mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
>   mm: update mem char driver to use mmap_prepare, mmap_complete
>   mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
>   mm: update cramfs to use mmap_prepare, mmap_complete
>   fs/proc: add proc_mmap_[prepare, complete] hooks for procfs
>   fs/proc: update vmcore to use .proc_mmap_[prepare, complete]
>   kcov: update kcov to use mmap_prepare, mmap_complete
> 
>  Documentation/filesystems/porting.rst |   9 ++
>  Documentation/filesystems/vfs.rst     |  35 +++++++
>  arch/csky/include/asm/pgtable.h       |   5 +
>  arch/mips/alchemy/common/setup.c      |  28 +++++-
>  arch/mips/include/asm/pgtable.h       |  10 ++
>  arch/s390/kernel/crash_dump.c         |   6 +-
>  arch/sparc/include/asm/pgtable_32.h   |  29 +++++-
>  arch/sparc/include/asm/pgtable_64.h   |  29 +++++-
>  drivers/char/mem.c                    |  80 ++++++++-------
>  drivers/dax/device.c                  |  32 +++---
>  fs/cramfs/inode.c                     | 134 ++++++++++++++++++--------
>  fs/hugetlbfs/inode.c                  |  86 +++++++++--------
>  fs/ntfs3/file.c                       |   2 +-
>  fs/proc/inode.c                       |  13 ++-
>  fs/proc/vmcore.c                      |  53 +++++++---
>  fs/resctrl/pseudo_lock.c              |  56 ++++++++---
>  include/linux/fs.h                    |   4 +
>  include/linux/mm.h                    |  53 +++++++++-
>  include/linux/mm_types.h              |   5 +
>  include/linux/proc_fs.h               |   5 +
>  include/linux/shmem_fs.h              |   3 +-
>  include/linux/vmalloc.h               |  10 +-
>  kernel/kcov.c                         |  40 +++++---
>  kernel/relay.c                        |  32 +++---
>  mm/memory.c                           | 128 +++++++++++++++---------
>  mm/secretmem.c                        |   2 +-
>  mm/shmem.c                            |  49 +++++++---
>  mm/util.c                             |  18 +++-
>  mm/vma.c                              |  96 +++++++++++++++---
>  mm/vmalloc.c                          |  16 ++-
>  tools/testing/vma/vma_internal.h      |  31 +++++-
>  31 files changed, 810 insertions(+), 289 deletions(-)
> 
> --
> 2.51.0
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete
  2025-09-08 11:10 ` [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete Lorenzo Stoakes
@ 2025-09-08 13:30   ` Jason Gunthorpe
  2025-09-08 13:47     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:10:47PM +0100, Lorenzo Stoakes wrote:
> Now we have the capacity to set up the VMA in f_op->mmap_prepare and then
> later, once the VMA is established, insert a mixed mapping in
> f_op->mmap_complete, do so for kcov.
> 
> We utilise the context desc->mmap_context field to pass context between
> mmap_prepare and mmap_complete to conveniently provide the size over which
> the mapping is performed.

Why?

+	    vma_desc_size(desc) != size) {
+  		res = -EINVAL;

Just call some vma_size()?

Jason


* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 13:12     ` Lorenzo Stoakes
@ 2025-09-08 13:32       ` Jason Gunthorpe
  2025-09-08 14:09         ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:32 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 02:12:00PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > >  static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > >  {
> > > -	const unsigned long len = desc->end - desc->start;
> > > +	const unsigned long len = vma_desc_size(desc);
> > >
> > >  	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > >  		return -EINVAL;
> >
> > I wonder if we should have some helper for this shared check too, it
> > is a bit tricky with the two flags. Forced-shared checks are pretty
> > common.
> 
> Sure can add.
> 
> >
> > vma_desc_must_be_shared(desc) ?
> 
> Maybe _could_be_shared()?

It is not could, it is must. 

Perhaps

!vma_desc_cowable()

Is what many drivers are really trying to assert.

Jason


* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 13:27     ` Lorenzo Stoakes
@ 2025-09-08 13:35       ` Jason Gunthorpe
  2025-09-08 14:18         ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:35 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:

> It's not only remap that is a concern here, people do all kinds of weird
> and wonderful things in .mmap(), sometimes in combination with remap.

So it should really not be split this way; complete is a badly named
prepopulate and it should only fill the PTEs, which shouldn't need
more locking.

The only example in this series didn't actually need to hold the lock.

Jason


* Re: [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  2025-09-08 13:11   ` Jason Gunthorpe
@ 2025-09-08 13:37     ` Lorenzo Stoakes
  2025-09-08 13:52       ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:11:21AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> > @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> >  		vm_flags |= VM_NORESERVE;
> >
> >  	if (hugetlb_reserve_pages(inode,
> > -				vma->vm_pgoff >> huge_page_order(h),
> > -				len >> huge_page_shift(h), vma,
> > -				vm_flags) < 0)
> > +			vma->vm_pgoff >> huge_page_order(h),
> > +			len >> huge_page_shift(h), vma,
> > +			vm_flags) < 0) {
>
> It was split like this because vma is passed here right?
>
> But hugetlb_reserve_pages() doesn't do much with the vma:
>
> 	hugetlb_vma_lock_alloc(vma);
> [..]
> 	vma->vm_private_data = vma_lock;
>
> Manipulates the private which should already exist in prepare:
>
> Check non-share a few times:
>
> 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> 	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
> 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
>
> And does this resv_map stuff:
>
> 		set_vma_resv_map(vma, resv_map);
> 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> [..]
> 	set_vma_private_data(vma, (unsigned long)map);
>
> Which is also just manipulating the private data.
>
> So it looks to me like it should be refactored so that
> hugetlb_reserve_pages() returns the priv pointer to set in the VMA
> instead of accepting vma as an argument. Maybe just pass in the desc
> instead?

Well hugetlb_vma_lock_alloc() does:

	vma_lock->vma = vma;

Which we cannot do in prepare.

This is checked in hugetlb_dup_vma_private(), and obviously desc is not a stable
pointer to be used for comparing anything.

I'm also trying to do the minimal changes I can here, I'd rather not majorly
refactor things to suit this change if possible.

>
> Then no need to introduce complete. I think it is probably better to
> try to avoid using complete except for filling PTEs..

I'd rather do that yes. hugetlbfs is the exception to many rules, unfortunately.

>
> Jason

Cheers, Lorenzo


* Re: [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  2025-09-08 13:24   ` Jason Gunthorpe
@ 2025-09-08 13:40     ` Lorenzo Stoakes
  2025-09-08 14:27     ` Lorenzo Stoakes
  1 sibling, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:24:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> > resctl uses remap_pfn_range(), but holds a mutex over the
> > operation. Therefore, establish the mutex in mmap_prepare(), release it in
> > mmap_complete() and release it in mmap_abort() should the operation fail.
>
> The mutex can't do anything relative to remap_pfn, no reason to hold it.
>
> > @@ -1053,15 +1087,11 @@ static int pseudo_lock_dev_mmap(struct file *filp, struct vm_area_struct *vma)
> >  		return -ENOSPC;
> >  	}
> >
> > -	memset(plr->kmem + off, 0, vsize);
> > +	/* No CoW allowed so don't need to specify pfn. */
> > +	remap_pfn_range_prepare(desc, 0);
>
> This would be a good place to make a more generic helper..
>
>  ret = remap_pfn_no_cow(desc, phys);

Ha, funny I suggested a _no_cow() thing earlier :) seems we are agreed on that
then!

Presumably you mean remap_pfn_no_cow_prepare()?

>
> And it can consistently check for !shared internally.
>
> Store phys in the desc and use common code to trigger the PTE population
> during complete.

We can use mmap_context for this. I guess it's not a terrible idea to set
.pfn, but I just don't want to add any confusion as to what doing that
means in the non-generic mmap_complete case.

>
> Jason

Cheers, Lorenzo


* Re: [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete
  2025-09-08 13:27   ` Jason Gunthorpe
@ 2025-09-08 13:44     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:27:23AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:44PM +0100, Lorenzo Stoakes wrote:
> > We thread the state through the mmap_context, allowing for both PFN map and
> > mixed mapped pre-population.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  fs/cramfs/inode.c | 134 +++++++++++++++++++++++++++++++---------------
> >  1 file changed, 92 insertions(+), 42 deletions(-)
> >
> > diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
> > index b002e9b734f9..11a11213304d 100644
> > --- a/fs/cramfs/inode.c
> > +++ b/fs/cramfs/inode.c
> > @@ -59,6 +59,12 @@ static const struct address_space_operations cramfs_aops;
> >
> >  static DEFINE_MUTEX(read_mutex);
> >
> > +/* How should the mapping be completed? */
> > +enum cramfs_mmap_state {
> > +	NO_PREPOPULATE,
> > +	PREPOPULATE_PFNMAP,
> > +	PREPOPULATE_MIXEDMAP,
> > +};
> >
> >  /* These macros may change in future, to provide better st_ino semantics. */
> >  #define OFFSET(x)	((x)->i_ino)
> > @@ -342,34 +348,89 @@ static bool cramfs_last_page_is_shared(struct inode *inode)
> >  	return memchr_inv(tail_data, 0, PAGE_SIZE - partial) ? true : false;
> >  }
> >
> > -static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
> > +static int cramfs_physmem_mmap_complete(struct file *file, struct vm_area_struct *vma,
> > +					const void *context)
> >  {
> >  	struct inode *inode = file_inode(file);
> >  	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
> > -	unsigned int pages, max_pages, offset;
> >  	unsigned long address, pgoff = vma->vm_pgoff;
> > -	char *bailout_reason;
> > -	int ret;
> > +	unsigned int pages, offset;
> > +	enum cramfs_mmap_state mmap_state = (enum cramfs_mmap_state)context;
> > +	int ret = 0;
> >
> > -	ret = generic_file_readonly_mmap(file, vma);
> > -	if (ret)
> > -		return ret;
> > +	if (mmap_state == NO_PREPOPULATE)
> > +		return 0;
>
> It would be nicer to have different ops than this, the normal op could
> just call the generic helper and then there is only the mixed map op.

Right, but I can't stop to refactor everything I change, or this effort will
take even longer.

I do have to compromise a _little_ on that, as there are ~250-odd callsites to
go...

>
> Makes me wonder if putting the op in the fops was right, a
> mixed/non-mixed vm_ops would do this nicely.

I added a reviewer's note just for you in 00/16 :) I guess you missed it:

	REVIEWER NOTE:
	~~~~~~~~~~~~~~

	I considered putting the complete, abort callbacks in vm_ops,
	however this won't work because then we would be unable to adjust
	helpers like generic_file_mmap_prepare() (which provides vm_ops)
	to provide the correct complete, abort callbacks.

	Conceptually it also makes more sense to have these in f_op as they
	are one-off operations performed at mmap time to establish the VMA,
	rather than a property of the VMA itself.

Basically, existing generic code sets vm_ops to something already, now we'd
need to somehow also vary it on this as well or nest vm_ops? I don't think
it's workable.

I found this out because I started working on this series with the complete
callback as part of vm_ops then hit this stumbling block as a result.

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete
  2025-09-08 13:30   ` Jason Gunthorpe
@ 2025-09-08 13:47     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 13:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:30:13AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:47PM +0100, Lorenzo Stoakes wrote:
> > Now we have the capacity to set up the VMA in f_op->mmap_prepare and then
> > later, once the VMA is established, insert a mixed mapping in
> > f_op->mmap_complete, do so for kcov.
> >
> > We utilise the context desc->mmap_context field to pass context between
> > mmap_prepare and mmap_complete to conveniently provide the size over which
> > the mapping is performed.
>
> Why?
>
> +	    vma_desc_size(desc) != size) {
> +  		res = -EINVAL;
>
> Just call some vma_size()?

Ah yeah, we can do that, you're right - as we assert vma_desc_size() == size,
will fix that, thanks!

There is no vma_size() though, which is weird to me. There is vma_pages() <<
PAGE_SHIFT though...

Maybe one to add!
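For what it's worth, the mooted helper would amount to something like this (hypothetical sketch - there is no vma_size() in the tree at the time of writing, and the struct here is a minimal stand-in):

```c
#include <assert.h>

#define PAGE_SHIFT 12   /* 4 KiB pages, as on most architectures */

/* Minimal stand-in for the kernel's struct vm_area_struct */
struct vm_area_struct {
	unsigned long vm_start; /* inclusive start address */
	unsigned long vm_end;   /* exclusive end address */
};

/* Existing idiom: number of pages spanned by the VMA */
static inline unsigned long vma_pages(const struct vm_area_struct *vma)
{
	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
}

/* The mooted helper: size of the VMA in bytes */
static inline unsigned long vma_size(const struct vm_area_struct *vma)
{
	return vma->vm_end - vma->vm_start;
}
```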

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  2025-09-08 13:37     ` Lorenzo Stoakes
@ 2025-09-08 13:52       ` Jason Gunthorpe
  2025-09-08 14:19         ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 13:52 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 02:37:44PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 10:11:21AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> > > @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> > >  		vm_flags |= VM_NORESERVE;
> > >
> > >  	if (hugetlb_reserve_pages(inode,
> > > -				vma->vm_pgoff >> huge_page_order(h),
> > > -				len >> huge_page_shift(h), vma,
> > > -				vm_flags) < 0)
> > > +			vma->vm_pgoff >> huge_page_order(h),
> > > +			len >> huge_page_shift(h), vma,
> > > +			vm_flags) < 0) {
> >
> > It was split like this because vma is passed here right?
> >
> > But hugetlb_reserve_pages() doesn't do much with the vma:
> >
> > 	hugetlb_vma_lock_alloc(vma);
> > [..]
> > 	vma->vm_private_data = vma_lock;
> >
> > Manipulates the private which should already exist in prepare:
> >
> > Check non-share a few times:
> >
> > 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> > 	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
> > 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> >
> > And does this resv_map stuff:
> >
> > 		set_vma_resv_map(vma, resv_map);
> > 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > [..]
> > 	set_vma_private_data(vma, (unsigned long)map);
> >
> > Which is also just manipulating the private data.
> >
> > So it looks to me like it should be refactored so that
> > hugetlb_reserve_pages() returns the priv pointer to set in the VMA
> > instead of accepting vma as an argument. Maybe just pass in the desc
> > instead?
> 
> Well hugetlb_vma_lock_alloc() does:
> 
> 	vma_lock->vma = vma;
> 
> Which we cannot do in prepare.

Okay, just doing that in commit would be appropriate then
 
> This is checked in hugetlb_dup_vma_private(), and obviously desc is not a stable
> pointer to be used for comparing anything.
> 
> I'm also trying to do the minimal changes I can here, I'd rather not majorly
> refactor things to suit this change if possible.

It doesn't look like a big refactor: pass the vma desc into
hugetlb_reserve_pages() and lift the vma_lock set out.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 13:32       ` Jason Gunthorpe
@ 2025-09-08 14:09         ` Lorenzo Stoakes
  2025-09-08 14:20           ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:32:24AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 02:12:00PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 09:51:01AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 12:10:34PM +0100, Lorenzo Stoakes wrote:
> > > >  static int secretmem_mmap_prepare(struct vm_area_desc *desc)
> > > >  {
> > > > -	const unsigned long len = desc->end - desc->start;
> > > > +	const unsigned long len = vma_desc_size(desc);
> > > >
> > > >  	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
> > > >  		return -EINVAL;
> > >
> > > I wonder if we should have some helper for this shared check too, it
> > > is a bit tricky with the two flags. Forced-shared checks are pretty
> > > common.
> >
> > Sure can add.
> >
> > >
> > > vma_desc_must_be_shared(desc) ?
> >
> > Maybe _could_be_shared()?
>
> It is not could, it is must.

I mean VM_MAYSHARE is a nonsense anyway, but _in theory_ VM_MAYSHARE &&
!VM_SHARED means we _could_ share it.

But in reality of course this isn't a real thing.

Perhaps vma_desc_is_shared() or something, I obviously don't want to get stuck
on semantics here :) [he says, while getting obviously stuck on semantics] :P

>
> Perhaps
>
> !vma_desc_cowable()
>
> Is what many drivers are really trying to assert.

Well no, because:

static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

Read-only means !CoW.

Hey, we've made a rod for our own backs! Again!
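The point about read-only mappings can be pinned down with a quick truth table (self-contained sketch; the function body is as quoted above, the flag values as in <linux/mm.h>):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* As quoted from the kernel: a mapping is CoW iff it is private
 * (!VM_SHARED) and could become writable (VM_MAYWRITE). */
static inline bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

So a private mapping with VM_MAYWRITE clear is indeed !CoW, which is exactly the corner a cowable-style helper would have to get right.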

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 13:35       ` Jason Gunthorpe
@ 2025-09-08 14:18         ` Lorenzo Stoakes
  2025-09-08 16:03           ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
>
> > It's not only remap that is a concern here, people do all kinds of weird
> > and wonderful things in .mmap(), sometimes in combination with remap.
>
> So it should really not be split this way, complete is a badly named

I don't understand, you think we can avoid splitting this in two? If so, I
disagree.

We have two stages, _intentionally_ split this way to avoid the issues the
original mmap_prepare series was created to address:

1. 'Hey, how do we configure this VMA we have _not yet set up_?'
2. 'OK, it's set up - now do you want to do something else?'

I'm sorry but I'm not sure how we could otherwise do this.

Keep in mind re: point 1, we _need_ the VMA to be established enough to check
for merge etc.

Another key aim of this change was to eliminate the need for a merge re-check.
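The two-stage flow described here can be modelled roughly as follows (a toy sketch with made-up names and fields, not the kernel's real call chain):

```c
#include <assert.h>

/* Stand-ins for the real structures */
struct vma_desc { unsigned long vm_flags; };
struct vma      { unsigned long vm_flags; int populated; };

/* Stage 1: configure the not-yet-created VMA via a descriptor */
static int mmap_prepare(struct vma_desc *desc)
{
	desc->vm_flags |= 0x1;  /* e.g. request a PFN map */
	return 0;
}

/* Stage 2: the VMA now exists and is initialised; e.g. prepopulate PTEs */
static int mmap_complete(struct vma *vma)
{
	vma->populated = 1;
	return 0;
}

static int do_mmap(struct vma *out)
{
	struct vma_desc desc = { 0 };
	int err = mmap_prepare(&desc);

	if (err)
		return err;
	/* ...merge check and VMA insertion happen here, using desc... */
	out->vm_flags = desc.vm_flags;
	err = mmap_complete(out);
	if (err)
		return err;     /* unwinding is simple: just unmap the VMA */
	return 0;
}
```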

> prepopulate and it should only fill the PTEs, which shouldn't need
> more locking.
>
> The only example in this series didn't actually need to hold the lock.

There's ~250 more mmap callbacks to work through. Do you provide a guarantee
that:

- All 250 absolutely only need access to the VMAs to perform prepopulation of
  this nature.

- That absolutely none will set up state in the prepopulate step that might need
  to be unwound should an error arise?

Keeping in mind I must remain practical re: refactoring each caller.

I mean, let me go check what you say re: the resctl lock, if you're right I
could drop mmap_abort for now and add it later if needed.

But re: calling mmap_complete prepopulate, I don't really think that's sensible.

mmap_prepare is invoked at the point the mapping is being prepared, and
mmap_complete is invoked once that preparation is complete, to allow further
actions.

I'm obviously open to naming suggestions, but I think it's safer to consistently
refer to where we are in the lifecycle rather than presuming what the caller
might do.

(I'd _prefer_ they always did just prepopulate, but I just don't think we
necessarily can).

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete
  2025-09-08 13:52       ` Jason Gunthorpe
@ 2025-09-08 14:19         ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:52:40AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 02:37:44PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 10:11:21AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 12:10:41PM +0100, Lorenzo Stoakes wrote:
> > > > @@ -151,20 +123,55 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  		vm_flags |= VM_NORESERVE;
> > > >
> > > >  	if (hugetlb_reserve_pages(inode,
> > > > -				vma->vm_pgoff >> huge_page_order(h),
> > > > -				len >> huge_page_shift(h), vma,
> > > > -				vm_flags) < 0)
> > > > +			vma->vm_pgoff >> huge_page_order(h),
> > > > +			len >> huge_page_shift(h), vma,
> > > > +			vm_flags) < 0) {
> > >
> > > It was split like this because vma is passed here right?
> > >
> > > But hugetlb_reserve_pages() doesn't do much with the vma:
> > >
> > > 	hugetlb_vma_lock_alloc(vma);
> > > [..]
> > > 	vma->vm_private_data = vma_lock;
> > >
> > > Manipulates the private which should already exist in prepare:
> > >
> > > Check non-share a few times:
> > >
> > > 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> > > 	if (vma && !(vma->vm_flags & VM_MAYSHARE) && h_cg) {
> > > 	if (!vma || vma->vm_flags & VM_MAYSHARE) {
> > >
> > > And does this resv_map stuff:
> > >
> > > 		set_vma_resv_map(vma, resv_map);
> > > 		set_vma_resv_flags(vma, HPAGE_RESV_OWNER);
> > > [..]
> > > 	set_vma_private_data(vma, (unsigned long)map);
> > >
> > > Which is also just manipulating the private data.
> > >
> > > So it looks to me like it should be refactored so that
> > > hugetlb_reserve_pages() returns the priv pointer to set in the VMA
> > > instead of accepting vma as an argument. Maybe just pass in the desc
> > > instead?
> >
> > Well hugetlb_vma_lock_alloc() does:
> >
> > 	vma_lock->vma = vma;
> >
> > Which we cannot do in prepare.
>
> Okay, just doing that in commit would be appropriate then
>
> > This is checked in hugetlb_dup_vma_private(), and obviously desc is not a stable
> > pointer to be used for comparing anything.
> >
> > I'm also trying to do the minimal changes I can here, I'd rather not majorly
> > refactor things to suit this change if possible.
>
> It doesn't look like a big refactor: pass the vma desc into
> hugetlb_reserve_pages() and lift the vma_lock set out.

OK, I'll take a look at refactoring this.

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 14:09         ` Lorenzo Stoakes
@ 2025-09-08 14:20           ` Jason Gunthorpe
  2025-09-08 14:47             ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 14:20 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > Perhaps
> >
> > !vma_desc_cowable()
> >
> > Is what many drivers are really trying to assert.
> 
> Well no, because:
> 
> static inline bool is_cow_mapping(vm_flags_t flags)
> {
> 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> }
> 
> Read-only means !CoW.

What drivers want when they check SHARED is to prevent COW. It is COW
that causes problems for whatever the driver is doing, so calling the
helper cowable and making the test actually right for that is a good thing.

COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
something that is COW in future.

Drivers commonly do various things with VM_SHARED to establish !COW,
but if that isn't actually right then let's fix it to be clear and
correct.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort
  2025-09-08 13:24   ` Jason Gunthorpe
  2025-09-08 13:40     ` Lorenzo Stoakes
@ 2025-09-08 14:27     ` Lorenzo Stoakes
  1 sibling, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 10:24:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 12:10:43PM +0100, Lorenzo Stoakes wrote:
> > resctl uses remap_pfn_range(), but holds a mutex over the
> > operation. Therefore, establish the mutex in mmap_prepare(), release it in
> > mmap_complete() and release it in mmap_abort() should the operation fail.
>
> The mutex can't do anything relative to remap_pfn, no reason to hold it.

Sorry I missed this bit before...

Yeah I guess my concern was that the original code very intentionally holds the
mutex _over the remap operation_.

But I guess given we release the lock on failure this isn't necessary, and of
course the lock has no bearing on the actual remap.

Will drop it and drop mmap_abort for now as it's not yet needed.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 14:20           ` Jason Gunthorpe
@ 2025-09-08 14:47             ` Lorenzo Stoakes
  2025-09-08 15:07               ` David Hildenbrand
  2025-09-08 15:16               ` Jason Gunthorpe
  0 siblings, 2 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > Perhaps
> > >
> > > !vma_desc_cowable()
> > >
> > > Is what many drivers are really trying to assert.
> >
> > Well no, because:
> >
> > static inline bool is_cow_mapping(vm_flags_t flags)
> > {
> > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > }
> >
> > Read-only means !CoW.
>
> What drivers want when they check SHARED is to prevent COW. It is COW
> that causes problems for whatever the driver is doing, so calling the
> helper cowable and making the test actually right for is a good thing.
>
> > COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
> > something that is COW in future.

But you can't do that if !VM_MAYWRITE.

I mean probably the driver's just wrong and should use is_cow_mapping() tbh.

>
> Drivers have commonly various things with VM_SHARED to establish !COW,
> but if that isn't actually right then lets fix it to be clear and
> correct.

I think we need to be cautious of scope here :) I don't want to accidentally
break things this way.

OK, I think a sensible way forward - how about I add desc_is_cowable() or
vma_desc_cowable() and only use it where I'm confident it's correct?

That way I can achieve both aims at once.

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-08 13:27 ` [PATCH 00/16] expand mmap_prepare functionality, port more users Jan Kara
@ 2025-09-08 14:48   ` Lorenzo Stoakes
  2025-09-08 15:04     ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 14:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On Mon, Sep 08, 2025 at 03:27:52PM +0200, Jan Kara wrote:
> Hi Lorenzo!

Hey! :)

> > After updating some areas that can simply use mmap_prepare as-is, and
> > performing some housekeeping, we then introduce two new hooks:
> >
> > f_op->mmap_complete - this is invoked at the point of the VMA having been
> > correctly inserted, though with the VMA write lock still held. mmap_prepare
> > must also be specified.
> >
> > This expands the use of mmap_prepare to those callers which need to
> > prepopulate mappings, as well as any which do genuinely require access to
> > the VMA.
> >
> > It's simple - we will let the caller access the VMA, but only once it's
> > established. At this point unwinding issues is simple - we just unmap the
> > VMA.
> >
> > The VMA is also then correctly initialised at this stage so there can be no
> > issues arising from a not-fully initialised VMA at this point.
> >
> > The other newly added hook is:
> >
> > f_op->mmap_abort - this is only valid in conjunction with mmap_prepare and
> > mmap_complete. This is called should an error arise between mmap_prepare
> > and mmap_complete (not as a result of mmap_prepare but rather some other
> > part of the mapping logic).
> >
> > This is required in case mmap_prepare wishes to establish state or locks
> > which need to be cleaned up on completion. If we did not provide this, then
> > this could not be permitted as this cleanup would otherwise not occur
> > should the mapping fail between the two calls.
>
> So seeing these new hooks makes me wonder: Shouldn't we rather implement
> mmap(2) in a way more similar to how other f_op hooks like ->read or
> ->write behave? I.e., a hook called at a rather high level - something like
> from vm_mmap_pgoff() or a similar level - which would just call library
> functions from MM for the stuff it needs to do. Filesystems would just do
> their checks and call the generic mmap function with the vm_ops they want
> to use, more complex users could then fill in the VMA before releasing
> mmap_lock or do cleanup in case of failure... This would seem like a more
> understandable API than several hooks with rules about when what gets called.

We can't just do everything at this level, because we need:

a. Information to actually know how to map the VMA before putting it in the
   maple tree.
b. Once it's there, the ability to do anything else required (typically,
   prepopulation).

The crux of this change is to avoid horrors around the VMA being passed
around not yet being properly initialised, and yet being accessible for
drivers to do 'whatever' with.

Ideally we'd have only one case, and for _nearly all_ filesystems this is
actually how it is.

But sadly some _do need_ to do extra work afterwards, most notably,
prepopulation.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare
  2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
@ 2025-09-08 14:59   ` David Hildenbrand
  2025-09-08 15:28     ` Lorenzo Stoakes
  2025-09-09  3:19   ` Baolin Wang
  1 sibling, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 14:59 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 02/16] device/dax: update devdax to use mmap_prepare
  2025-09-08 11:10 ` [PATCH 02/16] device/dax: update devdax " Lorenzo Stoakes
@ 2025-09-08 15:03   ` David Hildenbrand
  2025-09-08 15:28     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:03 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> The devdax driver does nothing special in its f_op->mmap hook, so
> straightforwardly update it to use the mmap_prepare hook instead.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>   drivers/dax/device.c | 32 +++++++++++++++++++++-----------
>   1 file changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 2bb40a6060af..c2181439f925 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -13,8 +13,9 @@
>   #include "dax-private.h"
>   #include "bus.h"
>   
> -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> -		const char *func)
> +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> +		       unsigned long start, unsigned long end, struct file *file,
> +		       const char *func)

In general

Acked-by: David Hildenbrand <david@redhat.com>

The only thing that bugs me is __check_vma() that does not check a vma.

Maybe something along the lines of

"check_vma_properties"

Not sure.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-08 14:48   ` Lorenzo Stoakes
@ 2025-09-08 15:04     ` Jason Gunthorpe
  2025-09-08 15:15       ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 15:04 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jan Kara, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> But sadly some _do need_ to do extra work afterwards, most notably,
> prepopulation.

I think Jan is suggesting something more like

mmap_op()
{
   struct vma_desc desc = {};

   desc.[..] = x
   desc.[..] = y
   desc.[..] = z
   vma = vma_alloc(desc);

   ret = remap_pfn(vma);
   if (ret) goto err_vma;

   return vma_commit(vma);

err_vma:
  vma_dealloc(vma);
  return ERR_PTR(ret);
}

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 14:47             ` Lorenzo Stoakes
@ 2025-09-08 15:07               ` David Hildenbrand
  2025-09-08 15:35                 ` Lorenzo Stoakes
  2025-09-08 15:16               ` Jason Gunthorpe
  1 sibling, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:07 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On 08.09.25 16:47, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
>> On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
>>>> Perhaps
>>>>
>>>> !vma_desc_cowable()
>>>>
>>>> Is what many drivers are really trying to assert.
>>>
>>> Well no, because:
>>>
>>> static inline bool is_cow_mapping(vm_flags_t flags)
>>> {
>>> 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>>> }
>>>
>>> Read-only means !CoW.
>>
>> What drivers want when they check SHARED is to prevent COW. It is COW
>> that causes problems for whatever the driver is doing, so calling the
>> helper cowable and making the test actually right for it is a good thing.
>>
>> COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
>> something that is COW in future.
> 
> But you can't do that if !VM_MAYWRITE.
> 
> I mean probably the driver's just wrong and should use is_cow_mapping() tbh.
> 
>>
>> Drivers have commonly done various things with VM_SHARED to establish !COW,
>> but if that isn't actually right then lets fix it to be clear and
>> correct.
> 
> I think we need to be cautious of scope here :) I don't want to accidentally
> break things this way.
> 
> OK I think a sensible way forward - How about I add desc_is_cowable() or
> vma_desc_cowable() and only set this if I'm confident it's correct?

I'll note that the naming is bad.

Why?

Because the vma_desc is not cowable. The underlying mapping maybe is.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 11:10 ` [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
  2025-09-08 12:51   ` Jason Gunthorpe
@ 2025-09-08 15:10   ` David Hildenbrand
  1 sibling, 0 replies; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:10 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> It's useful to be able to determine the size of a VMA descriptor range used
> on f_op->mmap_prepare, expressed both in bytes and pages, so add helpers
> for both and update code that could make use of it to do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>   fs/ntfs3/file.c    |  2 +-
>   include/linux/mm.h | 10 ++++++++++
>   mm/secretmem.c     |  2 +-
>   3 files changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ntfs3/file.c b/fs/ntfs3/file.c
> index c1ece707b195..86eb88f62714 100644
> --- a/fs/ntfs3/file.c
> +++ b/fs/ntfs3/file.c
> @@ -304,7 +304,7 @@ static int ntfs_file_mmap_prepare(struct vm_area_desc *desc)
>   
>   	if (rw) {
>   		u64 to = min_t(loff_t, i_size_read(inode),
> -			       from + desc->end - desc->start);
> +			       from + vma_desc_size(desc));
>   
>   		if (is_sparsed(ni)) {
>   			/* Allocate clusters for rw map. */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a6bfa46937a8..9d4508b20be3 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3560,6 +3560,16 @@ static inline unsigned long vma_pages(const struct vm_area_struct *vma)
>   	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
>   }
>   
> +static inline unsigned long vma_desc_size(struct vm_area_desc *desc)
> +{
> +	return desc->end - desc->start;
> +}
> +
> +static inline unsigned long vma_desc_pages(struct vm_area_desc *desc)
> +{
> +	return vma_desc_size(desc) >> PAGE_SHIFT;
> +}

"const struct vm_area_desc *" in both cases?

> +
>   /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
>   static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
>   				unsigned long vm_start, unsigned long vm_end)
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 60137305bc20..62066ddb1e9c 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -120,7 +120,7 @@ static int secretmem_release(struct inode *inode, struct file *file)
>   
>   static int secretmem_mmap_prepare(struct vm_area_desc *desc)
>   {
> -	const unsigned long len = desc->end - desc->start;
> +	const unsigned long len = vma_desc_size(desc);
>   
>   	if ((desc->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
>   		return -EINVAL;

We really want to forbid any private mappings here, independent of cow.

Maybe an is_private_mapping() helper

or a

vma_desc_is_private_mapping()

helper if we really need it

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 04/16] relay: update relay to use mmap_prepare
  2025-09-08 11:10 ` [PATCH 04/16] relay: update relay to use mmap_prepare Lorenzo Stoakes
@ 2025-09-08 15:15   ` David Hildenbrand
  2025-09-08 15:29     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:15 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> It is relatively trivial to update this code to use the f_op->mmap_prepare
> hook in favour of the deprecated f_op->mmap hook, so do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-08 15:04     ` Jason Gunthorpe
@ 2025-09-08 15:15       ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jan Kara, Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 12:04:04PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:48:36PM +0100, Lorenzo Stoakes wrote:
> > But sadly some _do need_ to do extra work afterwards, most notably,
> > prepopulation.
>
> I think Jan is suggesting something more like
>
> mmap_op()
> {
>    struct vma_desc desc = {};
>
>    desc.[..] = x
>    desc.[..] = y
>    desc.[..] = z
>    vma = vma_alloc(desc);
>
>    ret = remap_pfn(vma);
>    if (ret) goto err_vma;
>
>    return vma_commit(vma);
>
> err_vma:
>   vma_dealloc(vma);
>   return ERR_PTR(ret);
> }
>
> Jason

Right, unfortunately the locking and the subtle issues around memory mapping
really preclude something like this I think. We really do need to keep control
over that.

And since partly the motivation here is 'drivers do insane things when given too
much freedom', I feel this would not improve that :)

If you look at do_mmap() -> mmap_region() -> __mmap_region() etc. you can see a
lot of that.

We also had a security issue arise as a result of incorrect error path handling,
I don't think letting a driver writer handle that is wise.

It's a nice idea, but I just think this stuff is too sensitive for that. And in
any case, it wouldn't likely be tractable to convert legacy code to this.

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 14:47             ` Lorenzo Stoakes
  2025-09-08 15:07               ` David Hildenbrand
@ 2025-09-08 15:16               ` Jason Gunthorpe
  2025-09-08 15:24                 ` David Hildenbrand
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 15:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 03:47:34PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > > Perhaps
> > > >
> > > > !vma_desc_cowable()
> > > >
> > > > Is what many drivers are really trying to assert.
> > >
> > > Well no, because:
> > >
> > > static inline bool is_cow_mapping(vm_flags_t flags)
> > > {
> > > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > > }
> > >
> > > Read-only means !CoW.
> >
> > What drivers want when they check SHARED is to prevent COW. It is COW
> > that causes problems for whatever the driver is doing, so calling the
> > helper cowable and making the test actually right for it is a good thing.
> >
> > COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
> > something that is COW in future.
> 
> But you can't do that if !VM_MAYWRITE.

See, this is my fear: the drivers are wrong and you are talking about
edge cases nobody actually knows about.

The need is the created VMA, and its dups, never, ever becomes
COWable. This is what drivers actually want. We need to give them a
clear test to do that.

Anything using remap and checking for SHARED almost certainly falls
into this category as COWing remapped memory is rare and weird.
 
> I mean probably the driver's just wrong and should use
> is_cow_mapping() tbh.

Maybe.

> I think we need to be cautious of scope here :) I don't want to
> accidentally break things this way.

IMHO it is worth doing; when you get into more driver places it is far
more obvious why VM_SHARED is being checked.

> OK I think a sensible way forward - How about I add desc_is_cowable() or
> vma_desc_cowable() and only set this if I'm confident it's correct?

I'm thinking of calling it vma_desc_never_cowable() as that makes it much
clearer what the purpose is.

I think anyone just checking VM_SHARED should be changed over..

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion
  2025-09-08 11:10 ` [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion Lorenzo Stoakes
@ 2025-09-08 15:19   ` David Hildenbrand
  2025-09-08 15:31     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:19 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> Now we have the f_op->mmap_prepare() hook, having a static function called
> __mmap_prepare() that has nothing to do with it is confusing, so rename the
> function.
> 
> Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
> provide a f_op->mmap_complete() callback.

Isn't prologue the opposite of epilogue? :)

I guess I would just have done a

__mmap_prepare -> __mmap_setup()

and left the __mmap_complete() as is.


-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:16               ` Jason Gunthorpe
@ 2025-09-08 15:24                 ` David Hildenbrand
  2025-09-08 15:33                   ` Jason Gunthorpe
  2025-09-08 15:33                   ` Lorenzo Stoakes
  0 siblings, 2 replies; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

> 
>> I think we need to be cautious of scope here :) I don't want to
>> accidentally break things this way.
> 
> IMHO it is worth doing; when you get into more driver places it is far
> more obvious why VM_SHARED is being checked.
> 
>> OK I think a sensible way forward - How about I add desc_is_cowable() or
>> vma_desc_cowable() and only set this if I'm confident it's correct?
> 
> I'm thinking of calling it vma_desc_never_cowable() as that makes it much
> clearer what the purpose is.

Secretmem wants no private mappings. So we should check exactly that, 
not whether we might have a cow mapping.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
  2025-09-08 12:55   ` Jason Gunthorpe
@ 2025-09-08 15:27   ` David Hildenbrand
  2025-09-09  9:13     ` Lorenzo Stoakes
  2025-09-09 16:44   ` Suren Baghdasaryan
  2 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:27 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, Konstantin Komarov, Baoquan He, Vivek Goyal,
	Dave Young, Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On 08.09.25 13:10, Lorenzo Stoakes wrote:
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
> 
> In order to supply this, and to be able to move forward with removing
> f_op->mmap altogether, introduce f_op->mmap_complete.
> 
> This hook is called once the VMA is fully mapped and everything is done,
> however with the mmap write lock and VMA write locks held.
> 
> The hook is then provided with a fully initialised VMA which it can do what
> it needs with, though the mmap and VMA write locks must remain held
> throughout.
> 
> It is not intended that the VMA be modified at this point, attempts to do
> so will end in tears.
> 
> This allows for operations such as pre-population typically via a remap, or
> really anything that requires access to the VMA once initialised.
> 
> In addition, a caller may need to take a lock in mmap_prepare, when it is
> possible to modify the VMA, and release it on mmap_complete. In order to
> handle errors which may arise between the two operations, f_op->mmap_abort
> is provided.
> 
> This hook should be used to drop any lock and clean up anything before the
> VMA mapping operation is aborted. After this point the VMA will not be
> added to any mapping and will not exist.
> 
> We also add a new mmap_context field to the vm_area_desc type which can be
> used to pass information pertinent to any locks which are held or any state
> which is required for mmap_complete, abort to operate correctly.
> 
> We also update the compatibility layer for nested filesystems which
> currently still only specify an f_op->mmap() handler so that it correctly
> invokes f_op->mmap_complete as necessary (note that no error can occur
> between mmap_prepare and mmap_complete so mmap_abort will never be called
> in this case).
> 
> Also update the VMA tests to account for the changes.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>   include/linux/fs.h               |  4 ++
>   include/linux/mm_types.h         |  5 ++
>   mm/util.c                        | 18 +++++--
>   mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
>   tools/testing/vma/vma_internal.h | 31 ++++++++++--
>   5 files changed, 129 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 594bd4d0521e..bb432924993a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2195,6 +2195,10 @@ struct file_operations {
>   	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>   				unsigned int poll_flags);
>   	int (*mmap_prepare)(struct vm_area_desc *);
> +	int (*mmap_complete)(struct file *, struct vm_area_struct *,
> +			     const void *context);
> +	void (*mmap_abort)(const struct file *, const void *vm_private_data,
> +			   const void *context);

Do we have a description somewhere of what these things do, when they are
called, and what a driver may be allowed to do with a VMA?

In particular, the mmap_complete() looks like another candidate for 
letting a driver just go crazy on the vma? :)

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare
  2025-09-08 14:59   ` David Hildenbrand
@ 2025-09-08 15:28     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 04:59:46PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > This simply assigns the vm_ops so is easily updated - do so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
>
> Reviewed-by: David Hildenbrand <david@redhat.com>

Cheers!


>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 02/16] device/dax: update devdax to use mmap_prepare
  2025-09-08 15:03   ` David Hildenbrand
@ 2025-09-08 15:28     ` Lorenzo Stoakes
  2025-09-08 15:31       ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 05:03:54PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > The devdax driver does nothing special in its f_op->mmap hook, so
> > straightforwardly update it to use the mmap_prepare hook instead.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >   drivers/dax/device.c | 32 +++++++++++++++++++++-----------
> >   1 file changed, 21 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index 2bb40a6060af..c2181439f925 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -13,8 +13,9 @@
> >   #include "dax-private.h"
> >   #include "bus.h"
> > -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
> > -		const char *func)
> > +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
> > +		       unsigned long start, unsigned long end, struct file *file,
> > +		       const char *func)
>
> In general
>
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

>
> The only thing that bugs me is __check_vma() that does not check a vma.

Ah yeah, you're right.

>
> Maybe something along the lines of
>
> "check_vma_properties"

maybe check_vma_desc()?

>
> Not sure.
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 04/16] relay: update relay to use mmap_prepare
  2025-09-08 15:15   ` David Hildenbrand
@ 2025-09-08 15:29     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:29 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 05:15:12PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > It is relatively trivial to update this code to use the f_op->mmap_prepare
> > hook in favour of the deprecated f_op->mmap hook, so do so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
>
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!

>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 02/16] device/dax: update devdax to use mmap_prepare
  2025-09-08 15:28     ` Lorenzo Stoakes
@ 2025-09-08 15:31       ` David Hildenbrand
  0 siblings, 0 replies; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On 08.09.25 17:28, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 05:03:54PM +0200, David Hildenbrand wrote:
>> On 08.09.25 13:10, Lorenzo Stoakes wrote:
>>> The devdax driver does nothing special in its f_op->mmap hook, so
>>> straightforwardly update it to use the mmap_prepare hook instead.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>    drivers/dax/device.c | 32 +++++++++++++++++++++-----------
>>>    1 file changed, 21 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
>>> index 2bb40a6060af..c2181439f925 100644
>>> --- a/drivers/dax/device.c
>>> +++ b/drivers/dax/device.c
>>> @@ -13,8 +13,9 @@
>>>    #include "dax-private.h"
>>>    #include "bus.h"
>>> -static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
>>> -		const char *func)
>>> +static int __check_vma(struct dev_dax *dev_dax, vm_flags_t vm_flags,
>>> +		       unsigned long start, unsigned long end, struct file *file,
>>> +		       const char *func)
>>
>> In general
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Thanks!
> 
>>
>> The only thing that bugs me is __check_vma() that does not check a vma.
> 
> Ah yeah, you're right.
> 
>>
>> Maybe something along the lines of
>>
>> "check_vma_properties"
> 
> maybe check_vma_desc()?

Would also work, although it might imply that we are passing in a vma desc.

Well, you could let check_vma() construct a vma_desc and pass that to 
check_vma_desc() ...

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread
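David's suggestion above — have check_vma() build a vma_desc and hand it to a check_vma_desc() — can be sketched roughly as follows. All type, field, and function names here are hypothetical illustrations, not the actual drivers/dax code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: a subset of the descriptor fields discussed in
 * the thread, with made-up names. */
typedef unsigned long vm_flags_t;

#define VM_SHARED	0x00000008UL

struct vm_area_desc {
	unsigned long start;
	unsigned long end;
	vm_flags_t vm_flags;
};

/* check_vma_desc(): validates mapping properties without needing a
 * fully-initialised VMA, e.g. that the mapping is shared and that
 * both ends are aligned to the device alignment. */
static bool check_vma_desc(const struct vm_area_desc *desc,
			   unsigned long align)
{
	if (!(desc->vm_flags & VM_SHARED))
		return false;
	if ((desc->start | desc->end) & (align - 1))
		return false;
	return true;
}

/* check_vma() could then construct a desc from a real VMA and defer
 * to check_vma_desc(), as suggested above. */
```

The point of this shape is that check_vma_desc() needs only the properties available at mmap_prepare time, so the same check can serve both the legacy VMA path and the new descriptor path.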

* Re: [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion
  2025-09-08 15:19   ` David Hildenbrand
@ 2025-09-08 15:31     ` Lorenzo Stoakes
  2025-09-08 17:38       ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 05:19:18PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > Now we have the f_op->mmap_prepare() hook, having a static function called
> > __mmap_prepare() that has nothing to do with it is confusing, so rename the
> > function.
> >
> > Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
> > provide a f_op->mmap_complete() callback.
>
> Isn't prologue the opposite of epilogue? :)

:) well indeed, the prologue comes _first_ and the epilogue comes _last_. So we
rename the bit that comes first.

>
> I guess I would just have done a
>
> __mmap_prepare -> __mmap_setup()

Sure will rename to __mmap_setup().

>
> and left the __mmap_complete() as is.

But we are adding a 'mmap_complete' hook :)

I can think of another sensible name then if I'm being too abstract here...

__mmap_finish() or something.

>
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:24                 ` David Hildenbrand
@ 2025-09-08 15:33                   ` Jason Gunthorpe
  2025-09-08 15:46                     ` David Hildenbrand
  2025-09-08 15:33                   ` Lorenzo Stoakes
  1 sibling, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 15:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
> > 
> > > I think we need to be cautious of scope here :) I don't want to
> > > accidentally break things this way.
> > 
> > IMHO it is worth doing when you get into more driver places it is far
> > more obvious why the VM_SHARED is being checked.
> > 
> > > OK I think a sensible way forward - How about I add desc_is_cowable() or
> > > vma_desc_cowable() and only set this if I'm confident it's correct?
> > 
> > I'm thinking to call it vma_desc_never_cowable() as that is much
> > clearer about what the purpose is.
> 
> Secretmem wants no private mappings. So we should check exactly that, not
> whether we might have a cow mapping.

secretmem is checking shared for a different reason than many other places..

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:24                 ` David Hildenbrand
  2025-09-08 15:33                   ` Jason Gunthorpe
@ 2025-09-08 15:33                   ` Lorenzo Stoakes
  1 sibling, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:33 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
> >
> > > I think we need to be cautious of scope here :) I don't want to
> > > accidentally break things this way.
> >
> > IMHO it is worth doing when you get into more driver places it is far
> > more obvious why the VM_SHARED is being checked.
> >
> > > OK I think a sensible way forward - How about I add desc_is_cowable() or
> > > vma_desc_cowable() and only set this if I'm confident it's correct?
> >
> > I'm thinking to call it vma_desc_never_cowable() as that is much
> > clearer about what the purpose is.
>
> Secretmem wants no private mappings. So we should check exactly that, not
> whether we might have a cow mapping.

Well then :)

Probably in most cases what Jason is saying is valid for drivers.

So I can add a helper for both.

Maybe vma_desc_is_private() for this one?

>
> --
> Cheers
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:07               ` David Hildenbrand
@ 2025-09-08 15:35                 ` Lorenzo Stoakes
  2025-09-08 17:30                   ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 15:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 05:07:57PM +0200, David Hildenbrand wrote:
> On 08.09.25 16:47, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > > > Perhaps
> > > > >
> > > > > !vma_desc_cowable()
> > > > >
> > > > > Is what many drivers are really trying to assert.
> > > >
> > > > Well no, because:
> > > >
> > > > static inline bool is_cow_mapping(vm_flags_t flags)
> > > > {
> > > > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > > > }
> > > >
> > > > Read-only means !CoW.
> > >
> > > What drivers want when they check SHARED is to prevent COW. It is COW
> > > that causes problems for whatever the driver is doing, so calling the
> > > helper cowable and making the test actually right for it is a good thing.
> > >
> > > COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
> > > something that is COW in future.
> >
> > But you can't do that if !VM_MAYWRITE.
> >
> > I mean probably the driver's just wrong and should use is_cow_mapping() tbh.
> >
> > >
> > > Drivers have commonly various things with VM_SHARED to establish !COW,
> > > but if that isn't actually right then lets fix it to be clear and
> > > correct.
> >
> > I think we need to be cautious of scope here :) I don't want to accidentally
> > break things this way.
> >
> > OK I think a sensible way forward - How about I add desc_is_cowable() or
> > vma_desc_cowable() and only set this if I'm confident it's correct?
>
> I'll note that the naming is bad.
>
> Why?
>
> Because the vma_desc is not cowable. The underlying mapping maybe is.

Right, but the vma_desc describes a VMA being set up.

I mean is_cow_mapping(desc->vm_flags) isn't too egregious anyway, so maybe
just use that for that case?

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread
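For reference, the is_cow_mapping() helper quoted in the exchange above can be exercised standalone. The flag values below are copied from the kernel's mm.h; the harness itself is just an illustration:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;

/* Flag values as in the kernel's include/linux/mm.h. */
#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* As quoted in the thread: a mapping is CoW iff it is private
 * (!VM_SHARED) and may become writable (VM_MAYWRITE). */
static bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```

A private mapping with VM_MAYWRITE cleared (a genuinely read-only mapping) is not a CoW mapping — which is Lorenzo's point above that read-only means !CoW.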

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:33                   ` Jason Gunthorpe
@ 2025-09-08 15:46                     ` David Hildenbrand
  2025-09-08 15:50                       ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On 08.09.25 17:33, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
>>>
>>>> I think we need to be cautious of scope here :) I don't want to
>>>> accidentally break things this way.
>>>
>>> IMHO it is worth doing when you get into more driver places it is far
>>> more obvious why the VM_SHARED is being checked.
>>>
>>>> OK I think a sensible way forward - How about I add desc_is_cowable() or
>>>> vma_desc_cowable() and only set this if I'm confident it's correct?
>>>
>>> I'm thinking to call it vma_desc_never_cowable() as that is much
>>> clearer about what the purpose is.
>>
>> Secretmem wants no private mappings. So we should check exactly that, not
>> whether we might have a cow mapping.
> 
> secretmem is checking shared for a different reason than many other places..

I think many cases just don't want any private mappings.

After all, you need a R/O file (VM_MAYWRITE cleared) mapped MAP_PRIVATE 
to make is_cow_mapping() == false.

And at that point, you just mostly have a R/O MAP_SHARED mapping IIRC.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:46                     ` David Hildenbrand
@ 2025-09-08 15:50                       ` David Hildenbrand
  2025-09-08 15:56                         ` Jason Gunthorpe
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 15:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On 08.09.25 17:46, David Hildenbrand wrote:
> On 08.09.25 17:33, Jason Gunthorpe wrote:
>> On Mon, Sep 08, 2025 at 05:24:23PM +0200, David Hildenbrand wrote:
>>>>
>>>>> I think we need to be cautious of scope here :) I don't want to
>>>>> accidentally break things this way.
>>>>
>>>> IMHO it is worth doing when you get into more driver places it is far
>>>> more obvious why the VM_SHARED is being checked.
>>>>
>>>>> OK I think a sensible way forward - How about I add desc_is_cowable() or
>>>>> vma_desc_cowable() and only set this if I'm confident it's correct?
>>>>
>>>> I'm thinking to call it vma_desc_never_cowable() as that is much
>>>> clearer about what the purpose is.
>>>
>>> Secretmem wants no private mappings. So we should check exactly that, not
>>> whether we might have a cow mapping.
>>
>> secretmem is checking shared for a different reason than many other places..
> 
> I think many cases just don't want any private mappings.
> 
> After all, you need a R/O file (VM_MAYWRITE cleared) mapped MAP_PRIVATE
> to make is_cow_mapping() == false.

Sorry, was confused there. R/O file does not matter with MAP_PRIVATE. I 
think we default to VM_MAYWRITE with MAP_PRIVATE unless someone 
explicitly clears it.

So in practice there is indeed not a big difference between a private 
and cow mapping.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:50                       ` David Hildenbrand
@ 2025-09-08 15:56                         ` Jason Gunthorpe
  2025-09-08 17:36                           ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 15:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:

> So in practice there is indeed not a big difference between a private and
> cow mapping.

Right and most drivers just check SHARED.

But if we are documenting why they check shared: it is because the
driver cannot tolerate COW.

I think if someone is cargo-culting a driver and sees
'vma_never_cowable' they will have a better understanding of the
driver side issues.

Drivers don't actually care about private vs shared, except this
indirectly implies something about cow.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread
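The two predicates being debated in this subthread — "no private mappings" (secretmem) versus "can never be CoW" (most drivers, per Jason) — could be sketched as below. The helper names and the vm_area_desc layout are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

/* Hypothetical descriptor; the field name follows the thread. */
struct vm_area_desc {
	vm_flags_t vm_flags;
};

static bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* What secretmem wants: reject any private mapping outright. */
static bool vma_desc_is_private(const struct vm_area_desc *desc)
{
	return !(desc->vm_flags & VM_SHARED);
}

/* What most drivers want: a guarantee the mapping can never be CoW. */
static bool vma_desc_never_cowable(const struct vm_area_desc *desc)
{
	return !is_cow_mapping(desc->vm_flags);
}
```

The read-only private case (VM_MAYWRITE cleared, VM_SHARED clear) is exactly where the two predicates disagree, which is why David distinguishes secretmem's requirement from the common driver check.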

* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 14:18         ` Lorenzo Stoakes
@ 2025-09-08 16:03           ` Jason Gunthorpe
  2025-09-08 16:07             ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Jason Gunthorpe @ 2025-09-08 16:03 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 03:18:46PM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> >
> > > It's not only remap that is a concern here, people do all kinds of weird
> > > and wonderful things in .mmap(), sometimes in combination with remap.
> >
> > So it should really not be split this way, complete is a badly named
> 
> I don't understand, you think we can avoid splitting this in two? If so, I
> disagree.

I'm saying to the greatest extent possible complete should only
populate PTEs.

We should refrain from trying to use it for other things, because it
shouldn't need to be there.

> > The only example in this series didn't actually need to hold the lock.
> 
> There's ~250 more mmap callbacks to work through. Do you provide a guarantee
> that:

I'd be happy if only a small few need something weird and everything
else was aligned.

Jason

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete()
  2025-09-08 16:03           ` Jason Gunthorpe
@ 2025-09-08 16:07             ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 16:07 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 01:03:06PM -0300, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 03:18:46PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 10:35:38AM -0300, Jason Gunthorpe wrote:
> > > On Mon, Sep 08, 2025 at 02:27:12PM +0100, Lorenzo Stoakes wrote:
> > >
> > > > It's not only remap that is a concern here, people do all kinds of weird
> > > > and wonderful things in .mmap(), sometimes in combination with remap.
> > >
> > > So it should really not be split this way, complete is a badly named
> >
> > I don't understand, you think we can avoid splitting this in two? If so, I
> > disagree.
>
> I'm saying to the greatest extent possible complete should only
> populate PTEs.
>
> We should refrain from trying to use it for other things, because it
> shouldn't need to be there.

OK that sounds sensible, I will refactor to try to do only this in the
mmap_complete hook as far as is possible and see if I can use a generic function
also.

>
> > > The only example in this series didn't actually need to hold the lock.
> >
> > There's ~250 more mmap callbacks to work through. Do you provide a guarantee
> > that:
>
> I'd be happy if only a small few need something weird and everything
> else was aligned.

Ack!

>
> Jason

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread
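The split Jason is arguing for — validation and state setup in the prepare phase, with the complete phase doing nothing but populating PTEs — might look roughly like this toy sketch. The signatures, the pfn stashing, and the remap stand-in are all made up for illustration; the real hooks in the series differ:

```c
#include <assert.h>

/* Hypothetical descriptor and hook shapes; real signatures differ. */
struct vm_area_desc {
	unsigned long start, end;
	unsigned long pfn;	/* stashed by prepare for the complete phase */
};

static unsigned long remapped_start, remapped_end, remapped_pfn;

/* Stand-in for remap_pfn_range_complete(): just records its arguments. */
static int fake_remap(unsigned long start, unsigned long end, unsigned long pfn)
{
	remapped_start = start;
	remapped_end = end;
	remapped_pfn = pfn;
	return 0;
}

/* Phase 1: validate and configure; no page tables are touched here. */
static int my_mmap_prepare(struct vm_area_desc *desc, unsigned long pfn)
{
	if (desc->end <= desc->start)
		return -1;	/* -EINVAL, morally */
	desc->pfn = pfn;	/* defer population to the complete phase */
	return 0;
}

/* Phase 2: nothing but PTE population, per the discussion above. */
static int my_mmap_complete(const struct vm_area_desc *desc)
{
	return fake_remap(desc->start, desc->end, desc->pfn);
}

/* Driver-style flow tying the two phases together. */
static int do_mmap(unsigned long start, unsigned long end, unsigned long pfn)
{
	struct vm_area_desc desc = { .start = start, .end = end };
	int ret = my_mmap_prepare(&desc, pfn);

	if (ret)
		return ret;
	return my_mmap_complete(&desc);
}
```

Keeping the complete phase down to a single remap call is the property Jason asks for: anything weird belongs in prepare, where errors can still be unwound cheaply.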

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:35                 ` Lorenzo Stoakes
@ 2025-09-08 17:30                   ` David Hildenbrand
  2025-09-09  9:21                     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 17:30 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On 08.09.25 17:35, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 05:07:57PM +0200, David Hildenbrand wrote:
>> On 08.09.25 16:47, Lorenzo Stoakes wrote:
>>> On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
>>>> On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
>>>>>> Perhaps
>>>>>>
>>>>>> !vma_desc_cowable()
>>>>>>
>>>>>> Is what many drivers are really trying to assert.
>>>>>
>>>>> Well no, because:
>>>>>
>>>>> static inline bool is_cow_mapping(vm_flags_t flags)
>>>>> {
>>>>> 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
>>>>> }
>>>>>
>>>>> Read-only means !CoW.
>>>>
>>>> What drivers want when they check SHARED is to prevent COW. It is COW
>>>> that causes problems for whatever the driver is doing, so calling the
>>>> helper cowable and making the test actually right for it is a good thing.
>>>>
>>>> COW of this VMA, and no possibility to remap/mprotect/fork/etc it into
>>>> something that is COW in future.
>>>
>>> But you can't do that if !VM_MAYWRITE.
>>>
>>> I mean probably the driver's just wrong and should use is_cow_mapping() tbh.
>>>
>>>>
>>>> Drivers have commonly various things with VM_SHARED to establish !COW,
>>>> but if that isn't actually right then lets fix it to be clear and
>>>> correct.
>>>
>>> I think we need to be cautious of scope here :) I don't want to accidentally
>>> break things this way.
>>>
>>> OK I think a sensible way forward - How about I add desc_is_cowable() or
>>> vma_desc_cowable() and only set this if I'm confident it's correct?
>>
>> I'll note that the naming is bad.
>>
>> Why?
>>
>> Because the vma_desc is not cowable. The underlying mapping maybe is.
> 
> Right, but the vma_desc describes a VMA being set up.
> 
> I mean is_cow_mapping(desc->vm_flags) isn't too egregious anyway, so maybe
> just use that for that case?

Yes, I don't think we would need another wrapper.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 15:56                         ` Jason Gunthorpe
@ 2025-09-08 17:36                           ` David Hildenbrand
  2025-09-08 20:24                             ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Stoakes, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On 08.09.25 17:56, Jason Gunthorpe wrote:
> On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:
> 
>> So in practice there is indeed not a big difference between a private and
>> cow mapping.
> 
> Right and most drivers just check SHARED.
> 
> But if we are documenting why they check shared: it is because the
> driver cannot tolerate COW.
> 
> I think if someone is cargo-culting a driver and sees
> 'vma_never_cowable' they will have a better understanding of the
> driver side issues.
> 
> Drivers don't actually care about private vs shared, except this
> indirectly implies something about cow.

I recall some corner cases, but yes, most drivers don't clear VM_MAYWRITE so
is_cow_mapping() would just rule out what they wanted to rule out (no anon
pages / cow semantics).

FWIW, I recalled some VM_MAYWRITE magic in memfd, but it's really just for
!cow mappings, so the following should likely work:

diff --git a/mm/memfd.c b/mm/memfd.c
index 1de610e9f2ea2..2a3aa26444bbb 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -346,14 +346,11 @@ static int check_write_seal(vm_flags_t *vm_flags_ptr)
         vm_flags_t vm_flags = *vm_flags_ptr;
         vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);
  
-       /* If a private mapping then writability is irrelevant. */
-       if (!(mask & VM_SHARED))
+       /* If a CoW mapping then writability is irrelevant. */
+       if (is_cow_mapping(vm_flags))
                 return 0;
  
-       /*
-        * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
-        * write seals are active.
-        */
+       /* New PROT_WRITE mappings are not allowed when write-sealed. */
         if (mask & VM_WRITE)
                 return -EPERM;
  


-- 
Cheers

David / dhildenb


^ permalink raw reply related	[flat|nested] 84+ messages in thread
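David's memfd diff above changes only the gate condition. As a sanity sketch (outside the kernel, with flag values from mm.h and -1 standing in for -EPERM), the old and new gates can be compared directly:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned long vm_flags_t;

#define VM_WRITE	0x00000002UL
#define VM_SHARED	0x00000008UL
#define VM_MAYWRITE	0x00000020UL

static bool is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Old gate: skip the write-seal check for any private mapping. */
static int check_write_seal_old(vm_flags_t vm_flags)
{
	vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);

	if (!(mask & VM_SHARED))
		return 0;
	if (mask & VM_WRITE)
		return -1;	/* -EPERM */
	return 0;
}

/* New gate from the diff above: skip it only for CoW mappings. */
static int check_write_seal_new(vm_flags_t vm_flags)
{
	vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);

	if (is_cow_mapping(vm_flags))
		return 0;
	if (mask & VM_WRITE)
		return -1;	/* -EPERM */
	return 0;
}
```

The two gates differ only for a private mapping with VM_MAYWRITE cleared but VM_WRITE set — a combination Lorenzo argues in the follow-up cannot arise in practice, which is why the patch is behaviour-preserving.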

* Re: [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion
  2025-09-08 15:31     ` Lorenzo Stoakes
@ 2025-09-08 17:38       ` David Hildenbrand
  2025-09-09  9:04         ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-08 17:38 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On 08.09.25 17:31, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 05:19:18PM +0200, David Hildenbrand wrote:
>> On 08.09.25 13:10, Lorenzo Stoakes wrote:
>>> Now we have the f_op->mmap_prepare() hook, having a static function called
>>> __mmap_prepare() that has nothing to do with it is confusing, so rename the
>>> function.
>>>
>>> Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
>>> provide a f_op->mmap_complete() callback.
>>
>> Isn't prologue the opposite of epilogue? :)
> 
> :) well indeed, the prologue comes _first_ and epilogue comes _last_. So we
> rename the bit that comes first
> 
>>
>> I guess I would just have done a
>>
>> __mmap_prepare -> __mmap_setup()
> 
> Sure will rename to __mmap_setup().
> 
>>
>> and left the __mmap_complete() as is.
> 
> But we are adding a 'mmap_complete' hook :)'
> 
> I can think of another sensible name here then if I'm being too abstract here...
> 
> __mmap_finish() or something.

LGTM. I guess it would all be clearer if we could just describe less 
abstractly what is happening. But that would likely imply a bigger rework. 
So setup/finish sounds good.

-- 
Cheers

David / dhildenb


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 17:36                           ` David Hildenbrand
@ 2025-09-08 20:24                             ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-08 20:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 07:36:59PM +0200, David Hildenbrand wrote:
> On 08.09.25 17:56, Jason Gunthorpe wrote:
> > On Mon, Sep 08, 2025 at 05:50:18PM +0200, David Hildenbrand wrote:
> >
> > > So in practice there is indeed not a big difference between a private and
> > > cow mapping.
> >
> > Right and most drivers just check SHARED.
> >
> > But if we are being precise in documenting it, the reason they check
> > SHARED is that the driver cannot tolerate CoW.
> >
> > I think if someone is cargo-culting a driver and sees
> > 'vma_never_cowable' they will have a better understanding of the
> > driver side issues.
> >
> > Drivers don't actually care about private vs shared, except that this
> > indirectly implies something about cow.
>
> I recall some corner cases, but yes, most drivers don't clear VM_MAYWRITE so
> is_cow_mapping() would just rule out what they wanted to rule out (no anon
> pages / cow semantics).
>
> FWIW, I recalled some VM_MAYWRITE magic in memfd, but it's really just for
> !cow mappings, so the following should likely work:

I was involved in these dark arts :)

Since we gate the check_write_seal() function (which is the one that removes
VM_MAYWRITE) on the mapping being shared, we obviously can't remove
VM_MAYWRITE from a private mapping in the first place.

The only other way VM_MAYWRITE could be got rid of is if it were already a
MAP_SHARED or MAP_SHARED_VALIDATE mapping without write permission, and then
it'd fail this check anyway.

So I think the below patch is fine!

>
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 1de610e9f2ea2..2a3aa26444bbb 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -346,14 +346,11 @@ static int check_write_seal(vm_flags_t *vm_flags_ptr)
>         vm_flags_t vm_flags = *vm_flags_ptr;
>         vm_flags_t mask = vm_flags & (VM_SHARED | VM_WRITE);
> -       /* If a private mapping then writability is irrelevant. */
> -       if (!(mask & VM_SHARED))
> +       /* If a CoW mapping then writability is irrelevant. */
> +       if (is_cow_mapping(vm_flags))
>                 return 0;
> -       /*
> -        * New PROT_WRITE and MAP_SHARED mmaps are not allowed when
> -        * write seals are active.
> -        */
> +       /* New PROT_WRITE mappings are not allowed when write-sealed. */
>         if (mask & VM_WRITE)
>                 return -EPERM;

>
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort]
  2025-09-08 11:10 ` [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort] Lorenzo Stoakes
@ 2025-09-08 23:17   ` Randy Dunlap
  2025-09-09  9:02     ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Randy Dunlap @ 2025-09-08 23:17 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

Hi--

On 9/8/25 4:10 AM, Lorenzo Stoakes wrote:
> We have introduced the mmap_complete() and mmap_abort() callbacks, which
> work in conjunction with mmap_prepare(), so describe what they are used for.
> 
> We update both the VFS documentation and the porting guide.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  Documentation/filesystems/porting.rst |  9 +++++++
>  Documentation/filesystems/vfs.rst     | 35 +++++++++++++++++++++++++++
>  2 files changed, 44 insertions(+)
> 

> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 486a91633474..172d36a13e13 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst

> @@ -1236,6 +1240,37 @@ otherwise noted.
>  	file-backed memory mapping, most notably establishing relevant
>  	private state and VMA callbacks.
>  
> +``mmap_complete``
> +	If mmap_prepare is provided, will be invoked after the mapping is fully

s/mmap_prepare/mmap_complete/ ??

> +	established, with the mmap and VMA write locks held.
> +
> +	It is useful for prepopulating VMAs before they may be accessed by
> +	users.
> +
> +	The hook MUST NOT release either the VMA or mmap write locks. This is

You could also do **bold** above:

	The hook **MUST NOT** release ...


> +	asserted by the mmap logic.
> +
> +	If an error is returned by the hook, the VMA is unmapped and the
> +	mmap() operation fails with that error.
> +
> +	It is not valid to specify this hook if mmap_prepare is not also
> +	specified, doing so will result in an error upon mapping.

-- 
~Randy


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare
  2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
  2025-09-08 14:59   ` David Hildenbrand
@ 2025-09-09  3:19   ` Baolin Wang
  2025-09-09  9:08     ` Lorenzo Stoakes
  1 sibling, 1 reply; 84+ messages in thread
From: Baolin Wang @ 2025-09-09  3:19 UTC (permalink / raw)
  To: Lorenzo Stoakes, Andrew Morton
  Cc: Jonathan Corbet, Matthew Wilcox, Guo Ren, Thomas Bogendoerfer,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Uladzislau Rezki,
	Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato,
	linux-doc, linux-kernel, linux-fsdevel, linux-csky, linux-mips,
	linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec,
	kasan-dev, Jason Gunthorpe



On 2025/9/8 19:10, Lorenzo Stoakes wrote:
> This simply assigns the vm_ops so is easily updated - do so.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---

LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>   mm/shmem.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 29e1eb690125..cfc33b99a23a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2950,16 +2950,17 @@ int shmem_lock(struct file *file, int lock, struct ucounts *ucounts)
>   	return retval;
>   }
>   
> -static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
> +static int shmem_mmap_prepare(struct vm_area_desc *desc)
>   {
> +	struct file *file = desc->file;
>   	struct inode *inode = file_inode(file);
>   
>   	file_accessed(file);
>   	/* This is anonymous shared memory if it is unlinked at the time of mmap */
>   	if (inode->i_nlink)
> -		vma->vm_ops = &shmem_vm_ops;
> +		desc->vm_ops = &shmem_vm_ops;
>   	else
> -		vma->vm_ops = &shmem_anon_vm_ops;
> +		desc->vm_ops = &shmem_anon_vm_ops;
>   	return 0;
>   }
>   
> @@ -5229,7 +5230,7 @@ static const struct address_space_operations shmem_aops = {
>   };
>   
>   static const struct file_operations shmem_file_operations = {
> -	.mmap		= shmem_mmap,
> +	.mmap_prepare	= shmem_mmap_prepare,
>   	.open		= shmem_file_open,
>   	.get_unmapped_area = shmem_get_unmapped_area,
>   #ifdef CONFIG_TMPFS


^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
                   ` (16 preceding siblings ...)
  2025-09-08 13:27 ` [PATCH 00/16] expand mmap_prepare functionality, port more users Jan Kara
@ 2025-09-09  8:31 ` Alexander Gordeev
  2025-09-09  8:59   ` Lorenzo Stoakes
  17 siblings, 1 reply; 84+ messages in thread
From: Alexander Gordeev @ 2025-09-09  8:31 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:

Hi Lorenzo,

I am getting this warning with this series applied:

[Tue Sep  9 10:25:34 2025] ------------[ cut here ]------------
[Tue Sep  9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420
[Tue Sep  9 10:25:34 2025] Modules linked in: diag288_wdt(E) watchdog(E) ghash_s390(E) des_generic(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) pkey(E) autofs4(E) overlay(E) squashfs(E) loop(E)
[Tue Sep  9 10:25:34 2025] Unloaded tainted modules: hmac_s390(E):1
[Tue Sep  9 10:25:34 2025] CPU: 0 UID: 0 PID: 563 Comm: makedumpfile Tainted: G            E       6.17.0-rc4-gcc-mmap-00410-g87e982e900f0 #288 PREEMPT 
[Tue Sep  9 10:25:34 2025] Tainted: [E]=UNSIGNED_MODULE
[Tue Sep  9 10:25:34 2025] Hardware name: IBM 8561 T01 703 (LPAR)
[Tue Sep  9 10:25:34 2025] Krnl PSW : 0704d00180000000 00007fffe07f5ef2 (remap_pfn_range_internal+0x372/0x420)
[Tue Sep  9 10:25:34 2025]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
[Tue Sep  9 10:25:34 2025] Krnl GPRS: 0000000004044400 001c0f000188b024 0000000000000000 001c0f000188b022
[Tue Sep  9 10:25:34 2025]            000078000c458120 000078000a0ca800 00000f000188b022 0000000000000711
[Tue Sep  9 10:25:34 2025]            000003ffa6e05000 00000f000188b024 000003ffa6a05000 0000000004044400
[Tue Sep  9 10:25:34 2025]            000003ffa7aadfa8 00007fffe2c35ea0 001c000000000000 00007f7fe0faf000
[Tue Sep  9 10:25:34 2025] Krnl Code: 00007fffe07f5ee6: 47000700                bc      0,1792
                                      00007fffe07f5eea: af000000                mc      0,0
                                     #00007fffe07f5eee: af000000                mc      0,0
                                     >00007fffe07f5ef2: a7f4ff11                brc     15,00007fffe07f5d14
                                      00007fffe07f5ef6: b904002b                lgr     %r2,%r11
                                      00007fffe07f5efa: c0e5000918bb    brasl   %r14,00007fffe0919070
                                      00007fffe07f5f00: a7f4ff39                brc     15,00007fffe07f5d72
                                      00007fffe07f5f04: e320f0c80004    lg      %r2,200(%r15)
[Tue Sep  9 10:25:34 2025] Call Trace:
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5ef2>] remap_pfn_range_internal+0x372/0x420 
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5fd4>] remap_pfn_range_complete+0x34/0x70 
[Tue Sep  9 10:25:34 2025]  [<00007fffe019879e>] remap_oldmem_pfn_range+0x13e/0x1a0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0bd3550>] mmap_complete_vmcore+0x520/0x7b0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe077b05a>] __compat_vma_mmap_prepare+0x3ea/0x550 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0ba27f0>] pde_mmap+0x160/0x1a0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0ba3750>] proc_reg_mmap+0xd0/0x180 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0859904>] __mmap_new_vma+0x444/0x1290 
[Tue Sep  9 10:25:34 2025]  [<00007fffe085b0b4>] __mmap_region+0x964/0x1090 
[Tue Sep  9 10:25:34 2025]  [<00007fffe085dc7e>] mmap_region+0xde/0x250 
[Tue Sep  9 10:25:34 2025]  [<00007fffe08065fc>] do_mmap+0x80c/0xc30 
[Tue Sep  9 10:25:34 2025]  [<00007fffe077c708>] vm_mmap_pgoff+0x218/0x370 
[Tue Sep  9 10:25:34 2025]  [<00007fffe080467e>] ksys_mmap_pgoff+0x2ee/0x400 
[Tue Sep  9 10:25:34 2025]  [<00007fffe0804a3a>] __s390x_sys_old_mmap+0x15a/0x1d0 
[Tue Sep  9 10:25:34 2025]  [<00007fffe29f1cd6>] __do_syscall+0x146/0x410 
[Tue Sep  9 10:25:34 2025]  [<00007fffe2a17e1e>] system_call+0x6e/0x90 
[Tue Sep  9 10:25:34 2025] 2 locks held by makedumpfile/563:
[Tue Sep  9 10:25:34 2025]  #0: 000078000a0caab0 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x16e/0x370
[Tue Sep  9 10:25:34 2025]  #1: 00007fffe3864f50 (vmcore_cb_srcu){.+.+}-{0:0}, at: mmap_complete_vmcore+0x20c/0x7b0
[Tue Sep  9 10:25:34 2025] Last Breaking-Event-Address:
[Tue Sep  9 10:25:34 2025]  [<00007fffe07f5d0e>] remap_pfn_range_internal+0x18e/0x420
[Tue Sep  9 10:25:34 2025] irq event stamp: 19113
[Tue Sep  9 10:25:34 2025] hardirqs last  enabled at (19121): [<00007fffe0391910>] __up_console_sem+0xe0/0x120
[Tue Sep  9 10:25:34 2025] hardirqs last disabled at (19128): [<00007fffe03918f2>] __up_console_sem+0xc2/0x120
[Tue Sep  9 10:25:34 2025] softirqs last  enabled at (4934): [<00007fffe021cb8e>] handle_softirqs+0x70e/0xed0
[Tue Sep  9 10:25:34 2025] softirqs last disabled at (3919): [<00007fffe021b670>] __irq_exit_rcu+0x2e0/0x380
[Tue Sep  9 10:25:34 2025] ---[ end trace 0000000000000000 ]---

Thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 00/16] expand mmap_prepare functionality, port more users
  2025-09-09  8:31 ` Alexander Gordeev
@ 2025-09-09  8:59   ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  8:59 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S . Miller,
	Andreas Larsson, Arnd Bergmann, Greg Kroah-Hartman, Dan Williams,
	Vishal Verma, Dave Jiang, Nicolas Pitre, Muchun Song,
	Oscar Salvador, David Hildenbrand, Konstantin Komarov, Baoquan He,
	Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre, Dave Martin,
	James Morse, Alexander Viro, Christian Brauner, Jan Kara,
	Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Tue, Sep 09, 2025 at 10:31:24AM +0200, Alexander Gordeev wrote:
> On Mon, Sep 08, 2025 at 12:10:31PM +0100, Lorenzo Stoakes wrote:
>
> Hi Lorenzo,
>
> I am getting this warning with this series applied:
>
> [Tue Sep  9 10:25:34 2025] ------------[ cut here ]------------
> [Tue Sep  9 10:25:34 2025] WARNING: CPU: 0 PID: 563 at mm/memory.c:2942 remap_pfn_range_internal+0x36e/0x420

OK yeah this is a very silly error :)

I'm asserting:

		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) == VM_REMAP_FLAGS);

So err.. this should be:

		VM_WARN_ON_ONCE((vma->vm_flags & VM_REMAP_FLAGS) != VM_REMAP_FLAGS);

This was a super late addition to the code and obviously I didn't test this as
well as I did the remap code in general, apologies.

Will fix on respin! :)

Cheers, Lorenzo

> [Tue Sep  9 10:25:34 2025] Modules linked in: diag288_wdt(E) watchdog(E) ghash_s390(E) des_generic(E) prng(E) aes_s390(E) des_s390(E) libdes(E) sha3_512_s390(E) sha3_256_s390(E) sha_common(E) vfio_ccw(E) mdev(E) vfio_iommu_type1(E) vfio(E) pkey(E) autofs4(E) overlay(E) squashfs(E) loop(E)
> [Tue Sep  9 10:25:34 2025] Unloaded tainted modules: hmac_s390(E):1
> [Tue Sep  9 10:25:34 2025] CPU: 0 UID: 0 PID: 563 Comm: makedumpfile Tainted: G            E       6.17.0-rc4-gcc-mmap-00410-g87e982e900f0 #288 PREEMPT
> [Tue Sep  9 10:25:34 2025] Tainted: [E]=UNSIGNED_MODULE
> [Tue Sep  9 10:25:34 2025] Hardware name: IBM 8561 T01 703 (LPAR)
> [Tue Sep  9 10:25:34 2025] Krnl PSW : 0704d00180000000 00007fffe07f5ef2 (remap_pfn_range_internal+0x372/0x420)
> [Tue Sep  9 10:25:34 2025]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> [Tue Sep  9 10:25:34 2025] Krnl GPRS: 0000000004044400 001c0f000188b024 0000000000000000 001c0f000188b022
> [Tue Sep  9 10:25:34 2025]            000078000c458120 000078000a0ca800 00000f000188b022 0000000000000711
> [Tue Sep  9 10:25:34 2025]            000003ffa6e05000 00000f000188b024 000003ffa6a05000 0000000004044400
> [Tue Sep  9 10:25:34 2025]            000003ffa7aadfa8 00007fffe2c35ea0 001c000000000000 00007f7fe0faf000
> [Tue Sep  9 10:25:34 2025] Krnl Code: 00007fffe07f5ee6: 47000700                bc      0,1792
>                                       00007fffe07f5eea: af000000                mc      0,0
>                                      #00007fffe07f5eee: af000000                mc      0,0
>                                      >00007fffe07f5ef2: a7f4ff11                brc     15,00007fffe07f5d14
>                                       00007fffe07f5ef6: b904002b                lgr     %r2,%r11
>                                       00007fffe07f5efa: c0e5000918bb    brasl   %r14,00007fffe0919070
>                                       00007fffe07f5f00: a7f4ff39                brc     15,00007fffe07f5d72
>                                       00007fffe07f5f04: e320f0c80004    lg      %r2,200(%r15)
> [Tue Sep  9 10:25:34 2025] Call Trace:
> [Tue Sep  9 10:25:34 2025]  [<00007fffe07f5ef2>] remap_pfn_range_internal+0x372/0x420
> [Tue Sep  9 10:25:34 2025]  [<00007fffe07f5fd4>] remap_pfn_range_complete+0x34/0x70
> [Tue Sep  9 10:25:34 2025]  [<00007fffe019879e>] remap_oldmem_pfn_range+0x13e/0x1a0
> [Tue Sep  9 10:25:34 2025]  [<00007fffe0bd3550>] mmap_complete_vmcore+0x520/0x7b0
> [Tue Sep  9 10:25:34 2025]  [<00007fffe077b05a>] __compat_vma_mmap_prepare+0x3ea/0x550
> [Tue Sep  9 10:25:34 2025]  [<00007fffe0ba27f0>] pde_mmap+0x160/0x1a0
> [Tue Sep  9 10:25:34 2025]  [<00007fffe0ba3750>] proc_reg_mmap+0xd0/0x180
> [Tue Sep  9 10:25:34 2025]  [<00007fffe0859904>] __mmap_new_vma+0x444/0x1290
> [Tue Sep  9 10:25:34 2025]  [<00007fffe085b0b4>] __mmap_region+0x964/0x1090
> [Tue Sep  9 10:25:34 2025]  [<00007fffe085dc7e>] mmap_region+0xde/0x250
> [Tue Sep  9 10:25:34 2025]  [<00007fffe08065fc>] do_mmap+0x80c/0xc30
> [Tue Sep  9 10:25:34 2025]  [<00007fffe077c708>] vm_mmap_pgoff+0x218/0x370
> [Tue Sep  9 10:25:34 2025]  [<00007fffe080467e>] ksys_mmap_pgoff+0x2ee/0x400
> [Tue Sep  9 10:25:34 2025]  [<00007fffe0804a3a>] __s390x_sys_old_mmap+0x15a/0x1d0
> [Tue Sep  9 10:25:34 2025]  [<00007fffe29f1cd6>] __do_syscall+0x146/0x410
> [Tue Sep  9 10:25:34 2025]  [<00007fffe2a17e1e>] system_call+0x6e/0x90
> [Tue Sep  9 10:25:34 2025] 2 locks held by makedumpfile/563:
> [Tue Sep  9 10:25:34 2025]  #0: 000078000a0caab0 (&mm->mmap_lock){++++}-{3:3}, at: vm_mmap_pgoff+0x16e/0x370
> [Tue Sep  9 10:25:34 2025]  #1: 00007fffe3864f50 (vmcore_cb_srcu){.+.+}-{0:0}, at: mmap_complete_vmcore+0x20c/0x7b0
> [Tue Sep  9 10:25:34 2025] Last Breaking-Event-Address:
> [Tue Sep  9 10:25:34 2025]  [<00007fffe07f5d0e>] remap_pfn_range_internal+0x18e/0x420
> [Tue Sep  9 10:25:34 2025] irq event stamp: 19113
> [Tue Sep  9 10:25:34 2025] hardirqs last  enabled at (19121): [<00007fffe0391910>] __up_console_sem+0xe0/0x120
> [Tue Sep  9 10:25:34 2025] hardirqs last disabled at (19128): [<00007fffe03918f2>] __up_console_sem+0xc2/0x120
> [Tue Sep  9 10:25:34 2025] softirqs last  enabled at (4934): [<00007fffe021cb8e>] handle_softirqs+0x70e/0xed0
> [Tue Sep  9 10:25:34 2025] softirqs last disabled at (3919): [<00007fffe021b670>] __irq_exit_rcu+0x2e0/0x380
> [Tue Sep  9 10:25:34 2025] ---[ end trace 0000000000000000 ]---
>
> Thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort]
  2025-09-08 23:17   ` Randy Dunlap
@ 2025-09-09  9:02     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:02 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Baolin Wang, Uladzislau Rezki, Dmitry Vyukov,
	Andrey Konovalov, Jann Horn, Pedro Falcato, linux-doc,
	linux-kernel, linux-fsdevel, linux-csky, linux-mips, linux-s390,
	sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec, kasan-dev,
	Jason Gunthorpe

On Mon, Sep 08, 2025 at 04:17:16PM -0700, Randy Dunlap wrote:
> Hi--
>
> On 9/8/25 4:10 AM, Lorenzo Stoakes wrote:
> > We have introduced the mmap_complete() and mmap_abort() callbacks, which
> > work in conjunction with mmap_prepare(), so describe what they are used for.
> >
> > We update both the VFS documentation and the porting guide.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  Documentation/filesystems/porting.rst |  9 +++++++
> >  Documentation/filesystems/vfs.rst     | 35 +++++++++++++++++++++++++++
> >  2 files changed, 44 insertions(+)
> >
>
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 486a91633474..172d36a13e13 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
>
> > @@ -1236,6 +1240,37 @@ otherwise noted.
> >  	file-backed memory mapping, most notably establishing relevant
> >  	private state and VMA callbacks.
> >
> > +``mmap_complete``
> > +	If mmap_prepare is provided, will be invoked after the mapping is fully
>
> s/mmap_prepare/mmap_complete/ ??

Yes indeed sorry! Will fix on respin.

>
> > +	established, with the mmap and VMA write locks held.
> > +
> > +	It is useful for prepopulating VMAs before they may be accessed by
> > +	users.
> > +
> > +	The hook MUST NOT release either the VMA or mmap write locks. This is
>
> You could also do **bold** above:
>
> 	The hook **MUST NOT** release ...
>
>

Ack will do!

> > +	asserted by the mmap logic.
> > +
> > +	If an error is returned by the hook, the VMA is unmapped and the
> > +	mmap() operation fails with that error.
> > +
> > +	It is not valid to specify this hook if mmap_prepare is not also
> > +	specified, doing so will result in an error upon mapping.
>
> --
> ~Randy
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion
  2025-09-08 17:38       ` David Hildenbrand
@ 2025-09-09  9:04         ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:04 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 07:38:57PM +0200, David Hildenbrand wrote:
> On 08.09.25 17:31, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 05:19:18PM +0200, David Hildenbrand wrote:
> > > On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > > > Now we have the f_op->mmap_prepare() hook, having a static function called
> > > > __mmap_prepare() that has nothing to do with it is confusing, so rename the
> > > > function.
> > > >
> > > > Additionally rename __mmap_complete() to __mmap_epilogue(), as we intend to
> > > > provide a f_op->mmap_complete() callback.
> > >
> > > Isn't prologue the opposite of epilogue? :)
> >
> > :) well indeed, the prologue comes _first_ and epilogue comes _last_. So we
> > rename the bit that comes first
> >
> > >
> > > I guess I would just have done a
> > >
> > > __mmap_prepare -> __mmap_setup()
> >
> > Sure will rename to __mmap_setup().
> >
> > >
> > > and left the __mmap_complete() as is.
> >
> > But we are adding a 'mmap_complete' hook :)
> >
> > I can think of another sensible name then if I'm being too abstract here...
> >
> > __mmap_finish() or something.
>
> LGTM. I guess it would all be clearer if we could just describe less
> abstractly what is happening. But that would likely imply a bigger rework. So
> setup/finish sounds good.

Ack will fix on respin!

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare
  2025-09-09  3:19   ` Baolin Wang
@ 2025-09-09  9:08     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:08 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	Hugh Dickins, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov,
	Jann Horn, Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel,
	linux-csky, linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl,
	linux-mm, ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Tue, Sep 09, 2025 at 11:19:16AM +0800, Baolin Wang wrote:
>
>
> On 2025/9/8 19:10, Lorenzo Stoakes wrote:
> > This simply assigns the vm_ops so is easily updated - do so.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
>
> LGTM.
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

Thanks!

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 15:27   ` David Hildenbrand
@ 2025-09-09  9:13     ` Lorenzo Stoakes
  2025-09-09  9:26       ` David Hildenbrand
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 08, 2025 at 05:27:37PM +0200, David Hildenbrand wrote:
> On 08.09.25 13:10, Lorenzo Stoakes wrote:
> > We have introduced the f_op->mmap_prepare hook to allow for setting up a
> > VMA far earlier in the process of mapping memory, reducing problematic
> > error handling paths, but this does not provide what all
> > drivers/filesystems need.
> >
> > In order to supply this, and to be able to move forward with removing
> > f_op->mmap altogether, introduce f_op->mmap_complete.
> >
> > This hook is called once the VMA is fully mapped and everything is done,
> > however with the mmap write lock and VMA write locks held.
> >
> > The hook is then provided with a fully initialised VMA which it can do what
> > it needs with, though the mmap and VMA write locks must remain held
> > throughout.
> >
> > It is not intended that the VMA be modified at this point; attempts to do
> > so will end in tears.
> >
> > This allows for operations such as pre-population typically via a remap, or
> > really anything that requires access to the VMA once initialised.
> >
> > In addition, a caller may need to take a lock in mmap_prepare, when it is
> > possible to modify the VMA, and release it on mmap_complete. In order to
> > handle errors which may arise between the two operations, f_op->mmap_abort
> > is provided.
> >
> > This hook should be used to drop any lock and clean up anything before the
> > VMA mapping operation is aborted. After this point the VMA will not be
> > added to any mapping and will not exist.
> >
> > We also add a new mmap_context field to the vm_area_desc type which can be
> > used to pass information pertinent to any locks which are held or any state
> > which is required for mmap_complete, abort to operate correctly.
> >
> > We also update the compatibility layer for nested filesystems which
> > currently still only specify an f_op->mmap() handler so that it correctly
> > invokes f_op->mmap_complete as necessary (note that no error can occur
> > between mmap_prepare and mmap_complete so mmap_abort will never be called
> > in this case).
> >
> > Also update the VMA tests to account for the changes.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >   include/linux/fs.h               |  4 ++
> >   include/linux/mm_types.h         |  5 ++
> >   mm/util.c                        | 18 +++++--
> >   mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
> >   tools/testing/vma/vma_internal.h | 31 ++++++++++--
> >   5 files changed, 129 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 594bd4d0521e..bb432924993a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2195,6 +2195,10 @@ struct file_operations {
> >   	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
> >   				unsigned int poll_flags);
> >   	int (*mmap_prepare)(struct vm_area_desc *);
> > +	int (*mmap_complete)(struct file *, struct vm_area_struct *,
> > +			     const void *context);
> > +	void (*mmap_abort)(const struct file *, const void *vm_private_data,
> > +			   const void *context);
>
> Do we have a description somewhere what these things do, when they are
> called, and what a driver may be allowed to do with a VMA?

Yeah there's a doc patch that follows this.

>
> In particular, the mmap_complete() looks like another candidate for letting
> a driver just go crazy on the vma? :)

Well there's only so much we can do. In an ideal world we'd treat VMAs as
entirely internal data structures and pass some sort of opaque thing around, but
we have to keep things real here :)

So the main purpose of these changes is not so much to be as ambitious as
_that_, but to only provide the VMA _when it's safe to do so_.

Before we were providing a pointer to an incompletely-initialised VMA that was
not yet in the maple tree, with which the driver could do _anything_, and then
afterwards have:

a. a bunch of stuff left to do with a VMA that might be in some broken state due
   to drivers.
b. (the really bad case) have error paths to handle because the driver returned
   an error, but did who-knows-what with the VMA and page tables.

So we address this by:

1. mmap_prepare being done _super early_ and _not_ providing a VMA. We
   essentially ask the driver 'hey what do you want these fields that you are
   allowed to change in the VMA to be?'

2. mmap_complete being done _super_ late, essentially just before we release the
   VMA/mmap locks. If an error arises - we can just unmap it, easy. And then
   there's a lot less damage the driver can do.

I think it's probably the most sensible means of doing something about the
legacy we have where we've been rather too 'free and easy' with allowing drivers
to do whatever.
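
To make that ordering concrete, here's a minimal userspace sketch of the flow
(mock structs and invented helper names, not the real kernel types — just an
illustration): .mmap_prepare() only fills in a descriptor, the core copies the
permitted fields across and inserts the VMA, and .mmap_complete() runs last,
once there's nothing left to unwind:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long vm_flags_t;

struct vm_area_desc {            /* what .mmap_prepare() sees: no VMA yet */
	unsigned long start, end;
	vm_flags_t vm_flags;     /* driver may adjust these requested fields */
	void *private_data;
	void *mmap_context;      /* handed on to .mmap_complete()/.mmap_abort() */
};

struct vm_area_struct {          /* what .mmap_complete() sees: fully set up */
	unsigned long vm_start, vm_end;
	vm_flags_t vm_flags;
	void *vm_private_data;
};

/* The core copies only the whitelisted fields out of the descriptor. */
static void set_vma_from_desc(struct vm_area_struct *vma,
			      const struct vm_area_desc *desc)
{
	vma->vm_flags = desc->vm_flags;
	vma->vm_private_data = desc->private_data;
}

/* A driver's hooks: prepare only states wishes, complete prepopulates. */
static int drv_prepare(struct vm_area_desc *desc)
{
	desc->vm_flags |= 0x4;            /* e.g. request a VM_* flag */
	desc->mmap_context = (void *)0x1; /* state for the complete stage */
	return 0;
}

static int drv_complete(struct vm_area_struct *vma, void *context)
{
	(void)vma;
	/* safe point: VMA is in the tree; on error the core just unmaps */
	return context == (void *)0x1 ? 0 : -1;
}

/* Core-side ordering: prepare super early, insert, complete super late. */
static int mmap_flow(struct vm_area_struct *vma)
{
	struct vm_area_desc desc = { .start = 0, .end = 4096 };
	int err = drv_prepare(&desc);

	if (err)
		return err;
	set_vma_from_desc(vma, &desc);
	vma->vm_start = desc.start;
	vma->vm_end = desc.end;
	/* ... VMA inserted into the maple tree here ... */
	return drv_complete(vma, desc.mmap_context);
}
```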

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers
  2025-09-08 17:30                   ` David Hildenbrand
@ 2025-09-09  9:21                     ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Jason Gunthorpe, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev

On Mon, Sep 08, 2025 at 07:30:34PM +0200, David Hildenbrand wrote:
> On 08.09.25 17:35, Lorenzo Stoakes wrote:
> > On Mon, Sep 08, 2025 at 05:07:57PM +0200, David Hildenbrand wrote:
> > > On 08.09.25 16:47, Lorenzo Stoakes wrote:
> > > > On Mon, Sep 08, 2025 at 11:20:11AM -0300, Jason Gunthorpe wrote:
> > > > > On Mon, Sep 08, 2025 at 03:09:43PM +0100, Lorenzo Stoakes wrote:
> > > > > > > Perhaps
> > > > > > >
> > > > > > > !vma_desc_cowable()
> > > > > > >
> > > > > > > Is what many drivers are really trying to assert.
> > > > > >
> > > > > > Well no, because:
> > > > > >
> > > > > > static inline bool is_cow_mapping(vm_flags_t flags)
> > > > > > {
> > > > > > 	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
> > > > > > }
> > > > > >
> > > > > > Read-only means !CoW.
> > > > >
> > > > > What drivers want when they check SHARED is to prevent COW. It is COW
> > > > > that causes problems for whatever the driver is doing, so calling the
> > > > > helper cowable and making the test actually right for it is a good thing.
> > > > >
> > > > > COW of this VMA, and no possibilty to remap/mprotect/fork/etc it into
> > > > > something that is COW in future.
> > > >
> > > > But you can't do that if !VM_MAYWRITE.
> > > >
> > > > I mean probably the driver's just wrong and should use is_cow_mapping() tbh.
> > > >
> > > > >
> > > > > Drivers have commonly various things with VM_SHARED to establish !COW,
> > > > > but if that isn't actually right then lets fix it to be clear and
> > > > > correct.
> > > >
> > > > I think we need to be cautious of scope here :) I don't want to accidentally
> > > > break things this way.
> > > >
> > > > OK I think a sensible way forward - How about I add desc_is_cowable() or
> > > > vma_desc_cowable() and only set this if I'm confident it's correct?
> > >
> > > I'll note that the naming is bad.
> > >
> > > Why?
> > >
> > > Because the vma_desc is not cowable. The underlying mapping maybe is.
> >
> > Right, but the vma_desc describes a VMA being set up.
> >
> > I mean is_cow_mapping(desc->vm_flags) isn't too egregious anyway, so maybe
> > just use that for that case?
>
> Yes, I don't think we would need another wrapper.

Ack, will use this in favour of a wrapper.
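
For reference, a self-contained illustration of that check (these are the real
values of the two flag bits, and is_cow_mapping() is as quoted above, but
reject_cow() is a hypothetical driver-side .mmap_prepare()-style check, not
existing kernel code):

```c
#include <assert.h>

typedef unsigned long vm_flags_t;
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

/* CoW == private (not shared) but potentially writable. */
static int is_cow_mapping(vm_flags_t flags)
{
	return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* e.g. early in a driver's .mmap_prepare(), on desc->vm_flags: */
static int reject_cow(vm_flags_t desc_vm_flags)
{
	if (is_cow_mapping(desc_vm_flags))
		return -22; /* -EINVAL */
	return 0;
}
```

Note the read-only case: with VM_MAYWRITE clear the mapping can never become
CoW, regardless of VM_SHARED — which is the "Read-only means !CoW" point above.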

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-09  9:13     ` Lorenzo Stoakes
@ 2025-09-09  9:26       ` David Hildenbrand
  2025-09-09  9:37         ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: David Hildenbrand @ 2025-09-09  9:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On 09.09.25 11:13, Lorenzo Stoakes wrote:
> On Mon, Sep 08, 2025 at 05:27:37PM +0200, David Hildenbrand wrote:
>> On 08.09.25 13:10, Lorenzo Stoakes wrote:
>>> We have introduced the f_op->mmap_prepare hook to allow for setting up a
>>> VMA far earlier in the process of mapping memory, reducing problematic
>>> error handling paths, but this does not provide what all
>>> drivers/filesystems need.
>>>
>>> In order to supply this, and to be able to move forward with removing
>>> f_op->mmap altogether, introduce f_op->mmap_complete.
>>>
>>> This hook is called once the VMA is fully mapped and everything is done,
>>> however with the mmap write lock and VMA write locks held.
>>>
>>> The hook is then provided with a fully initialised VMA which it can do what
>>> it needs with, though the mmap and VMA write locks must remain held
>>> throughout.
>>>
>>> It is not intended that the VMA be modified at this point, attempts to do
>>> so will end in tears.
>>>
>>> This allows for operations such as pre-population typically via a remap, or
>>> really anything that requires access to the VMA once initialised.
>>>
>>> In addition, a caller may need to take a lock in mmap_prepare, when it is
>>> possible to modify the VMA, and release it on mmap_complete. In order to
>>> handle errors which may arise between the two operations, f_op->mmap_abort
>>> is provided.
>>>
>>> This hook should be used to drop any lock and clean up anything before the
>>> VMA mapping operation is aborted. After this point the VMA will not be
>>> added to any mapping and will not exist.
>>>
>>> We also add a new mmap_context field to the vm_area_desc type which can be
>>> used to pass information pertinent to any locks which are held or any state
>>> which is required for mmap_complete and mmap_abort to operate correctly.
>>>
>>> We also update the compatibility layer for nested filesystems which
>>> currently still only specify an f_op->mmap() handler so that it correctly
>>> invokes f_op->mmap_complete as necessary (note that no error can occur
>>> between mmap_prepare and mmap_complete so mmap_abort will never be called
>>> in this case).
>>>
>>> Also update the VMA tests to account for the changes.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>    include/linux/fs.h               |  4 ++
>>>    include/linux/mm_types.h         |  5 ++
>>>    mm/util.c                        | 18 +++++--
>>>    mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
>>>    tools/testing/vma/vma_internal.h | 31 ++++++++++--
>>>    5 files changed, 129 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>>> index 594bd4d0521e..bb432924993a 100644
>>> --- a/include/linux/fs.h
>>> +++ b/include/linux/fs.h
>>> @@ -2195,6 +2195,10 @@ struct file_operations {
>>>    	int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>>>    				unsigned int poll_flags);
>>>    	int (*mmap_prepare)(struct vm_area_desc *);
>>> +	int (*mmap_complete)(struct file *, struct vm_area_struct *,
>>> +			     const void *context);
>>> +	void (*mmap_abort)(const struct file *, const void *vm_private_data,
>>> +			   const void *context);
>>
>> Do we have a description somewhere what these things do, when they are
>> called, and what a driver may be allowed to do with a VMA?
> 
> Yeah there's a doc patch that follows this.

Yeah, spotted that afterwards.

> 
>>
>> In particular, the mmap_complete() looks like another candidate for letting
>> a driver just go crazy on the vma? :)
> 
> Well there's only so much we can do. In an ideal world we'd treat VMAs as
> entirely internal data structures and pass some sort of opaque thing around, but
> we have to keep things real here :)

Right, we'd pass something around that cannot be easily abused (like 
modifying random vma flags in mmap_complete).

So I was wondering if most operations that driver would perform during 
the mmap_complete() could be abstracted, and only those then be 
called with whatever opaque thing we return here.

But I have no feeling about what crazy things a driver might do. Just 
calling remap_pfn_range() would be easy, for example, and we could 
abstract that.

-- 
Cheers

David / dhildenb



* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-09  9:26       ` David Hildenbrand
@ 2025-09-09  9:37         ` Lorenzo Stoakes
  2025-09-09 16:43           ` Suren Baghdasaryan
  0 siblings, 1 reply; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09  9:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Hugh Dickins, Baolin Wang,
	Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov, Jann Horn,
	Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel, linux-csky,
	linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm,
	ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Tue, Sep 09, 2025 at 11:26:21AM +0200, David Hildenbrand wrote:
> > >
> > > In particular, the mmap_complete() looks like another candidate for letting
> > > a driver just go crazy on the vma? :)
> >
> > Well there's only so much we can do. In an ideal world we'd treat VMAs as
> > entirely internal data structures and pass some sort of opaque thing around, but
> > we have to keep things real here :)
>
> Right, we'd pass something around that cannot be easily abused (like
> modifying random vma flags in mmap_complete).
>
> So I was wondering if most operations that driver would perform during the
> mmap_complete() could be abstracted, and only those then be called with
> whatever opaque thing we return here.

Well there's 2 issues at play:

1. I might end up having to rewrite _large parts_ of kernel functionality all of
   which relies on there being a vma parameter (or might find that to be
   intractable).

2. There's always the 'odd ones out' :) so there'll be some drivers that
   absolutely do need to have access to this.

But as I was writing this I thought of an idea - why don't we have something
opaque like this, perhaps with accessor functions, but then _give the ability to
get the VMA if you REALLY have to_.

That way we can handle both problems without too much trouble.

Also Jason suggested generic functions that can just be assigned to
.mmap_complete for instance, which would obviously eliminate the crazy
factor a lot too.

I'm going to refactor to try to put ONLY prepopulate logic in
.mmap_complete where possible which fits with all of this.
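
Roughly what that looks like, as a userspace sketch with invented names (not
real kernel API) — one shared helper that does only prepopulation, which many
drivers can assign directly as their hook instead of each poking at the VMA:

```c
#include <assert.h>
#include <stddef.h>

struct vma_stub { unsigned long start, end; int populated; };

typedef int (*mmap_complete_t)(struct vma_stub *vma, void *context);

/* One generic helper, reusable by many drivers as their .mmap_complete. */
static int generic_mmap_prepopulate(struct vma_stub *vma, void *context)
{
	(void)context;
	/* stands in for e.g. a remap_pfn_range()-style prepopulation */
	vma->populated = 1;
	return 0;
}

struct fops_stub { mmap_complete_t mmap_complete; };

/* A driver opts in simply by pointing at the shared helper. */
static const struct fops_stub my_driver_fops = {
	.mmap_complete = generic_mmap_prepopulate,
};
```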

>
> But I have no feeling about what crazy things a driver might do. Just
> calling remap_pfn_range() would be easy, for example, and we could abstract
> that.

Yeah, I've obviously already added some wrappers for these.

BTW I really really hate that STUPID ->vm_pgoff hack, if not for that, life
would be much simpler.

But instead now we need to specify PFN in the damn remap prepare wrapper in
case of CoW. God.
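
For anyone not familiar with the hack in question, a simplified stand-in (not
the real remap_pfn_range() code, just the shape of it): for a private/CoW
mapping covering the whole VMA, the base PFN gets stashed in ->vm_pgoff so it
can be recovered later, clobbering its file-offset meaning — which is why a
prepare-time wrapper has to be told the PFN explicitly instead.

```c
#include <assert.h>

struct vma_stub {
	unsigned long vm_start, vm_end;
	unsigned long vm_pgoff;   /* normally a file offset, hijacked below */
	int cow;
};

/* The hack: on a CoW full-VMA remap, vm_pgoff becomes PFN storage. */
static void remap_cow_hack(struct vma_stub *vma, unsigned long pfn)
{
	if (vma->cow)
		vma->vm_pgoff = pfn;
}

/* Recovery later relies on the stash having happened. */
static unsigned long recover_pfn(const struct vma_stub *vma)
{
	return vma->vm_pgoff;
}
```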

>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo


* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-09  9:37         ` Lorenzo Stoakes
@ 2025-09-09 16:43           ` Suren Baghdasaryan
  2025-09-09 17:36             ` Lorenzo Stoakes
  0 siblings, 1 reply; 84+ messages in thread
From: Suren Baghdasaryan @ 2025-09-09 16:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki,
	Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato,
	linux-doc, linux-kernel, linux-fsdevel, linux-csky, linux-mips,
	linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec,
	kasan-dev, Jason Gunthorpe

On Tue, Sep 9, 2025 at 2:37 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Sep 09, 2025 at 11:26:21AM +0200, David Hildenbrand wrote:
> > > >
> > > > In particular, the mmap_complete() looks like another candidate for letting
> > > > a driver just go crazy on the vma? :)
> > >
> > > Well there's only so much we can do. In an ideal world we'd treat VMAs as
> > > entirely internal data structures and pass some sort of opaque thing around, but
> > > we have to keep things real here :)
> >
> > Right, we'd pass something around that cannot be easily abused (like
> > modifying random vma flags in mmap_complete).
> >
> > So I was wondering if most operations that driver would perform during the
> > mmap_complete() could be abstracted, and only those then be called with
> > whatever opaque thing we return here.
>
> Well there's 2 issues at play:
>
> 1. I might end up having to rewrite _large parts_ of kernel functionality all of
>    which relies on there being a vma parameter (or might find that to be
>    intractable).
>
> 2. There's always the 'odd ones out' :) so there'll be some drivers that
>    absolutely do need to have access to this.
>
> But as I was writing this I thought of an idea - why don't we have something
> opaque like this, perhaps with accessor functions, but then _give the ability to
> get the VMA if you REALLY have to_.
>
> That way we can handle both problems without too much trouble.
>
> Also Jason suggested generic functions that can just be assigned to
> .mmap_complete for instance, which would obviously eliminate the crazy
> factor a lot too.
>
> I'm going to refactor to try to put ONLY prepopulate logic in
> .mmap_complete where possible which fits with all of this.

Thinking along these lines, do you have a case when mmap_abort() needs
vm_private_data? I was thinking if VMA mapping failed, why would you
need vm_private_data to unwind prep work? You already have the context
pointer for that, no?

>
> >
> > But I have no feeling about what crazy things a driver might do. Just
> > calling remap_pfn_range() would be easy, for example, and we could abstract
> > that.
>
> Yeah, I've obviously already added some wrappers for these.
>
> BTW I really really hate that STUPID ->vm_pgoff hack, if not for that, life
> would be much simpler.
>
> But instead now we need to specify PFN in the damn remap prepare wrapper in
> case of CoW. God.
>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Cheers, Lorenzo


* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
  2025-09-08 12:55   ` Jason Gunthorpe
  2025-09-08 15:27   ` David Hildenbrand
@ 2025-09-09 16:44   ` Suren Baghdasaryan
  2 siblings, 0 replies; 84+ messages in thread
From: Suren Baghdasaryan @ 2025-09-09 16:44 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Andrew Morton, Jonathan Corbet, Matthew Wilcox, Guo Ren,
	Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, David Hildenbrand,
	Konstantin Komarov, Baoquan He, Vivek Goyal, Dave Young,
	Tony Luck, Reinette Chatre, Dave Martin, James Morse,
	Alexander Viro, Christian Brauner, Jan Kara, Liam R . Howlett,
	Vlastimil Babka, Mike Rapoport, Michal Hocko, Hugh Dickins,
	Baolin Wang, Uladzislau Rezki, Dmitry Vyukov, Andrey Konovalov,
	Jann Horn, Pedro Falcato, linux-doc, linux-kernel, linux-fsdevel,
	linux-csky, linux-mips, linux-s390, sparclinux, nvdimm, linux-cxl,
	linux-mm, ntfs3, kexec, kasan-dev, Jason Gunthorpe

On Mon, Sep 8, 2025 at 4:11 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> We have introduced the f_op->mmap_prepare hook to allow for setting up a
> VMA far earlier in the process of mapping memory, reducing problematic
> error handling paths, but this does not provide what all
> drivers/filesystems need.
>
> In order to supply this, and to be able to move forward with removing
> f_op->mmap altogether, introduce f_op->mmap_complete.
>
> This hook is called once the VMA is fully mapped and everything is done,
> however with the mmap write lock and VMA write locks held.
>
> The hook is then provided with a fully initialised VMA which it can do what
> it needs with, though the mmap and VMA write locks must remain held
> throughout.
>
> It is not intended that the VMA be modified at this point, attempts to do
> so will end in tears.
>
> This allows for operations such as pre-population typically via a remap, or
> really anything that requires access to the VMA once initialised.
>
> In addition, a caller may need to take a lock in mmap_prepare, when it is
> possible to modify the VMA, and release it on mmap_complete. In order to
> handle errors which may arise between the two operations, f_op->mmap_abort
> is provided.
>
> This hook should be used to drop any lock and clean up anything before the
> VMA mapping operation is aborted. After this point the VMA will not be
> added to any mapping and will not exist.
>
> We also add a new mmap_context field to the vm_area_desc type which can be
> used to pass information pertinent to any locks which are held or any state
> which is required for mmap_complete and mmap_abort to operate correctly.
>
> We also update the compatibility layer for nested filesystems which
> currently still only specify an f_op->mmap() handler so that it correctly
> invokes f_op->mmap_complete as necessary (note that no error can occur
> between mmap_prepare and mmap_complete so mmap_abort will never be called
> in this case).
>
> Also update the VMA tests to account for the changes.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/fs.h               |  4 ++
>  include/linux/mm_types.h         |  5 ++
>  mm/util.c                        | 18 +++++--
>  mm/vma.c                         | 82 ++++++++++++++++++++++++++++++--
>  tools/testing/vma/vma_internal.h | 31 ++++++++++--
>  5 files changed, 129 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 594bd4d0521e..bb432924993a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2195,6 +2195,10 @@ struct file_operations {
>         int (*uring_cmd_iopoll)(struct io_uring_cmd *, struct io_comp_batch *,
>                                 unsigned int poll_flags);
>         int (*mmap_prepare)(struct vm_area_desc *);
> +       int (*mmap_complete)(struct file *, struct vm_area_struct *,
> +                            const void *context);
> +       void (*mmap_abort)(const struct file *, const void *vm_private_data,
> +                          const void *context);
>  } __randomize_layout;
>
>  /* Supports async buffered reads */
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index cf759fe08bb3..052db1f31fb3 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -793,6 +793,11 @@ struct vm_area_desc {
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
> +       /*
> +        * A user-defined field, value will be passed to mmap_complete,
> +        * mmap_abort.
> +        */
> +       void *mmap_context;
>  };
>
>  /*
> diff --git a/mm/util.c b/mm/util.c
> index 248f877f629b..f5bcac140cb9 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -1161,17 +1161,26 @@ int __compat_vma_mmap_prepare(const struct file_operations *f_op,
>         err = f_op->mmap_prepare(&desc);
>         if (err)
>                 return err;
> +
>         set_vma_from_desc(vma, &desc);
>
> -       return 0;
> +       /*
> +        * No error can occur between mmap_prepare() and mmap_complete so no
> +        * need to invoke mmap_abort().
> +        */
> +
> +       if (f_op->mmap_complete)
> +               err = f_op->mmap_complete(file, vma, desc.mmap_context);
> +
> +       return err;
>  }
>  EXPORT_SYMBOL(__compat_vma_mmap_prepare);
>
>  /**
>   * compat_vma_mmap_prepare() - Apply the file's .mmap_prepare() hook to an
> - * existing VMA.
> + * existing VMA and invoke .mmap_complete() if provided.
>   * @file: The file which possesss an f_op->mmap_prepare() hook.

nit: possesss seems to be misspelled. Maybe we can fix it here as well?

> - * @vma: The VMA to apply the .mmap_prepare() hook to.
> + * @vma: The VMA to apply the hooks to.
>   *
>   * Ordinarily, .mmap_prepare() is invoked directly upon mmap(). However, certain
>   * stacked filesystems invoke a nested mmap hook of an underlying file.
> @@ -1188,6 +1197,9 @@ EXPORT_SYMBOL(__compat_vma_mmap_prepare);
>   * establishes a struct vm_area_desc descriptor, passes to the underlying
>   * .mmap_prepare() hook and applies any changes performed by it.
>   *
> + * If the relevant hooks are provided, it also invokes .mmap_complete() upon
> + * successful completion.
> + *
>   * Once the conversion of filesystems is complete this function will no longer
>   * be required and will be removed.
>   *
> diff --git a/mm/vma.c b/mm/vma.c
> index 0efa4288570e..a0b568fe9e8d 100644
> --- a/mm/vma.c
> +++ b/mm/vma.c
> @@ -22,6 +22,7 @@ struct mmap_state {
>         /* User-defined fields, perhaps updated by .mmap_prepare(). */
>         const struct vm_operations_struct *vm_ops;
>         void *vm_private_data;
> +       void *mmap_context;
>
>         unsigned long charged;
>
> @@ -2343,6 +2344,23 @@ static int __mmap_prelude(struct mmap_state *map, struct list_head *uf)
>         int error;
>         struct vma_iterator *vmi = map->vmi;
>         struct vma_munmap_struct *vms = &map->vms;
> +       struct file *file = map->file;
> +
> +       if (file) {
> +               /* f_op->mmap_complete requires f_op->mmap_prepare. */
> +               if (file->f_op->mmap_complete && !file->f_op->mmap_prepare)
> +                       return -EINVAL;
> +
> +               /*
> +                * It's not valid to provide an f_op->mmap_abort hook without also
> +                * providing the f_op->mmap_prepare and f_op->mmap_complete hooks it is
> +                * used with.
> +                */
> +               if (file->f_op->mmap_abort &&
> +                    (!file->f_op->mmap_prepare ||
> +                     !file->f_op->mmap_complete))
> +                       return -EINVAL;
> +       }
>
>         /* Find the first overlapping VMA and initialise unmap state. */
>         vms->vma = vma_find(vmi, map->end);
> @@ -2595,6 +2613,7 @@ static int call_mmap_prepare(struct mmap_state *map)
>         /* User-defined fields. */
>         map->vm_ops = desc.vm_ops;
>         map->vm_private_data = desc.private_data;
> +       map->mmap_context = desc.mmap_context;
>
>         return 0;
>  }
> @@ -2636,16 +2655,61 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
>         return false;
>  }
>
> +/*
> + * Invoke the f_op->mmap_complete hook, providing it with a fully initialised
> + * VMA to operate upon.
> + *
> + * The mmap and VMA write locks must be held prior to and after the hook has
> + * been invoked.
> + */
> +static int call_mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> +{
> +       struct file *file = map->file;
> +       void *context = map->mmap_context;
> +       int error;
> +       size_t len;
> +
> +       if (!file || !file->f_op->mmap_complete)
> +               return 0;
> +
> +       error = file->f_op->mmap_complete(file, vma, context);
> +       /* The hook must NOT drop the write locks. */
> +       vma_assert_write_locked(vma);
> +       mmap_assert_write_locked(current->mm);
> +       if (!error)
> +               return 0;
> +
> +       /*
> +        * If an error occurs, unmap the VMA altogether and return an error. We
> +        * only clear the newly allocated VMA, since this function is only
> +        * invoked if we do NOT merge, so we only clean up the VMA we created.
> +        */
> +       len = vma_pages(vma) << PAGE_SHIFT;
> +       do_munmap(current->mm, vma->vm_start, len, NULL);
> +       return error;
> +}
> +
> +static void call_mmap_abort(struct mmap_state *map)
> +{
> +       struct file *file = map->file;
> +       void *vm_private_data = map->vm_private_data;
> +
> +       VM_WARN_ON_ONCE(!file || !file->f_op);
> +       file->f_op->mmap_abort(file, vm_private_data, map->mmap_context);
> +}
> +
>  static unsigned long __mmap_region(struct file *file, unsigned long addr,
>                 unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
>                 struct list_head *uf)
>  {
> -       struct mm_struct *mm = current->mm;
> -       struct vm_area_struct *vma = NULL;
> -       int error;
>         bool have_mmap_prepare = file && file->f_op->mmap_prepare;
> +       bool have_mmap_abort = file && file->f_op->mmap_abort;
> +       struct mm_struct *mm = current->mm;
>         VMA_ITERATOR(vmi, mm, addr);
>         MMAP_STATE(map, mm, &vmi, addr, len, pgoff, vm_flags, file);
> +       struct vm_area_struct *vma = NULL;
> +       bool allocated_new = false;
> +       int error;
>
>         map.check_ksm_early = can_set_ksm_flags_early(&map);
>
> @@ -2668,8 +2732,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
>         /* ...but if we can't, allocate a new VMA. */
>         if (!vma) {
>                 error = __mmap_new_vma(&map, &vma);
> -               if (error)
> +               if (error) {
> +                       if (have_mmap_abort)
> +                               call_mmap_abort(&map);
>                         goto unacct_error;
> +               }
> +               allocated_new = true;
>         }
>
>         if (have_mmap_prepare)
> @@ -2677,6 +2745,12 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
>
>         __mmap_epilogue(&map, vma);
>
> +       if (allocated_new) {
> +               error = call_mmap_complete(&map, vma);
> +               if (error)
> +                       return error;
> +       }
> +
>         return addr;
>
>         /* Accounting was done by __mmap_prelude(). */
> diff --git a/tools/testing/vma/vma_internal.h b/tools/testing/vma/vma_internal.h
> index 07167446dcf4..566cef1c0e0b 100644
> --- a/tools/testing/vma/vma_internal.h
> +++ b/tools/testing/vma/vma_internal.h
> @@ -297,11 +297,20 @@ struct vm_area_desc {
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
> +       /*
> +        * A user-defined field, value will be passed to mmap_complete,
> +        * mmap_abort.
> +        */
> +       void *mmap_context;
>  };
>
>  struct file_operations {
>         int (*mmap)(struct file *, struct vm_area_struct *);
>         int (*mmap_prepare)(struct vm_area_desc *);
> +       void (*mmap_abort)(const struct file *, const void *vm_private_data,
> +                          const void *context);
> +       int (*mmap_complete)(struct file *, struct vm_area_struct *,
> +                            const void *context);
>  };
>
>  struct file {
> @@ -1471,7 +1480,7 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
>  {
>         struct vm_area_desc desc = {
>                 .mm = vma->vm_mm,
> -               .file = vma->vm_file,
> +               .file = file,
>                 .start = vma->vm_start,
>                 .end = vma->vm_end,
>
> @@ -1485,13 +1494,21 @@ static inline int __compat_vma_mmap_prepare(const struct file_operations *f_op,
>         err = f_op->mmap_prepare(&desc);
>         if (err)
>                 return err;
> +
>         set_vma_from_desc(vma, &desc);
>
> -       return 0;
> +       /*
> +        * No error can occur between mmap_prepare() and mmap_complete so no
> +        * need to invoke mmap_abort().
> +        */
> +
> +       if (f_op->mmap_complete)
> +               err = f_op->mmap_complete(file, vma, desc.mmap_context);
> +
> +       return err;
>  }
>
> -static inline int compat_vma_mmap_prepare(struct file *file,
> -               struct vm_area_struct *vma)
> +static inline int compat_vma_mmap_prepare(struct file *file, struct vm_area_struct *vma)
>  {
>         return __compat_vma_mmap_prepare(file->f_op, file, vma);
>  }
> @@ -1548,4 +1565,10 @@ static inline vm_flags_t ksm_vma_flags(const struct mm_struct *, const struct fi
>         return vm_flags;
>  }
>
> +static inline int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> +             struct list_head *uf)
> +{
> +       return 0;
> +}
> +
>  #endif /* __MM_VMA_INTERNAL_H */
> --
> 2.51.0
>

^ permalink raw reply	[flat|nested] 84+ messages in thread

* Re: [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks
  2025-09-09 16:43           ` Suren Baghdasaryan
@ 2025-09-09 17:36             ` Lorenzo Stoakes
  0 siblings, 0 replies; 84+ messages in thread
From: Lorenzo Stoakes @ 2025-09-09 17:36 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Hildenbrand, Andrew Morton, Jonathan Corbet, Matthew Wilcox,
	Guo Ren, Thomas Bogendoerfer, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	David S . Miller, Andreas Larsson, Arnd Bergmann,
	Greg Kroah-Hartman, Dan Williams, Vishal Verma, Dave Jiang,
	Nicolas Pitre, Muchun Song, Oscar Salvador, Konstantin Komarov,
	Baoquan He, Vivek Goyal, Dave Young, Tony Luck, Reinette Chatre,
	Dave Martin, James Morse, Alexander Viro, Christian Brauner,
	Jan Kara, Liam R . Howlett, Vlastimil Babka, Mike Rapoport,
	Michal Hocko, Hugh Dickins, Baolin Wang, Uladzislau Rezki,
	Dmitry Vyukov, Andrey Konovalov, Jann Horn, Pedro Falcato,
	linux-doc, linux-kernel, linux-fsdevel, linux-csky, linux-mips,
	linux-s390, sparclinux, nvdimm, linux-cxl, linux-mm, ntfs3, kexec,
	kasan-dev, Jason Gunthorpe

On Tue, Sep 09, 2025 at 09:43:25AM -0700, Suren Baghdasaryan wrote:
> On Tue, Sep 9, 2025 at 2:37 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Sep 09, 2025 at 11:26:21AM +0200, David Hildenbrand wrote:
> > > > >
> > > > > In particular, the mmap_complete() looks like another candidate for letting
> > > > > a driver just go crazy on the vma? :)
> > > >
> > > > Well there's only so much we can do. In an ideal world we'd treat VMAs as
> > > > entirely internal data structures and pass some sort of opaque thing around, but
> > > > we have to keep things real here :)
> > >
> > > Right, we'd pass something around that cannot be easily abused (like
> > > modifying random vma flags in mmap_complete).
> > >
> > > So I was wondering if most operations that a driver would perform during the
> > > mmap_complete() could be abstracted, and only those then be called with
> > > whatever opaque thing we return here.
> >
> > Well there's 2 issues at play:
> >
> > 1. I might end up having to rewrite _large parts_ of kernel functionality, all of
> >    which relies on there being a vma parameter (or might find that to be
> >    intractable).
> >
> > 2. There's always the 'odd ones out' :) so there'll be some drivers that
> >    absolutely do need to have access to this.
> >
> > But as I was writing this I thought of an idea - why don't we have something
> > opaque like this, perhaps with accessor functions, but then _give the ability to
> > get the VMA if you REALLY have to_.
> >
> > That way we can handle both problems without too much trouble.
> >
> > Also Jason suggested generic functions that can just be assigned to
> > .mmap_complete for instance, which would obviously eliminate the crazy
> > factor a lot too.
> >
> > I'm going to refactor to try to put ONLY prepopulate logic in
> > .mmap_complete where possible, which fits with all of this.
>
> Thinking along these lines, do you have a case when mmap_abort() needs
> vm_private_data? I was thinking: if the VMA mapping failed, why would you
> need vm_private_data to unwind the prep work? You already have the context
> pointer for that, no?

Actually, I have removed mmap_abort in the latest respin :) The new version
will be a fairly substantial rewrite based on the feedback.

^ permalink raw reply	[flat|nested] 84+ messages in thread

end of thread, other threads:[~2025-09-09 17:37 UTC | newest]

Thread overview: 84+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-08 11:10 [PATCH 00/16] expand mmap_prepare functionality, port more users Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 01/16] mm/shmem: update shmem to use mmap_prepare Lorenzo Stoakes
2025-09-08 14:59   ` David Hildenbrand
2025-09-08 15:28     ` Lorenzo Stoakes
2025-09-09  3:19   ` Baolin Wang
2025-09-09  9:08     ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 02/16] device/dax: update devdax " Lorenzo Stoakes
2025-09-08 15:03   ` David Hildenbrand
2025-09-08 15:28     ` Lorenzo Stoakes
2025-09-08 15:31       ` David Hildenbrand
2025-09-08 11:10 ` [PATCH 03/16] mm: add vma_desc_size(), vma_desc_pages() helpers Lorenzo Stoakes
2025-09-08 12:51   ` Jason Gunthorpe
2025-09-08 13:12     ` Lorenzo Stoakes
2025-09-08 13:32       ` Jason Gunthorpe
2025-09-08 14:09         ` Lorenzo Stoakes
2025-09-08 14:20           ` Jason Gunthorpe
2025-09-08 14:47             ` Lorenzo Stoakes
2025-09-08 15:07               ` David Hildenbrand
2025-09-08 15:35                 ` Lorenzo Stoakes
2025-09-08 17:30                   ` David Hildenbrand
2025-09-09  9:21                     ` Lorenzo Stoakes
2025-09-08 15:16               ` Jason Gunthorpe
2025-09-08 15:24                 ` David Hildenbrand
2025-09-08 15:33                   ` Jason Gunthorpe
2025-09-08 15:46                     ` David Hildenbrand
2025-09-08 15:50                       ` David Hildenbrand
2025-09-08 15:56                         ` Jason Gunthorpe
2025-09-08 17:36                           ` David Hildenbrand
2025-09-08 20:24                             ` Lorenzo Stoakes
2025-09-08 15:33                   ` Lorenzo Stoakes
2025-09-08 15:10   ` David Hildenbrand
2025-09-08 11:10 ` [PATCH 04/16] relay: update relay to use mmap_prepare Lorenzo Stoakes
2025-09-08 15:15   ` David Hildenbrand
2025-09-08 15:29     ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 05/16] mm/vma: rename mmap internal functions to avoid confusion Lorenzo Stoakes
2025-09-08 15:19   ` David Hildenbrand
2025-09-08 15:31     ` Lorenzo Stoakes
2025-09-08 17:38       ` David Hildenbrand
2025-09-09  9:04         ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 06/16] mm: introduce the f_op->mmap_complete, mmap_abort hooks Lorenzo Stoakes
2025-09-08 12:55   ` Jason Gunthorpe
2025-09-08 13:19     ` Lorenzo Stoakes
2025-09-08 15:27   ` David Hildenbrand
2025-09-09  9:13     ` Lorenzo Stoakes
2025-09-09  9:26       ` David Hildenbrand
2025-09-09  9:37         ` Lorenzo Stoakes
2025-09-09 16:43           ` Suren Baghdasaryan
2025-09-09 17:36             ` Lorenzo Stoakes
2025-09-09 16:44   ` Suren Baghdasaryan
2025-09-08 11:10 ` [PATCH 07/16] doc: update porting, vfs documentation for mmap_[complete, abort] Lorenzo Stoakes
2025-09-08 23:17   ` Randy Dunlap
2025-09-09  9:02     ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 08/16] mm: add remap_pfn_range_prepare(), remap_pfn_range_complete() Lorenzo Stoakes
2025-09-08 13:00   ` Jason Gunthorpe
2025-09-08 13:27     ` Lorenzo Stoakes
2025-09-08 13:35       ` Jason Gunthorpe
2025-09-08 14:18         ` Lorenzo Stoakes
2025-09-08 16:03           ` Jason Gunthorpe
2025-09-08 16:07             ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 09/16] mm: introduce io_remap_pfn_range_prepare, complete Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 10/16] mm/hugetlb: update hugetlbfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
2025-09-08 13:11   ` Jason Gunthorpe
2025-09-08 13:37     ` Lorenzo Stoakes
2025-09-08 13:52       ` Jason Gunthorpe
2025-09-08 14:19         ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 11/16] mm: update mem char driver " Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 12/16] mm: update resctl to use mmap_prepare, mmap_complete, mmap_abort Lorenzo Stoakes
2025-09-08 13:24   ` Jason Gunthorpe
2025-09-08 13:40     ` Lorenzo Stoakes
2025-09-08 14:27     ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 13/16] mm: update cramfs to use mmap_prepare, mmap_complete Lorenzo Stoakes
2025-09-08 13:27   ` Jason Gunthorpe
2025-09-08 13:44     ` Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 14/16] fs/proc: add proc_mmap_[prepare, complete] hooks for procfs Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 15/16] fs/proc: update vmcore to use .proc_mmap_[prepare, complete] Lorenzo Stoakes
2025-09-08 11:10 ` [PATCH 16/16] kcov: update kcov to use mmap_prepare, mmap_complete Lorenzo Stoakes
2025-09-08 13:30   ` Jason Gunthorpe
2025-09-08 13:47     ` Lorenzo Stoakes
2025-09-08 13:27 ` [PATCH 00/16] expand mmap_prepare functionality, port more users Jan Kara
2025-09-08 14:48   ` Lorenzo Stoakes
2025-09-08 15:04     ` Jason Gunthorpe
2025-09-08 15:15       ` Lorenzo Stoakes
2025-09-09  8:31 ` Alexander Gordeev
2025-09-09  8:59   ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).