Linux-HyperV List

Linux-HyperV List
 help / color / mirror / Atom feed

* [PATCH v2 06/16] mm: add mmap_action_simple_ioremap()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Currently drivers use vm_iomap_memory() as a simple helper function for
I/O remapping memory over a range starting at a specified physical address
over a specified length.

In order to utilise this from mmap_prepare, separate out the core logic
into __simple_ioremap_prep(), update vm_iomap_memory() to use it, and add
simple_ioremap_prepare() to do the same with a VMA descriptor object.

We also add MMAP_SIMPLE_IO_REMAP and relevant fields to the struct
mmap_action type to permit this operation also.

We use mmap_action_ioremap() to set up the actual I/O remap operation once
we have checked and figured out the parameters, which makes
simple_ioremap_prepare() easy to implement.

We then add mmap_action_simple_ioremap() to allow drivers to make use of
this mode.

We update the mmap_prepare documentation to describe this mode.

Finally, we update the VMA tests to reflect this change.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 Documentation/filesystems/mmap_prepare.rst |  3 +
 include/linux/mm.h                         | 24 +++++-
 include/linux/mm_types.h                   |  6 +-
 mm/internal.h                              |  2 +
 mm/memory.c                                | 87 +++++++++++++++-------
 mm/util.c                                  | 12 +++
 tools/testing/vma/include/dup.h            |  6 +-
 7 files changed, 112 insertions(+), 28 deletions(-)

diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
index 20db474915da..be76ae475b9c 100644
--- a/Documentation/filesystems/mmap_prepare.rst
+++ b/Documentation/filesystems/mmap_prepare.rst
@@ -153,5 +153,8 @@ pointer. These are:
 * mmap_action_ioremap_full() - Same as mmap_action_ioremap(), only remaps
   the entire mapping from ``start_pfn`` onward.
 
+* mmap_action_simple_ioremap() - Sets up an I/O remap from a specified
+  physical address and over a specified length.
+
 **NOTE:** The ``action`` field should never normally be manipulated directly,
 rather you ought to use one of these helpers.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad1b8c3c0cfd..df8fa6e6402b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4337,11 +4337,33 @@ static inline void mmap_action_ioremap(struct vm_area_desc *desc,
  * @start_pfn: The first PFN in the range to remap.
  */
 static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
-					  unsigned long start_pfn)
+					    unsigned long start_pfn)
 {
 	mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
 }
 
+/**
+ * mmap_action_simple_ioremap - helper for mmap_prepare hook to specify that the
+ * physical range in [start_phys_addr, start_phys_addr + size) should be I/O
+ * remapped.
+ * @desc: The VMA descriptor for the VMA requiring remap.
+ * @start_phys_addr: Start of the physical memory to be mapped.
+ * @size: Size of the area to map.
+ *
+ * NOTE: Some drivers might want to tweak desc->page_prot for purposes of
+ * write-combine or similar.
+ */
+static inline void mmap_action_simple_ioremap(struct vm_area_desc *desc,
+					      phys_addr_t start_phys_addr,
+					      unsigned long size)
+{
+	struct mmap_action *action = &desc->action;
+
+	action->simple_ioremap.start_phys_addr = start_phys_addr;
+	action->simple_ioremap.size = size;
+	action->type = MMAP_SIMPLE_IO_REMAP;
+}
+
 int mmap_action_prepare(struct vm_area_desc *desc);
 int mmap_action_complete(struct vm_area_struct *vma,
 			 struct mmap_action *action);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4a229cc0a06b..50685cf29792 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -814,6 +814,7 @@ enum mmap_action_type {
 	MMAP_NOTHING,		/* Mapping is complete, no further action. */
 	MMAP_REMAP_PFN,		/* Remap PFN range. */
 	MMAP_IO_REMAP_PFN,	/* I/O remap PFN range. */
+	MMAP_SIMPLE_IO_REMAP,	/* I/O remap with guardrails. */
 };
 
 /*
@@ -822,13 +823,16 @@ enum mmap_action_type {
  */
 struct mmap_action {
 	union {
-		/* Remap range. */
 		struct {
 			unsigned long start;
 			unsigned long start_pfn;
 			unsigned long size;
 			pgprot_t pgprot;
 		} remap;
+		struct {
+			phys_addr_t start_phys_addr;
+			unsigned long size;
+		} simple_ioremap;
 	};
 	enum mmap_action_type type;
 
diff --git a/mm/internal.h b/mm/internal.h
index f5774892071e..0eaca2f0eb6a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1804,6 +1804,8 @@ int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 int remap_pfn_range_prepare(struct vm_area_desc *desc);
 int remap_pfn_range_complete(struct vm_area_struct *vma,
 			     struct mmap_action *action);
+int simple_ioremap_prepare(struct vm_area_desc *desc);
+/* No simple_ioremap_complete, is ultimately handled by remap complete. */
 
 static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc)
 {
diff --git a/mm/memory.c b/mm/memory.c
index 9dec67a18116..f3f4046aee97 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3170,6 +3170,59 @@ int remap_pfn_range_complete(struct vm_area_struct *vma,
 	return do_remap_pfn_range(vma, start, pfn, size, prot);
 }
 
+static int __simple_ioremap_prep(unsigned long vm_start, unsigned long vm_end,
+				 pgoff_t vm_pgoff, phys_addr_t start_phys,
+				 unsigned long size, unsigned long *pfnp)
+{
+	const unsigned long vm_len = vm_end - vm_start;
+	unsigned long pfn, pages;
+
+	/* Check that the physical memory area passed in looks valid */
+	if (start_phys + size < start_phys)
+		return -EINVAL;
+	/*
+	 * You *really* shouldn't map things that aren't page-aligned,
+	 * but we've historically allowed it because IO memory might
+	 * just have smaller alignment.
+	 */
+	size += start_phys & ~PAGE_MASK;
+	pfn = start_phys >> PAGE_SHIFT;
+	pages = (size + ~PAGE_MASK) >> PAGE_SHIFT;
+	if (pfn + pages < pfn)
+		return -EINVAL;
+
+	/* We start the mapping 'vm_pgoff' pages into the area */
+	if (vm_pgoff > pages)
+		return -EINVAL;
+	pfn += vm_pgoff;
+	pages -= vm_pgoff;
+
+	/* Can we fit all of the mapping? */
+	if ((vm_len >> PAGE_SHIFT) > pages)
+		return -EINVAL;
+
+	*pfnp = pfn;
+	return 0;
+}
+
+int simple_ioremap_prepare(struct vm_area_desc *desc)
+{
+	struct mmap_action *action = &desc->action;
+	const phys_addr_t start = action->simple_ioremap.start_phys_addr;
+	const unsigned long size = action->simple_ioremap.size;
+	unsigned long pfn;
+	int err;
+
+	err = __simple_ioremap_prep(desc->start, desc->end, desc->pgoff,
+				    start, size, &pfn);
+	if (err)
+		return err;
+
+	/* The I/O remap logic does the heavy lifting. */
+	mmap_action_ioremap(desc, desc->start, pfn, vma_desc_size(desc));
+	return mmap_action_prepare(desc);
+}
+
 /**
  * vm_iomap_memory - remap memory to userspace
  * @vma: user vma to map to
@@ -3187,32 +3240,16 @@ int remap_pfn_range_complete(struct vm_area_struct *vma,
  */
 int vm_iomap_memory(struct vm_area_struct *vma, phys_addr_t start, unsigned long len)
 {
-	unsigned long vm_len, pfn, pages;
-
-	/* Check that the physical memory area passed in looks valid */
-	if (start + len < start)
-		return -EINVAL;
-	/*
-	 * You *really* shouldn't map things that aren't page-aligned,
-	 * but we've historically allowed it because IO memory might
-	 * just have smaller alignment.
-	 */
-	len += start & ~PAGE_MASK;
-	pfn = start >> PAGE_SHIFT;
-	pages = (len + ~PAGE_MASK) >> PAGE_SHIFT;
-	if (pfn + pages < pfn)
-		return -EINVAL;
-
-	/* We start the mapping 'vm_pgoff' pages into the area */
-	if (vma->vm_pgoff > pages)
-		return -EINVAL;
-	pfn += vma->vm_pgoff;
-	pages -= vma->vm_pgoff;
+	const unsigned long vm_start = vma->vm_start;
+	const unsigned long vm_end = vma->vm_end;
+	const unsigned long vm_len = vm_end - vm_start;
+	unsigned long pfn;
+	int err;
 
-	/* Can we fit all of the mapping? */
-	vm_len = vma->vm_end - vma->vm_start;
-	if (vm_len >> PAGE_SHIFT > pages)
-		return -EINVAL;
+	err = __simple_ioremap_prep(vm_start, vm_end, vma->vm_pgoff, start,
+				    len, &pfn);
+	if (err)
+		return err;
 
 	/* Ok, let it rip */
 	return io_remap_pfn_range(vma, vma->vm_start, pfn, vm_len, vma->vm_page_prot);
diff --git a/mm/util.c b/mm/util.c
index cdfba09e50d7..aa92e471afe1 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1390,6 +1390,8 @@ int mmap_action_prepare(struct vm_area_desc *desc)
 		return remap_pfn_range_prepare(desc);
 	case MMAP_IO_REMAP_PFN:
 		return io_remap_pfn_range_prepare(desc);
+	case MMAP_SIMPLE_IO_REMAP:
+		return simple_ioremap_prepare(desc);
 	}
 
 	WARN_ON_ONCE(1);
@@ -1421,6 +1423,14 @@ int mmap_action_complete(struct vm_area_struct *vma,
 	case MMAP_IO_REMAP_PFN:
 		err = io_remap_pfn_range_complete(vma, action);
 		break;
+	case MMAP_SIMPLE_IO_REMAP:
+		/*
+		 * The simple I/O remap should have been delegated to an I/O
+		 * remap.
+		 */
+		WARN_ON_ONCE(1);
+		err = -EINVAL;
+		break;
 	}
 
 	return mmap_action_finish(vma, action, err);
@@ -1434,6 +1444,7 @@ int mmap_action_prepare(struct vm_area_desc *desc)
 		break;
 	case MMAP_REMAP_PFN:
 	case MMAP_IO_REMAP_PFN:
+	case MMAP_SIMPLE_IO_REMAP:
 		WARN_ON_ONCE(1); /* nommu cannot handle these. */
 		break;
 	}
@@ -1452,6 +1463,7 @@ int mmap_action_complete(struct vm_area_struct *vma,
 		break;
 	case MMAP_REMAP_PFN:
 	case MMAP_IO_REMAP_PFN:
+	case MMAP_SIMPLE_IO_REMAP:
 		WARN_ON_ONCE(1); /* nommu cannot handle this. */
 
 		err = -EINVAL;
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index 4570ec77f153..114daaef4f73 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -453,6 +453,7 @@ enum mmap_action_type {
 	MMAP_NOTHING,		/* Mapping is complete, no further action. */
 	MMAP_REMAP_PFN,		/* Remap PFN range. */
 	MMAP_IO_REMAP_PFN,	/* I/O remap PFN range. */
+	MMAP_SIMPLE_IO_REMAP,	/* I/O remap with guardrails. */
 };
 
 /*
@@ -461,13 +462,16 @@ enum mmap_action_type {
  */
 struct mmap_action {
 	union {
-		/* Remap range. */
 		struct {
 			unsigned long start;
 			unsigned long start_pfn;
 			unsigned long size;
 			pgprot_t pgprot;
 		} remap;
+		struct {
+			phys_addr_t start;
+			unsigned long len;
+		} simple_ioremap;
 	};
 	enum mmap_action_type type;
 
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 05/16] fs: afs: correctly drop reference count on mapping failure
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Commit 9d5403b1036c ("fs: convert most other generic_file_*mmap() users to
.mmap_prepare()") updated AFS to use the mmap_prepare callback in favour of
the deprecated mmap callback.

However, it did not account for the fact that mmap_prepare is called
pre-merge, and may then be merged, nor that mmap_prepare can fail to map
due to an out of memory error.

Both of those are cases in which we should not be incrementing a reference
count.

With the newly added vm_ops->mapped callback available, we can simply defer
this operation to that callback which is only invoked once the mapping is
successfully in place (but not yet visible to userspace as the mmap and VMA
write locks are held).

Therefore add afs_mapped() to implement this callback for AFS, and remove
the code doing so in afs_mmap_prepare().

Also update afs_vm_open(), afs_vm_close() and afs_vm_map_pages() to be
consistent in how the vnode is accessed.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 fs/afs/file.c | 36 ++++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/fs/afs/file.c b/fs/afs/file.c
index f609366fd2ac..85696ac984cc 100644
--- a/fs/afs/file.c
+++ b/fs/afs/file.c
@@ -28,6 +28,8 @@ static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
 static void afs_vm_open(struct vm_area_struct *area);
 static void afs_vm_close(struct vm_area_struct *area);
 static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
+static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
+		      const struct file *file, void **vm_private_data);
 
 const struct file_operations afs_file_operations = {
 	.open		= afs_open,
@@ -61,6 +63,7 @@ const struct address_space_operations afs_file_aops = {
 };
 
 static const struct vm_operations_struct afs_vm_ops = {
+	.mapped		= afs_mapped,
 	.open		= afs_vm_open,
 	.close		= afs_vm_close,
 	.fault		= filemap_fault,
@@ -494,32 +497,45 @@ static void afs_drop_open_mmap(struct afs_vnode *vnode)
  */
 static int afs_file_mmap_prepare(struct vm_area_desc *desc)
 {
-	struct afs_vnode *vnode = AFS_FS_I(file_inode(desc->file));
 	int ret;
 
-	afs_add_open_mmap(vnode);
-
 	ret = generic_file_mmap_prepare(desc);
-	if (ret == 0)
-		desc->vm_ops = &afs_vm_ops;
-	else
-		afs_drop_open_mmap(vnode);
+	if (ret)
+		return ret;
+
+	desc->vm_ops = &afs_vm_ops;
 	return ret;
 }
 
+static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
+		      const struct file *file, void **vm_private_data)
+{
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
+
+	afs_add_open_mmap(vnode);
+	return 0;
+}
+
 static void afs_vm_open(struct vm_area_struct *vma)
 {
-	afs_add_open_mmap(AFS_FS_I(file_inode(vma->vm_file)));
+	struct file *file = vma->vm_file;
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
+
+	afs_add_open_mmap(vnode);
 }
 
 static void afs_vm_close(struct vm_area_struct *vma)
 {
-	afs_drop_open_mmap(AFS_FS_I(file_inode(vma->vm_file)));
+	struct file *file = vma->vm_file;
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
+
+	afs_drop_open_mmap(vnode);
 }
 
 static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
-	struct afs_vnode *vnode = AFS_FS_I(file_inode(vmf->vma->vm_file));
+	struct file *file = vmf->vma->vm_file;
+	struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
 
 	if (afs_check_validity(vnode))
 		return filemap_map_pages(vmf, start_pgoff, end_pgoff);
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 04/16] mm: add vm_ops->mapped hook
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Previously, when a driver needed to do something like establish a
reference count, it could do so in the mmap hook in the knowledge that the
mapping would succeed.

With the introduction of f_op->mmap_prepare this is no longer the case, as
it is invoked prior to actually establishing the mapping.

mmap_prepare is not appropriate for this kind of thing as it is called
before any merge might take place, and after which an error might occur
meaning resources could be leaked.

To take this into account, introduce a new vm_ops->mapped callback which
is invoked when the VMA is first mapped (though notably - not when it is
merged - which is correct and mirrors existing mmap/open/close behaviour).

We do better that vm_ops->open() here, as this callback can return an
error, at which point the VMA will be unmapped.

Note that vm_ops->mapped() is invoked after any mmap action is complete
(such as I/O remapping).

We intentionally do not expose the VMA at this point, exposing only the
fields that could be used, and an output parameter in case the operation
needs to update the vma->vm_private_data field.

In order to deal with stacked filesystems which invoke inner filesystem's
mmap() invocations, add __compat_vma_mapped() and invoke it on vfs_mmap()
(via compat_vma_mmap()) to ensure that the mapped callback is handled when
an mmap() caller invokes a nested filesystem's mmap_prepare() callback.

We can now also remove call_action_complete() and invoke
mmap_action_complete() directly, as we separate out the rmap lock logic to
be called in __mmap_region() instead via maybe_drop_file_rmap_lock().

We also abstract unmapping of a VMA on mmap action completion into its own
helper function, unmap_vma_locked().

Update the mmap_prepare documentation to describe the mapped hook and make
it clear what its intended use is.

Additionally, update VMA userland test headers to reflect the change.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 Documentation/filesystems/mmap_prepare.rst | 15 ++++
 include/linux/fs.h                         |  9 ++-
 include/linux/mm.h                         | 17 +++++
 mm/internal.h                              |  8 ++
 mm/util.c                                  | 85 ++++++++++++++++------
 mm/vma.c                                   | 41 ++++++++---
 tools/testing/vma/include/dup.h            | 27 ++++++-
 7 files changed, 164 insertions(+), 38 deletions(-)

diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
index 65a1f094e469..20db474915da 100644
--- a/Documentation/filesystems/mmap_prepare.rst
+++ b/Documentation/filesystems/mmap_prepare.rst
@@ -25,6 +25,21 @@ That is - no resources should be allocated nor state updated to reflect that a
 mapping has been established, as the mapping may either be merged, or fail to be
 mapped after the callback is complete.
 
+Mapped callback
+---------------
+
+If resources need to be allocated per-mapping, or state such as a reference
+count needs to be manipulated, this should be done using the ``vm_ops->mapped``
+hook, which itself should be set by the >mmap_prepare hook.
+
+This callback is only invoked if a new mapping has been established and was not
+merged with any other, and is invoked at a point where no error may occur before
+the mapping is established.
+
+You may return an error to the callback itself, which will cause the mapping to
+become unmapped and an error returned to the mmap() caller. This is useful if
+resources need to be allocated, and that allocation might fail.
+
 How To Use
 ==========
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a2628a12bd2b..c390f5c667e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2059,13 +2059,20 @@ static inline bool can_mmap_file(struct file *file)
 }
 
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
+int __vma_check_mmap_hook(struct vm_area_struct *vma);
 
 static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	int err;
+
 	if (file->f_op->mmap_prepare)
 		return compat_vma_mmap(file, vma);
 
-	return file->f_op->mmap(file, vma);
+	err = file->f_op->mmap(file, vma);
+	if (err)
+		return err;
+
+	return __vma_check_mmap_hook(vma);
 }
 
 static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da94edb287cd..ad1b8c3c0cfd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -777,6 +777,23 @@ struct vm_operations_struct {
 	 * Context: User context.  May sleep.  Caller holds mmap_lock.
 	 */
 	void (*close)(struct vm_area_struct *vma);
+	/**
+	 * @mapped: Called when the VMA is first mapped in the MM. Not called if
+	 * the new VMA is merged with an adjacent VMA.
+	 *
+	 * The @vm_private_data field is an output field allowing the user to
+	 * modify vma->vm_private_data as necessary.
+	 *
+	 * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
+	 * set from f_op->mmap.
+	 *
+	 * Returns %0 on success, or an error otherwise. On error, the VMA will
+	 * be unmapped.
+	 *
+	 * Context: User context.  May sleep.  Caller holds mmap_lock.
+	 */
+	int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
+		      const struct file *file, void **vm_private_data);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
 	int (*mremap)(struct vm_area_struct *vma);
diff --git a/mm/internal.h b/mm/internal.h
index 9e42a57e8a12..f5774892071e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -202,6 +202,14 @@ static inline void vma_close(struct vm_area_struct *vma)
 /* unmap_vmas is in mm/memory.c */
 void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap);
 
+static inline void unmap_vma_locked(struct vm_area_struct *vma)
+{
+	const size_t len = vma_pages(vma) << PAGE_SHIFT;
+
+	mmap_assert_write_locked(vma->vm_mm);
+	do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
+}
+
 #ifdef CONFIG_MMU
 
 static inline void get_anon_vma(struct anon_vma *anon_vma)
diff --git a/mm/util.c b/mm/util.c
index ac9dd6490523..cdfba09e50d7 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1163,6 +1163,54 @@ void flush_dcache_folio(struct folio *folio)
 EXPORT_SYMBOL(flush_dcache_folio);
 #endif
 
+static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct vm_area_desc desc = {
+		.mm = vma->vm_mm,
+		.file = file,
+		.start = vma->vm_start,
+		.end = vma->vm_end,
+
+		.pgoff = vma->vm_pgoff,
+		.vm_file = vma->vm_file,
+		.vma_flags = vma->flags,
+		.page_prot = vma->vm_page_prot,
+
+		.action.type = MMAP_NOTHING, /* Default */
+	};
+	int err;
+
+	err = vfs_mmap_prepare(file, &desc);
+	if (err)
+		return err;
+
+	err = mmap_action_prepare(&desc);
+	if (err)
+		return err;
+
+	set_vma_from_desc(vma, &desc);
+	return mmap_action_complete(vma, &desc.action);
+}
+
+static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
+{
+	const struct vm_operations_struct *vm_ops = vma->vm_ops;
+	void *vm_private_data = vma->vm_private_data;
+	int err;
+
+	if (!vm_ops || !vm_ops->mapped)
+		return 0;
+
+	err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
+			     &vm_private_data);
+	if (err)
+		unmap_vma_locked(vma);
+	else if (vm_private_data != vma->vm_private_data)
+		vma->vm_private_data = vm_private_data;
+
+	return err;
+}
+
 /**
  * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
  * existing VMA and execute any requested actions.
@@ -1191,34 +1239,26 @@ EXPORT_SYMBOL(flush_dcache_folio);
  */
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct vm_area_desc desc = {
-		.mm = vma->vm_mm,
-		.file = file,
-		.start = vma->vm_start,
-		.end = vma->vm_end,
-
-		.pgoff = vma->vm_pgoff,
-		.vm_file = vma->vm_file,
-		.vma_flags = vma->flags,
-		.page_prot = vma->vm_page_prot,
-
-		.action.type = MMAP_NOTHING, /* Default */
-	};
 	int err;
 
-	err = vfs_mmap_prepare(file, &desc);
-	if (err)
-		return err;
-
-	err = mmap_action_prepare(&desc);
+	err = __compat_vma_mmap(file, vma);
 	if (err)
 		return err;
 
-	set_vma_from_desc(vma, &desc);
-	return mmap_action_complete(vma, &desc.action);
+	return __compat_vma_mapped(file, vma);
 }
 EXPORT_SYMBOL(compat_vma_mmap);
 
+int __vma_check_mmap_hook(struct vm_area_struct *vma)
+{
+	/* vm_ops->mapped is not valid if mmap() is specified. */
+	if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
+		return -EINVAL;
+
+	return 0;
+}
+EXPORT_SYMBOL(__vma_check_mmap_hook);
+
 static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
 			 const struct page *page)
 {
@@ -1316,10 +1356,7 @@ static int mmap_action_finish(struct vm_area_struct *vma,
 	 * invoked if we do NOT merge, so we only clean up the VMA we created.
 	 */
 	if (err) {
-		const size_t len = vma_pages(vma) << PAGE_SHIFT;
-
-		do_munmap(current->mm, vma->vm_start, len, NULL);
-
+		unmap_vma_locked(vma);
 		if (action->error_hook) {
 			/* We may want to filter the error. */
 			err = action->error_hook(err);
diff --git a/mm/vma.c b/mm/vma.c
index 2a86c7575000..3a0fb2caa1c6 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2731,21 +2731,35 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 	return false;
 }
 
-static int call_action_complete(struct mmap_state *map,
-				struct mmap_action *action,
-				struct vm_area_struct *vma)
+static int call_mapped_hook(struct vm_area_struct *vma)
 {
-	int ret;
+	const struct vm_operations_struct *vm_ops = vma->vm_ops;
+	void *vm_private_data = vma->vm_private_data;
+	int err;
 
-	ret = mmap_action_complete(vma, action);
+	if (!vm_ops || !vm_ops->mapped)
+		return 0;
+	err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff,
+			     vma->vm_file, &vm_private_data);
+	if (err) {
+		unmap_vma_locked(vma);
+		return err;
+	}
+	/* Update private data if changed. */
+	if (vm_private_data != vma->vm_private_data)
+		vma->vm_private_data = vm_private_data;
+	return 0;
+}
 
-	/* If we held the file rmap we need to release it. */
-	if (map->hold_file_rmap_lock) {
-		struct file *file = vma->vm_file;
+static void maybe_drop_file_rmap_lock(struct mmap_state *map,
+				      struct vm_area_struct *vma)
+{
+	struct file *file;
 
-		i_mmap_unlock_write(file->f_mapping);
-	}
-	return ret;
+	if (!map->hold_file_rmap_lock)
+		return;
+	file = vma->vm_file;
+	i_mmap_unlock_write(file->f_mapping);
 }
 
 static unsigned long __mmap_region(struct file *file, unsigned long addr,
@@ -2799,8 +2813,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	__mmap_complete(&map, vma);
 
 	if (have_mmap_prepare && allocated_new) {
-		error = call_action_complete(&map, &desc.action, vma);
+		error = mmap_action_complete(vma, &desc.action);
+		if (!error)
+			error = call_mapped_hook(vma);
 
+		maybe_drop_file_rmap_lock(&map, vma);
 		if (error)
 			return error;
 	}
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index ccf1f061c65a..4570ec77f153 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -642,7 +642,24 @@ struct vm_operations_struct {
 	 * @close: Called when the VMA is being removed from the MM.
 	 * Context: User context.  May sleep.  Caller holds mmap_lock.
 	 */
-	void (*close)(struct vm_area_struct * area);
+	void (*close)(struct vm_area_struct *vma);
+	/**
+	 * @mapped: Called when the VMA is first mapped in the MM. Not called if
+	 * the new VMA is merged with an adjacent VMA.
+	 *
+	 * The @vm_private_data field is an output field allowing the user to
+	 * modify vma->vm_private_data as necessary.
+	 *
+	 * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
+	 * set from f_op->mmap.
+	 *
+	 * Returns %0 on success, or an error otherwise. On error, the VMA will
+	 * be unmapped.
+	 *
+	 * Context: User context.  May sleep.  Caller holds mmap_lock.
+	 */
+	int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
+		      const struct file *file, void **vm_private_data);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
 	int (*mremap)(struct vm_area_struct *area);
@@ -1500,3 +1517,11 @@ static inline pgprot_t vma_get_page_prot(vma_flags_t vma_flags)
 
 	return vm_get_page_prot(vm_flags);
 }
+
+static inline void unmap_vma_locked(struct vm_area_struct *vma)
+{
+	const size_t len = vma_pages(vma) << PAGE_SHIFT;
+
+	mmap_assert_write_locked(vma->vm_mm);
+	do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
+}
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 03/16] mm: document vm_operations_struct->open the same as close()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Describe when the operation is invoked and the context in which it is
invoked, matching the description already added for vm_op->close().

While we're here, update all outdated references to an 'area' field for
VMAs to the more consistent 'vma'.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 include/linux/mm.h              | 15 ++++++++++-----
 tools/testing/vma/include/dup.h |  5 +++++
 2 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1e63b3a44a47..da94edb287cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -766,15 +766,20 @@ struct vm_uffd_ops;
  * to the functions called when a no-page or a wp-page exception occurs.
  */
 struct vm_operations_struct {
-	void (*open)(struct vm_area_struct * area);
+	/**
+	 * @open: Called when a VMA is remapped, split or forked. Not called
+	 * upon first mapping a VMA.
+	 * Context: User context.  May sleep.  Caller holds mmap_lock.
+	 */
+	void (*open)(struct vm_area_struct *vma);
 	/**
 	 * @close: Called when the VMA is being removed from the MM.
 	 * Context: User context.  May sleep.  Caller holds mmap_lock.
 	 */
-	void (*close)(struct vm_area_struct * area);
+	void (*close)(struct vm_area_struct *vma);
 	/* Called any time before splitting to check if it's allowed */
-	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-	int (*mremap)(struct vm_area_struct *area);
+	int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
+	int (*mremap)(struct vm_area_struct *vma);
 	/*
 	 * Called by mprotect() to make driver-specific permission
 	 * checks before mprotect() is finalised.   The VMA must not
@@ -786,7 +791,7 @@ struct vm_operations_struct {
 	vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order);
 	vm_fault_t (*map_pages)(struct vm_fault *vmf,
 			pgoff_t start_pgoff, pgoff_t end_pgoff);
-	unsigned long (*pagesize)(struct vm_area_struct * area);
+	unsigned long (*pagesize)(struct vm_area_struct *vma);
 
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index 9eada1e0949c..ccf1f061c65a 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -632,6 +632,11 @@ struct vm_area_struct {
 } __randomize_layout;
 
 struct vm_operations_struct {
+	/**
+	 * @open: Called when a VMA is remapped, split or forked. Not called
+	 * upon first mapping a VMA.
+	 * Context: User context.  May sleep.  Caller holds mmap_lock.
+	 */
 	void (*open)(struct vm_area_struct * area);
 	/**
 	 * @close: Called when the VMA is being removed from the MM.
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 02/16] mm: add documentation for the mmap_prepare file operation callback
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

This documentation makes it easier for a driver/file system implementer to
correctly use this callback.

It covers the fundamentals, whilst intentionally leaving the less lovely
possible actions one might take undocumented (for instance - the
success_hook, error_hook fields in mmap_action).

The document also covers the new VMA flags implementation which is the
only one which will work correctly with mmap_prepare.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 Documentation/filesystems/index.rst        |   1 +
 Documentation/filesystems/mmap_prepare.rst | 142 +++++++++++++++++++++
 2 files changed, 143 insertions(+)
 create mode 100644 Documentation/filesystems/mmap_prepare.rst

diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index f4873197587d..6cbc3e0292ae 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,6 +29,7 @@ algorithms work.
    fiemap
    files
    locks
+   mmap_prepare
    multigrain-ts
    mount_api
    quota
diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
new file mode 100644
index 000000000000..65a1f094e469
--- /dev/null
+++ b/Documentation/filesystems/mmap_prepare.rst
@@ -0,0 +1,142 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===========================
+mmap_prepare callback HOWTO
+===========================
+
+Introduction
+============
+
+The ``struct file->f_op->mmap()`` callback has been deprecated as it is both a
+stability and security risk, and doesn't always permit the merging of adjacent
+mappings resulting in unnecessary memory fragmentation.
+
+It has been replaced with the ``file->f_op->mmap_prepare()`` callback which
+solves these problems.
+
+This hook is called right at the beginning of setting up the mapping, and
+importantly it is invoked *before* any merging of adjacent mappings has taken
+place.
+
+If an error arises upon mapping, it might arise after this callback has been
+invoked, therefore it should be treated as effectively stateless.
+
+That is - no resources should be allocated nor state updated to reflect that a
+mapping has been established, as the mapping may either be merged, or fail to be
+mapped after the callback is complete.
+
+How To Use
+==========
+
+In your driver's struct file_operations struct, specify an ``mmap_prepare``
+callback rather than an ``mmap`` one, e.g. for ext4:
+
+.. code-block:: C
+
+    const struct file_operations ext4_file_operations = {
+        ...
+        .mmap_prepare    = ext4_file_mmap_prepare,
+    };
+
+This has a signature of ``int (*mmap_prepare)(struct vm_area_desc *)``.
+
+Examining the struct vm_area_desc type:
+
+.. code-block:: C
+
+    struct vm_area_desc {
+        /* Immutable state. */
+        const struct mm_struct *const mm;
+        struct file *const file; /* May vary from vm_file in stacked callers. */
+        unsigned long start;
+        unsigned long end;
+
+        /* Mutable fields. Populated with initial state. */
+        pgoff_t pgoff;
+        struct file *vm_file;
+        vma_flags_t vma_flags;
+        pgprot_t page_prot;
+
+        /* Write-only fields. */
+        const struct vm_operations_struct *vm_ops;
+        void *private_data;
+
+        /* Take further action? */
+        struct mmap_action action;
+    };
+
+This is straightforward - you have all the fields you need to set up the
+mapping, and you can update the mutable and writable fields, for instance:
+
+.. code-block:: C
+
+    static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
+    {
+        int ret;
+        struct file *file = desc->file;
+        struct inode *inode = file->f_mapping->host;
+
+        ...
+
+        file_accessed(file);
+        if (IS_DAX(file_inode(file))) {
+            desc->vm_ops = &ext4_dax_vm_ops;
+            vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT);
+        } else {
+            desc->vm_ops = &ext4_file_vm_ops;
+        }
+        return 0;
+    }
+
+Importantly, you no longer have to dance around with reference counts or locks
+when updating these fields - **you can simply go ahead and change them**.
+
+Everything is taken care of by the mapping code.
+
+VMA Flags
+---------
+
+Along with ``mmap_prepare``, VMA flags have undergone an overhaul. Where before
+you would invoke one of vm_flags_init(), vm_flags_reset(), vm_flags_set(),
+vm_flags_clear(), and vm_flags_mod() to modify flags (and to have the
+locking done correctly for you, this is no longer necessary.
+
+Also, the legacy approach of specifying VMA flags via ``VM_READ``, ``VM_WRITE``,
+etc. - i.e. using a ``-VM_xxx``- macro has changed too.
+
+When implementing mmap_prepare(), reference flags by their bit number, defined
+as a ``VMA_xxx_BIT`` macro, e.g. ``VMA_READ_BIT``, ``VMA_WRITE_BIT`` etc.,
+and use one of (where ``desc`` is a pointer to struct vm_area_desc):
+
+* ``vma_desc_test_flags(desc, ...)`` - Specify a comma-separated list of flags
+  you wish to test for (whether _any_ are set), e.g. - ``vma_desc_test_flags(
+  desc, VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`` - returns ``true`` if either are set,
+  otherwise ``false``.
+* ``vma_desc_set_flags(desc, ...)`` - Update the VMA descriptor flags to set
+  additional flags specified by a comma-separated list,
+  e.g. - ``vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)``.
+* ``vma_desc_clear_flags(desc, ...)`` - Update the VMA descriptor flags to clear
+  flags specified by a comma-separated list, e.g. - ``vma_desc_clear_flags(
+  desc, VMA_WRITE_BIT, VMA_MAYWRITE_BIT)``.
+
+Actions
+=======
+
+You can now very easily have actions be performed upon a mapping once set up by
+utilising simple helper functions invoked upon the struct vm_area_desc
+pointer. These are:
+
+* mmap_action_remap() - Remaps a range consisting only of PFNs for a specific
+  range starting a virtual address and PFN number of a set size.
+
+* mmap_action_remap_full() - Same as mmap_action_remap(), only remaps the
+  entire mapping from ``start_pfn`` onward.
+
+* mmap_action_ioremap() - Same as mmap_action_remap(), only performs an I/O
+  remap.
+
+* mmap_action_ioremap_full() - Same as mmap_action_ioremap(), only remaps
+  the entire mapping from ``start_pfn`` onward.
+
+**NOTE:** The ``action`` field should never normally be manipulated directly,
+rather you ought to use one of these helpers.
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 01/16] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Rather than passing arbitrary fields, pass a vm_area_desc pointer to mmap
prepare functions to mmap prepare, and an action and vma pointer to mmap
complete in order to put all the action-specific logic in the function
actually doing the work.

Additionally, allow mmap prepare functions to return an error so we can
error out as soon as possible if there is something logically incorrect in
the input.

Update remap_pfn_range_prepare() to properly check the input range for the
CoW case.

While we're here, make remap_pfn_range_prepare_vma() a little neater, and
pass mmap_action directly to call_action_complete().

Then, update compat_vma_mmap() to perform its logic directly, as
__compat_vma_map() is not used by anything so we don't need to export it.

Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than
calling the mmap_prepare op directly.

Finally, update the VMA userland tests to reflect the changes.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 include/linux/fs.h                |   2 -
 include/linux/mm.h                |   7 +-
 mm/internal.h                     |  27 ++++---
 mm/memory.c                       |  45 +++++++----
 mm/util.c                         | 119 +++++++++++++-----------------
 mm/vma.c                          |  24 +++---
 tools/testing/vma/include/dup.h   |   7 +-
 tools/testing/vma/include/stubs.h |   8 +-
 8 files changed, 123 insertions(+), 116 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b3dd145b25e..a2628a12bd2b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2058,8 +2058,6 @@ static inline bool can_mmap_file(struct file *file)
 	return true;
 }
 
-int __compat_vma_mmap(const struct file_operations *f_op,
-		struct file *file, struct vm_area_struct *vma);
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
 
 static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 42cc40aa63d9..1e63b3a44a47 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4320,10 +4320,9 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
 	mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
 }
 
-void mmap_action_prepare(struct mmap_action *action,
-			 struct vm_area_desc *desc);
-int mmap_action_complete(struct mmap_action *action,
-			 struct vm_area_struct *vma);
+int mmap_action_prepare(struct vm_area_desc *desc);
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action);
 
 /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
diff --git a/mm/internal.h b/mm/internal.h
index 708d240b4198..9e42a57e8a12 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1793,26 +1793,31 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
 int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
-void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
-int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t pgprot);
+int remap_pfn_range_prepare(struct vm_area_desc *desc);
+int remap_pfn_range_complete(struct vm_area_struct *vma,
+			     struct mmap_action *action);
 
-static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
-		unsigned long orig_pfn, unsigned long size)
+static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc)
 {
+	struct mmap_action *action = &desc->action;
+	const unsigned long orig_pfn = action->remap.start_pfn;
+	const unsigned long size = action->remap.size;
 	const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
 
-	return remap_pfn_range_prepare(desc, pfn);
+	action->remap.start_pfn = pfn;
+	return remap_pfn_range_prepare(desc);
 }
 
 static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long orig_pfn, unsigned long size,
-		pgprot_t orig_prot)
+					      struct mmap_action *action)
 {
-	const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
-	const pgprot_t prot = pgprot_decrypted(orig_prot);
+	const unsigned long size = action->remap.size;
+	const unsigned long orig_pfn = action->remap.start_pfn;
+	const pgprot_t orig_prot = vma->vm_page_prot;
 
-	return remap_pfn_range_complete(vma, addr, pfn, size, prot);
+	action->remap.pgprot = pgprot_decrypted(orig_prot);
+	action->remap.start_pfn  = io_remap_pfn_range_pfn(orig_pfn, size);
+	return remap_pfn_range_complete(vma, action);
 }
 
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memory.c b/mm/memory.c
index 219b9bf6cae0..9dec67a18116 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3099,26 +3099,34 @@ static int do_remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
-void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+int remap_pfn_range_prepare(struct vm_area_desc *desc)
 {
-	/*
-	 * We set addr=VMA start, end=VMA end here, so this won't fail, but we
-	 * check it again on complete and will fail there if specified addr is
-	 * invalid.
-	 */
-	get_remap_pgoff(vma_desc_is_cow_mapping(desc), desc->start, desc->end,
-			desc->start, desc->end, pfn, &desc->pgoff);
+	const struct mmap_action *action = &desc->action;
+	const unsigned long start = action->remap.start;
+	const unsigned long end = start + action->remap.size;
+	const unsigned long pfn = action->remap.start_pfn;
+	const bool is_cow = vma_desc_is_cow_mapping(desc);
+	int err;
+
+	err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
+			      &desc->pgoff);
+	if (err)
+		return err;
+
 	vma_desc_set_flags_mask(desc, VMA_REMAP_FLAGS);
+	return 0;
 }
 
-static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size)
+static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma,
+				       unsigned long addr, unsigned long pfn,
+				       unsigned long size)
 {
-	unsigned long end = addr + PAGE_ALIGN(size);
+	const unsigned long end = addr + PAGE_ALIGN(size);
+	const bool is_cow = is_cow_mapping(vma->vm_flags);
 	int err;
 
-	err = get_remap_pgoff(is_cow_mapping(vma->vm_flags), addr, end,
-			      vma->vm_start, vma->vm_end, pfn, &vma->vm_pgoff);
+	err = get_remap_pgoff(is_cow, addr, end, vma->vm_start, vma->vm_end,
+			      pfn, &vma->vm_pgoff);
 	if (err)
 		return err;
 
@@ -3151,10 +3159,15 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
-int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+int remap_pfn_range_complete(struct vm_area_struct *vma,
+			     struct mmap_action *action)
 {
-	return do_remap_pfn_range(vma, addr, pfn, size, prot);
+	const unsigned long start = action->remap.start;
+	const unsigned long pfn = action->remap.start_pfn;
+	const unsigned long size = action->remap.size;
+	const pgprot_t prot = action->remap.pgprot;
+
+	return do_remap_pfn_range(vma, start, pfn, size, prot);
 }
 
 /**
diff --git a/mm/util.c b/mm/util.c
index ce7ae80047cf..ac9dd6490523 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1163,43 +1163,6 @@ void flush_dcache_folio(struct folio *folio)
 EXPORT_SYMBOL(flush_dcache_folio);
 #endif
 
-/**
- * __compat_vma_mmap() - See description for compat_vma_mmap()
- * for details. This is the same operation, only with a specific file operations
- * struct which may or may not be the same as vma->vm_file->f_op.
- * @f_op: The file operations whose .mmap_prepare() hook is specified.
- * @file: The file which backs or will back the mapping.
- * @vma: The VMA to apply the .mmap_prepare() hook to.
- * Returns: 0 on success or error.
- */
-int __compat_vma_mmap(const struct file_operations *f_op,
-		struct file *file, struct vm_area_struct *vma)
-{
-	struct vm_area_desc desc = {
-		.mm = vma->vm_mm,
-		.file = file,
-		.start = vma->vm_start,
-		.end = vma->vm_end,
-
-		.pgoff = vma->vm_pgoff,
-		.vm_file = vma->vm_file,
-		.vma_flags = vma->flags,
-		.page_prot = vma->vm_page_prot,
-
-		.action.type = MMAP_NOTHING, /* Default */
-	};
-	int err;
-
-	err = f_op->mmap_prepare(&desc);
-	if (err)
-		return err;
-
-	mmap_action_prepare(&desc.action, &desc);
-	set_vma_from_desc(vma, &desc);
-	return mmap_action_complete(&desc.action, vma);
-}
-EXPORT_SYMBOL(__compat_vma_mmap);
-
 /**
  * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
  * existing VMA and execute any requested actions.
@@ -1228,7 +1191,31 @@ EXPORT_SYMBOL(__compat_vma_mmap);
  */
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	return __compat_vma_mmap(file->f_op, file, vma);
+	struct vm_area_desc desc = {
+		.mm = vma->vm_mm,
+		.file = file,
+		.start = vma->vm_start,
+		.end = vma->vm_end,
+
+		.pgoff = vma->vm_pgoff,
+		.vm_file = vma->vm_file,
+		.vma_flags = vma->flags,
+		.page_prot = vma->vm_page_prot,
+
+		.action.type = MMAP_NOTHING, /* Default */
+	};
+	int err;
+
+	err = vfs_mmap_prepare(file, &desc);
+	if (err)
+		return err;
+
+	err = mmap_action_prepare(&desc);
+	if (err)
+		return err;
+
+	set_vma_from_desc(vma, &desc);
+	return mmap_action_complete(vma, &desc.action);
 }
 EXPORT_SYMBOL(compat_vma_mmap);
 
@@ -1320,8 +1307,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 	}
 }
 
-static int mmap_action_finish(struct mmap_action *action,
-		const struct vm_area_struct *vma, int err)
+static int mmap_action_finish(struct vm_area_struct *vma,
+			      struct mmap_action *action, int err)
 {
 	/*
 	 * If an error occurs, unmap the VMA altogether and return an error. We
@@ -1353,37 +1340,38 @@ static int mmap_action_finish(struct mmap_action *action,
 /**
  * mmap_action_prepare - Perform preparatory setup for an VMA descriptor
  * action which need to be performed.
- * @desc: The VMA descriptor to prepare for @action.
- * @action: The action to perform.
+ * @desc: The VMA descriptor to prepare for its @desc->action.
+ *
+ * Returns: %0 on success, otherwise error.
  */
-void mmap_action_prepare(struct mmap_action *action,
-			 struct vm_area_desc *desc)
+int mmap_action_prepare(struct vm_area_desc *desc)
 {
-	switch (action->type) {
+	switch (desc->action.type) {
 	case MMAP_NOTHING:
-		break;
+		return 0;
 	case MMAP_REMAP_PFN:
-		remap_pfn_range_prepare(desc, action->remap.start_pfn);
-		break;
+		return remap_pfn_range_prepare(desc);
 	case MMAP_IO_REMAP_PFN:
-		io_remap_pfn_range_prepare(desc, action->remap.start_pfn,
-					   action->remap.size);
-		break;
+		return io_remap_pfn_range_prepare(desc);
 	}
+
+	WARN_ON_ONCE(1);
+	return -EINVAL;
 }
 EXPORT_SYMBOL(mmap_action_prepare);
 
 /**
  * mmap_action_complete - Execute VMA descriptor action.
- * @action: The action to perform.
  * @vma: The VMA to perform the action upon.
+ * @action: The action to perform.
  *
  * Similar to mmap_action_prepare().
  *
  * Return: 0 on success, or error, at which point the VMA will be unmapped.
  */
-int mmap_action_complete(struct mmap_action *action,
-			 struct vm_area_struct *vma)
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action)
+
 {
 	int err = 0;
 
@@ -1391,25 +1379,20 @@ int mmap_action_complete(struct mmap_action *action,
 	case MMAP_NOTHING:
 		break;
 	case MMAP_REMAP_PFN:
-		err = remap_pfn_range_complete(vma, action->remap.start,
-				action->remap.start_pfn, action->remap.size,
-				action->remap.pgprot);
+		err = remap_pfn_range_complete(vma, action);
 		break;
 	case MMAP_IO_REMAP_PFN:
-		err = io_remap_pfn_range_complete(vma, action->remap.start,
-				action->remap.start_pfn, action->remap.size,
-				action->remap.pgprot);
+		err = io_remap_pfn_range_complete(vma, action);
 		break;
 	}
 
-	return mmap_action_finish(action, vma, err);
+	return mmap_action_finish(vma, action, err);
 }
 EXPORT_SYMBOL(mmap_action_complete);
 #else
-void mmap_action_prepare(struct mmap_action *action,
-			struct vm_area_desc *desc)
+int mmap_action_prepare(struct vm_area_desc *desc)
 {
-	switch (action->type) {
+	switch (desc->action.type) {
 	case MMAP_NOTHING:
 		break;
 	case MMAP_REMAP_PFN:
@@ -1417,11 +1400,13 @@ void mmap_action_prepare(struct mmap_action *action,
 		WARN_ON_ONCE(1); /* nommu cannot handle these. */
 		break;
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL(mmap_action_prepare);
 
-int mmap_action_complete(struct mmap_action *action,
-			struct vm_area_struct *vma)
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action)
 {
 	int err = 0;
 
@@ -1436,7 +1421,7 @@ int mmap_action_complete(struct mmap_action *action,
 		break;
 	}
 
-	return mmap_action_finish(action, vma, err);
+	return mmap_action_finish(vma, action, err);
 }
 EXPORT_SYMBOL(mmap_action_complete);
 #endif
diff --git a/mm/vma.c b/mm/vma.c
index c1f183235756..2a86c7575000 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2640,15 +2640,18 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 	vma_set_page_prot(vma);
 }
 
-static void call_action_prepare(struct mmap_state *map,
-				struct vm_area_desc *desc)
+static int call_action_prepare(struct mmap_state *map,
+			       struct vm_area_desc *desc)
 {
-	struct mmap_action *action = &desc->action;
+	int err;
 
-	mmap_action_prepare(action, desc);
+	err = mmap_action_prepare(desc);
+	if (err)
+		return err;
 
-	if (action->hide_from_rmap_until_complete)
+	if (desc->action.hide_from_rmap_until_complete)
 		map->hold_file_rmap_lock = true;
+	return 0;
 }
 
 /*
@@ -2672,7 +2675,9 @@ static int call_mmap_prepare(struct mmap_state *map,
 	if (err)
 		return err;
 
-	call_action_prepare(map, desc);
+	err = call_action_prepare(map, desc);
+	if (err)
+		return err;
 
 	/* Update fields permitted to be changed. */
 	map->pgoff = desc->pgoff;
@@ -2727,13 +2732,12 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 }
 
 static int call_action_complete(struct mmap_state *map,
-				struct vm_area_desc *desc,
+				struct mmap_action *action,
 				struct vm_area_struct *vma)
 {
-	struct mmap_action *action = &desc->action;
 	int ret;
 
-	ret = mmap_action_complete(action, vma);
+	ret = mmap_action_complete(vma, action);
 
 	/* If we held the file rmap we need to release it. */
 	if (map->hold_file_rmap_lock) {
@@ -2795,7 +2799,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	__mmap_complete(&map, vma);
 
 	if (have_mmap_prepare && allocated_new) {
-		error = call_action_complete(&map, &desc, vma);
+		error = call_action_complete(&map, &desc.action, vma);
 
 		if (error)
 			return error;
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index 999357e18eb0..9eada1e0949c 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -1271,9 +1271,12 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
 	if (err)
 		return err;
 
-	mmap_action_prepare(&desc.action, &desc);
+	err = mmap_action_prepare(&desc);
+	if (err)
+		return err;
+
 	set_vma_from_desc(vma, &desc);
-	return mmap_action_complete(&desc.action, vma);
+	return mmap_action_complete(vma, &desc.action);
 }
 
 static inline int compat_vma_mmap(struct file *file,
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index 5afb0afe2d48..a30b8bc84955 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -81,13 +81,13 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
 {
 }
 
-static inline void mmap_action_prepare(struct mmap_action *action,
-					   struct vm_area_desc *desc)
+static inline int mmap_action_prepare(struct vm_area_desc *desc)
 {
+	return 0;
 }
 
-static inline int mmap_action_complete(struct mmap_action *action,
-					   struct vm_area_struct *vma)
+static inline int mmap_action_complete(struct vm_area_struct *vma,
+				       struct mmap_action *action)
 {
 	return 0;
 }
-- 
2.53.0


^ permalink raw reply related

* [PATCH v2 00/16] mm: expand mmap_prepare functionality and usage
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts

This series expands the mmap_prepare functionality, which is intended to
replace the deprecated f_op->mmap hook which has been the source of bugs
and security issues for some time.

This series starts with some cleanup of existing mmap_prepare logic, then
adds documentation for the mmap_prepare call to make it easier for
filesystem and driver writers to understand how it works.

It then importantly adds a vm_ops->mapped hook, a key feature that was
missing from mmap_prepare previously - this is invoked when a driver which
specifies mmap_prepare has successfully been mapped but not merged with
another VMA.

mmap_prepare is invoked prior to a merge being attempted, so you cannot
manipulate state such as reference counts as if it were a new mapping.

The vm_ops->mapped hook allows a driver to perform tasks required at this
stage, and provides symmetry against subsequent vm_ops->open,close calls.

The series uses this to correct the afs implementation which wrongly
manipulated reference count at mmap_prepare time.

It then adds an mmap_prepare equivalent of vm_iomap_memory() -
mmap_action_simple_ioremap(), then uses this to update a number of drivers.

It then splits out the mmap_prepare compatibility layer (which allows for
invocation of mmap_prepare hooks in an mmap() hook) in such a way as to
allow for more incremental implementation of mmap_prepare hooks.

It then uses this to extend mmap_prepare usage in drivers.

Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays
the foundation for future work which will extend mmap_prepare to DMA
coherent mappings.

v2:
* Rebased on
  https://lore.kernel.org/all/cover.1773665966.git.ljs@kernel.org/ to make
  Andrew's life easier :)
* Folded all interim fixes into series (thanks Randy for many doc fixes!))
* As per Suren, removed a comment about allocations too small to fail.
* As per Randy, fixed up typo in documentation for vm_area_desc.
* Fixed mmap_action_prepare() not returning if invalid action->type
  specified, as updated from Andrew's interim fix (thanks!) and also
  reported by kernel test bot.
* Updated mmap_action_prepare() and specific prepare functions to only
  pass vm_area_desc parameter as per Suren.
* Fixed up whitespace as per Suren.
* Updated vm_op->open comment in vm_operations_struct to reference forking
  as per Suren.
* Added a commit to check that input range is within VMA on remap as per
  Suren (this also covers I/O remap and all other cases already asserted).
* Updated AFS to not incorrectly reference count on mmap prepare as per
  Usama.
* Also updated various static AFS functions to be consistent with each
  other.
* Updated AFS commit message to reflect mmap_prepare being before any VMA
  merging as per Suren.
* Updated __compat_vma_mapped() to check for NULL vm_ops as per Usama.
* Updated __compat_vma_mapped() to not reference an unmapped VMA's fields
  as per Usama.
* Updated __vma_check_mmap_hook() to check for NULL vm_ops as per Usama.
* Dropped comment about preferring mmap_prepare as seems overly confusing,
  as per Suren.
* Updated the mmap lock assert in unmap_vma_locked() to a write lock assert
  as per Suren.
* Copied vm_ops->open comment over to VMA tests in appropriate patch as per
  Suren.
* Updated mmap_prepare documentation to reflect the fact that no resources
  should be allocated upon mmap_prepare.
* Updated mmap_prepare documentation to reference the vm_ops->mapped
  callback.
* Fixed stray markdown '## How to use' in documentation.
* Fixed bug reported by kernel test bot re: overlooked
  vma_desc_test_flags() -> vma_desc_test() in MTD driver for nommu.

v1:
https://lore.kernel.org/linux-mm/cover.1773346620.git.ljs@kernel.org/

Lorenzo Stoakes (Oracle) (16):
  mm: various small mmap_prepare cleanups
  mm: add documentation for the mmap_prepare file operation callback
  mm: document vm_operations_struct->open the same as close()
  mm: add vm_ops->mapped hook
  fs: afs: correctly drop reference count on mapping failure
  mm: add mmap_action_simple_ioremap()
  misc: open-dice: replace deprecated mmap hook with mmap_prepare
  hpet: replace deprecated mmap hook with mmap_prepare
  mtdchar: replace deprecated mmap hook with mmap_prepare, clean up
  stm: replace deprecated mmap hook with mmap_prepare
  staging: vme_user: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  mm: add mmap_action_map_kernel_pages[_full]()
  mm: on remap assert that input range within the proposed VMA

 Documentation/driver-api/vme.rst           |   2 +-
 Documentation/filesystems/index.rst        |   1 +
 Documentation/filesystems/mmap_prepare.rst | 168 ++++++++++++++++
 drivers/char/hpet.c                        |  12 +-
 drivers/hv/hyperv_vmbus.h                  |   4 +-
 drivers/hv/vmbus_drv.c                     |  27 ++-
 drivers/hwtracing/stm/core.c               |  31 ++-
 drivers/misc/open-dice.c                   |  19 +-
 drivers/mtd/mtdchar.c                      |  21 +-
 drivers/staging/vme_user/vme.c             |  20 +-
 drivers/staging/vme_user/vme.h             |   2 +-
 drivers/staging/vme_user/vme_user.c        |  51 +++--
 drivers/target/target_core_user.c          |  26 ++-
 drivers/uio/uio.c                          |  10 +-
 drivers/uio/uio_hv_generic.c               |  11 +-
 fs/afs/file.c                              |  36 +++-
 include/linux/fs.h                         |  14 +-
 include/linux/hyperv.h                     |   4 +-
 include/linux/mm.h                         | 158 +++++++++++++--
 include/linux/mm_types.h                   |  17 +-
 include/linux/uio_driver.h                 |   4 +-
 mm/internal.h                              |  37 ++--
 mm/memory.c                                | 177 ++++++++++++-----
 mm/util.c                                  | 219 +++++++++++++++------
 mm/vma.c                                   |  59 ++++--
 mm/vma.h                                   |   2 +-
 tools/testing/vma/include/dup.h            | 141 +++++++++----
 tools/testing/vma/include/stubs.h          |   8 +-
 28 files changed, 970 insertions(+), 311 deletions(-)
 create mode 100644 Documentation/filesystems/mmap_prepare.rst

--
2.53.0

^ permalink raw reply

* [PATCH v2] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 21:07 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Long Li, Rob Herring, Michael Kelley, linux-hyperv, linux-pci,
	linux-kernel

When hv_pci_assign_numa_node() processes a device that does not have
HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
virtual_numa_node, the device NUMA node is left unset. On x86_64,
the uninitialized default happens to be 0, but on ARM64 it is
NUMA_NO_NODE (-1).

Tests show that when no NUMA information is available from the Hyper-V
host, devices perform best when assigned to node 0. With NUMA_NO_NODE
the kernel may spread work across NUMA nodes, which degrades
performance on Hyper-V, particularly for high-throughput devices like
MANA.

Always set the device NUMA node to 0 before the conditional NUMA
affinity check, so that devices get a performant default when the host
provides no NUMA information, and behavior is consistent on both
x86_64 and ARM64.

Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
Signed-off-by: Long Li <longli@microsoft.com>
---
Changes in v2:
- Rewrite commit message to focus on performance as the primary
  motivation: NUMA_NO_NODE causes the kernel to spread work across
  NUMA nodes, degrading performance on Hyper-V

 drivers/pci/controller/pci-hyperv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 2c7a406b4ba8..38a790f642a1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2485,6 +2485,14 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
 		if (!hv_dev)
 			continue;

+		/*
+		 * If the Hyper-V host doesn't provide a NUMA node for the
+		 * device, default to node 0. With NUMA_NO_NODE the kernel
+		 * may spread work across NUMA nodes, which degrades
+		 * performance on Hyper-V.
+		 */
+		set_dev_node(&dev->dev, 0);
+
 		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
 		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
 			/*
-- 
2.43.0

^ permalink raw reply related

* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 20:53 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157BBDC4D3A535D55B7E31BD440A@SN6PR02MB4157.namprd02.prod.outlook.com>



> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Monday, March 16, 2026 12:16 PM
> To: Long Li <longli@microsoft.com>; Michael Kelley <mhklinux@outlook.com>;
> KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Lorenzo Pieralisi <lpieralisi@kernel.org>; Krzysztof
> Wilczyński <kwilczynski@kernel.org>; Manivannan Sadhasivam
> <mani@kernel.org>; Bjorn Helgaas <bhelgaas@google.com>
> Cc: Rob Herring <robh@kernel.org>; Michael Kelley <mikelley@microsoft.com>;
> linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for
> devices without affinity info
> 
> From: Long Li <longli@microsoft.com> Sent: Monday, March 16, 2026 10:38
> AM
> >
> > > Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0
> > > for devices without affinity info
> > >
> > > From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026
> > > 3:33 PM
> > > >
> > > > When a Hyper-V PCI device does not have
> > > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > > > virtual_numa_node, hv_pci_assign_numa_node() leaves the device
> > > > NUMA node unset. On x86_64, the default NUMA node happens to be 0,
> > > > but on
> > > > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior
> > > > across architectures.
> > > >
> > > > In Azure, when no NUMA information is available from the host,
> > > > devices perform best when assigned to node 0. Set the device NUMA
> > > > node to 0 unconditionally before the conditional NUMA affinity
> > > > check, so that devices always get a valid default and behavior is
> > > > consistent on both
> > > > x86_64 and ARM64.
> > >
> > > I'm wondering if this is the right overall approach to the inconsistency.
> > > Arguably, the arm64 value of NUMA_NO_NODE is more correct when the
> > > Hyper- V host has not provided any NUMA information to the guest.
> > > Maybe the x86/x64 side should be changed to default to NUMA_NO_NODE
> > > when there's no NUMA information provided.
> >
> > Tests have shown when Azure doesn't provide NUMA information for a PCI
> > device, workloads runs best when the node defaults to 0. NUMA_NO_NODE
> > results in performance degradation on ARM64. This affects most
> > high-performance devices like MANA when tested to line limit.
> >
> > >
> > > The observed x86/x64 default of NUMA node 0 does not come from
> > > x86/x64 architecture specific PCI code. It's a Hyper-V specific
> > > behavior due to how
> > > hv_pci_probe() allocates the struct hv_pcibus_device, with its
> > > embedded struct pci_sysdata. That struct pci_sysdata has a "node"
> > > field that the x86/x64
> > > __pcibus_to_node() function accesses when called from pci_device_add().
> > > If hv_pci_probe() were to initialize that "node" field to
> > > NUMA_NO_NODE at the same time that it sets the "domain" field,
> > > x86/x64 guests on Hyper-V would see the PCI device NUMA node default
> > > to NUMA_NO_NODE like on arm64. The current behavior of letting the
> > > sysdata "node" field stay zero as allocated might just be an historical
> oversight that no one noticed.
> >
> > I agree this was an oversight in the original X64 code, in that it
> > sets to numa node 0 by chance. But it turns out to be the ideal node
> > configuration for Azure when affinity information is not available
> > through the vPCI. (i.e. non isolated VM sizes). This results in
> > X64 perform better than ARM64 on multiple NUMA non-isolated VM sizes.
> >
> > >
> > > Are there any observed problems on arm64 with the default being
> > > NUMA_NO_NODE? If there are such problems, they should be fixed
> > > separately since that case needs to work for a kernel built with
> CONFIG_NUMA=n.
> > > pcibus_to_node() will return NUMA_NO_NODE, making the default on
> > > x86/x64 be NUMA_NO_NODE as well.
> > >
> > > I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(),
> > > and didn't see any obviously problems in an x86/x64 Azure VM with a
> > > MANA VF and multiple NVMe pass-thru devices. The NUMA node reported
> > > in /sys for these PCI devices is indeed NUMA_NO_NODE.
> > > But maybe there's some other issue that I'm not aware of.
> >
> > Extensive tests have shown defaulting NUMA node to 0 preserved the
> > existing behavior on X64, while improving performance on ARM64,
> > especially for MANA. This has been confirmed by the Hyper-V team, and
> Windows VM uses the same values for defaults.
> 
> Ah, OK.  That makes sense.  I'd suggest doing a new version of the patch with
> the commit message and the code comment describing performance as the
> main reason for the patch.  You somewhat said that in your current commit
> message, but it got muddled with the compatibility discussion, and the code
> comment just mentions compatibility. Compatibility between x86/x64 and
> arm64 isn't really the issue. The idea is that hv_pci_assign_numa_node() should
> always set the NUMA node to something, rather than depending on the default,
> which might be NUMA_NO_NODE. If the Hyper-V host provides a NUMA node,
> use that. But if not, use node 0 because that is usually where the underlying
> hardware actually has the physical device attached. Node 0 might not be right
> in certain situations, but if Hyper-V doesn't provide more information to the
> guest, guessing node 0 is better than letting the Linux kernel do something like
> load balancing across NUMA nodes, which could happen with
> NUMA_NO_NODE.  (At least, that's what I think happens!)
> 
> Michael

I'm adding the performance part to the commit message in v2. The compatibility part is still valid in that we want the consistent kernel behavior on X64 and on ARM.

Long

> 
> >
> > Thanks,
> >
> > Long
> >
> > >
> > > Michael
> > >
> > > >
> > > > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and
> > > > support PCI_BUS_RELATIONS2")
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > >  drivers/pci/controller/pci-hyperv.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > b/drivers/pci/controller/pci-hyperv.c
> > > > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct
> hv_pcibus_device *hbus)
> > > >  		if (!hv_dev)
> > > >  			continue;
> > > >
> > > > +		/* Default to node 0 for consistent behavior across
> architectures */
> > > > +		set_dev_node(&dev->dev, 0);
> > > > +
> > > >  		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> > > >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> > > >  			/*
> > > > --
> > > > 2.43.0
> > > >
> >


^ permalink raw reply

* RE: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Long Li @ 2026-03-16 20:50 UTC (permalink / raw)
  To: Leon Romanovsky, Erni Sri Satya Vennela
  Cc: Konstantin Taranov, Jason Gunthorpe, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260316194929.GI61385@unreal>

> On Thu, Mar 12, 2026 at 11:16:41AM -0700, Erni Sri Satya Vennela wrote:
> > As part of MANA hardening for CVM, clamp hardware-reported adapter
> > capability values from the MANA_IB_GET_ADAPTER_CAP response before
> > they are used by the IB subsystem.
> >
> > The response fields (max_qp_count, max_cq_count, max_mr_count,
> > max_pd_count, max_inbound_read_limit, max_outbound_read_limit,
> > max_qp_wr, max_send_sge_count, max_recv_sge_count) are u32 but are
> > assigned to signed int members in struct ib_device_attr. If hardware
> > returns a value exceeding INT_MAX, the implicit u32-to-int conversion
> > produces a negative value, which can cause incorrect behavior in the
> > IB core and userspace applications.
> 
> This sentence does not make sense in the context of the Linux kernel.
> The fundamental assumption is that the underlying hardware behaves correctly,
> and driver code should not attempt to guard against purely hypothetical
> failures. The kernel only implements such self‑protection when there is a
> documented hardware issue accompanied by official errata.
> 
> Thanks

The idea is that a malicious hardware can't corrupt and steal other data from the kernel.

The assumption is that in a public cloud environment, you can't trust the hardware 100%.

^ permalink raw reply

* Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Leon Romanovsky @ 2026-03-16 20:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Haiyang Zhang,
	K . Y . Srinivasan, Wei Liu, Dexuan Cui, Simon Horman, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260313165928.GH1704121@ziepe.ca>

On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > When the MANA hardware undergoes a service reset, the ETH auxiliary device
> > > (mana.eth) used by DPDK persists across the reset cycle — it is not removed
> > > and re-added like RC/UD/GSI QPs. This means userspace RDMA consumers such
> > > as DPDK have no way of knowing that firmware handles for their PD, CQ, WQ,
> > > QP and MR resources have become stale.
> > 
> > NAK to any of this.
> > 
> > In case of hardware reset, mana_ib AUX device needs to be destroyed and
> > recreated later.
> 
> Yeah, that is our general model for any serious RAS event where the
> driver's view of resources becomes out of sync with the HW.
> 
> You have tear down the ib_device by removing the aux and then bring
> back a new one.
> 
> There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> tell userspace to close and re-open their uverbs FD.
> 
> We don't have a model where a uverbs FD in userspace can continue to
> work after the device has a catasrophic RAS event.
> 
> There may be room to have a model where the ib device doesn't fully
> unplug/replug so it retains its name and things, but that is core code
> not driver stuff.

Good luck with that model. It is going to break RDMA-CM hotplug support.

Thanks

> 
> Jason
> 

^ permalink raw reply

* Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Leon Romanovsky @ 2026-03-16 19:49 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: longli, kotaranov, Jason Gunthorpe, linux-rdma, linux-hyperv,
	linux-kernel
In-Reply-To: <20260312181642.989735-1-ernis@linux.microsoft.com>

On Thu, Mar 12, 2026 at 11:16:41AM -0700, Erni Sri Satya Vennela wrote:
> As part of MANA hardening for CVM, clamp hardware-reported adapter
> capability values from the MANA_IB_GET_ADAPTER_CAP response before
> they are used by the IB subsystem.
> 
> The response fields (max_qp_count, max_cq_count, max_mr_count,
> max_pd_count, max_inbound_read_limit, max_outbound_read_limit,
> max_qp_wr, max_send_sge_count, max_recv_sge_count) are u32 but are
> assigned to signed int members in struct ib_device_attr. If hardware
> returns a value exceeding INT_MAX, the implicit u32-to-int conversion
> produces a negative value, which can cause incorrect behavior in the
> IB core and userspace applications.

This sentence does not make sense in the context of the Linux kernel.  
The fundamental assumption is that the underlying hardware behaves  
correctly, and driver code should not attempt to guard against purely  
hypothetical failures. The kernel only implements such self‑protection  
when there is a documented hardware issue accompanied by official  
errata.

Thanks

^ permalink raw reply

* Re: [PATCH 02/15] mm: add documentation for the mmap_prepare file operation callback
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 19:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpGd702=Xop3X5Aop9rrScdiAOQEEooTu1gcJqR9pmO5GA@mail.gmail.com>

On Sun, Mar 15, 2026 at 04:23:14PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > This documentation makes it easier for a driver/file system implementer to
> > correctly use this callback.
> >
> > It covers the fundamentals, whilst intentionally leaving the less lovely
> > possible actions one might take undocumented (for instance - the
> > success_hook, error_hook fields in mmap_action).
> >
> > The document also covers the new VMA flags implementation which is the only
> > one which will work correctly with mmap_prepare.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst | 131 +++++++++++++++++++++
> >  1 file changed, 131 insertions(+)
> >  create mode 100644 Documentation/filesystems/mmap_prepare.rst
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > new file mode 100644
> > index 000000000000..76908200f3a1
> > --- /dev/null
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -0,0 +1,131 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===========================
> > +mmap_prepare callback HOWTO
> > +===========================
> > +
> > +Introduction
> > +############
> > +
> > +The `struct file->f_op->mmap()` callback has been deprecated as it is both a
> > +stability and security risk, and doesn't always permit the merging of adjacent
> > +mappings resulting in unnecessary memory fragmentation.
> > +
> > +It has been replaced with the `file->f_op->mmap_prepare()` callback which solves
> > +these problems.
> > +
> > +## How To Use
> > +
> > +In your driver's `struct file_operations` struct, specify an `mmap_prepare`
> > +callback rather than an `mmap` one, e.g. for ext4:
> > +
> > +
> > +.. code-block:: C
> > +
> > +    const struct file_operations ext4_file_operations = {
> > +        ...
> > +        .mmap_prepare    = ext4_file_mmap_prepare,
> > +    };
> > +
> > +This has a signature of `int (*mmap_prepare)(struct vm_area_desc *)`.
> > +
> > +Examining the `struct vm_area_desc` type:
> > +
> > +.. code-block:: C
> > +
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
> > +
> > +        /* Mutable fields. Populated with initial state. */
> > +        pgoff_t pgoff;
> > +        struct file *vm_file;
> > +        vma_flags_t vma_flags;
> > +        pgprot_t page_prot;
> > +
> > +        /* Write-only fields. */
> > +        const struct vm_operations_struct *vm_ops;
> > +        void *private_data;
> > +
> > +        /* Take further action? */
> > +        struct mmap_action action;
>
> So, action still belongs to /* Write-only fields. */ section? This is
> nitpicky, but it might be better to have this as:
>
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
>         struct mmap_action action; /* Take further action? */

Absolutely not. This field is not to be written to by the user.

We sadly have to allow hugetlb to do some hacks, but these are things we don't
want to point out.

Users should use mmap_action_xxx() functions.

>
> > +    };
> > +
> > +This is straightforward - you have all the fields you need to set up the
> > +mapping, and you can update the mutable and writable fields, for instance:
> > +
> > +.. code-block:: Cw
> > +
> > +    static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
> > +    {
> > +        int ret;
> > +        struct file *file = desc->file;
> > +        struct inode *inode = file->f_mapping->host;
> > +
> > +        ...
> > +
> > +        file_accessed(file);
> > +        if (IS_DAX(file_inode(file))) {
> > +            desc->vm_ops = &ext4_dax_vm_ops;
> > +            vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT);
> > +        } else {
> > +            desc->vm_ops = &ext4_file_vm_ops;
> > +        }
> > +        return 0;
> > +    }
> > +
> > +Importantly, you no longer have to dance around with reference counts or locks
> > +when updating these fields - __you can simply go ahead and change them__.
> > +
> > +Everything is taken care of by the mapping code.
> > +
> > +VMA Flags
> > +=========
> > +
> > +Along with `mmap_prepare`, VMA flags have undergone an overhaul. Where before
> > +you would invoke one of `vm_flags_init()`, `vm_flags_reset()`, `vm_flags_set()`,
> > +`vm_flags_clear()`, and `vm_flags_mod()` to modify flags (and to have the
> > +locking done correctly for you, this is no longer necessary.
> > +
> > +Also, the legacy approach of specifying VMA flags via `VM_READ`, `VM_WRITE`,
> > +etc. - i.e. using a `VM_xxx` macro has changed too.
> > +
> > +When implementing `mmap_prepare()`, reference flags by their bit number, defined
> > +as a `VMA_xxx_BIT` macro, e.g. `VMA_READ_BIT`, `VMA_WRITE_BIT` etc., and use one
> > +of (where `desc` is a pointer to `struct vma_area_desc`):
> > +
> > +* `vma_desc_test_flags(desc, ...)` - Specify a comma-separated list of flags you
> > +  wish to test for (whether _any_ are set), e.g. - `vma_desc_test_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)` - returns `true` if either are set,
> > +  otherwise `false`.
> > +* `vma_desc_set_flags(desc, ...)` - Update the VMA descriptor flags to set
> > +  additional flags specified by a comma-separated list,
> > +  e.g. - `vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)`.
> > +* `vma_desc_clear_flags(desc, ...)` - Update the VMA descriptor flags to clear
> > +  flags specified by a comma-separated list, e.g. - `vma_desc_clear_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`.
> > +
> > +Actions
> > +=======
> > +
> > +You can now very easily have actions be performed upon a mapping once set up by
> > +utilising simple helper functions invoked upon the `struct vm_area_desc`
> > +pointer. These are:
> > +
> > +* `mmap_action_remap()` - Remaps a range consisting only of PFNs for a specific
> > +  range starting a virtual address and PFN number of a set size.
> > +
> > +* `mmap_action_remap_full()` - Same as `mmap_action_remap()`, only remaps the
> > +  entire mapping from `start_pfn` onward.
> > +
> > +* `mmap_action_ioremap()` - Same as `mmap_action_remap()`, only performs an I/O
> > +  remap.
> > +
> > +* `mmap_action_ioremap_full()` - Same as `mmap_action_ioremap()`, only remaps
> > +  the entire mapping from `start_pfn` onward.
> > +
> > +**NOTE:** The 'action' field should never normally be manipulated directly,
> > +rather you ought to use one of these helpers.
>
> I'm guessing the start and size parameters passed to
> mmap_action_remap() and such are restricted by vm_area_desc.start
> vm_area_desc.end. If so, should we document those restrictions and
> enforce them in the code?

I mean it's the same restrictions as all of the functions already apply if you
were to use them with a VMA descriptor.

I think implicitly a remap will fail if you try it out of the VMA range at the
point of applying the change.

But it might be worth adding range_in_vma_desc() checks at prepare time, will
see if I can do that for the respin.

I think it's pretty obvious that you shouldn't be trying to remap totally
unrelated memory, so I'm not sure that's at a level of granularity that's suited
to this document though.

>
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
>
>
> > --
> > 2.53.0
> >

^ permalink raw reply

* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Michael Kelley @ 2026-03-16 19:16 UTC (permalink / raw)
  To: Long Li, Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66837DDAF5F203E832DA5339CE40A@SA1PR21MB6683.namprd21.prod.outlook.com>

From: Long Li <longli@microsoft.com> Sent: Monday, March 16, 2026 10:38 AM
> 
> > Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices
> > without affinity info
> >
> > From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> > >
> > > When a Hyper-V PCI device does not have
> > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > > virtual_numa_node, hv_pci_assign_numa_node() leaves the device NUMA
> > > node unset. On x86_64, the default NUMA node happens to be 0, but on
> > > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior across
> > > architectures.
> > >
> > > In Azure, when no NUMA information is available from the host, devices
> > > perform best when assigned to node 0. Set the device NUMA node to 0
> > > unconditionally before the conditional NUMA affinity check, so that
> > > devices always get a valid default and behavior is consistent on both
> > > x86_64 and ARM64.
> >
> > I'm wondering if this is the right overall approach to the inconsistency.
> > Arguably, the arm64 value of NUMA_NO_NODE is more correct when the Hyper-
> > V host has not provided any NUMA information to the guest. Maybe the x86/x64
> > side should be changed to default to NUMA_NO_NODE when there's no NUMA
> > information provided.
> 
> Tests have shown when Azure doesn't provide NUMA information for a PCI device,
> workloads runs best when the node defaults to 0. NUMA_NO_NODE results in
> performance degradation on ARM64. This affects most high-performance devices like
> MANA when tested to line limit.
> 
> >
> > The observed x86/x64 default of NUMA node 0 does not come from x86/x64
> > architecture specific PCI code. It's a Hyper-V specific behavior due to how
> > hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded struct
> > pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
> > __pcibus_to_node() function accesses when called from pci_device_add().
> > If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at the
> > same time that it sets the "domain" field, x86/x64 guests on Hyper-V would see
> > the PCI device NUMA node default to NUMA_NO_NODE like on arm64. The
> > current behavior of letting the sysdata "node" field stay zero as allocated might
> > just be an historical oversight that no one noticed.
> 
> I agree this was an oversight in the original X64 code, in that it sets to numa node 0 by
> chance. But it turns out to be the ideal node configuration for Azure when affinity
> information is not available through the vPCI. (i.e. non isolated VM sizes). This results in
> X64 perform better than ARM64 on multiple NUMA non-isolated VM sizes.
> 
> >
> > Are there any observed problems on arm64 with the default being
> > NUMA_NO_NODE? If there are such problems, they should be fixed separately
> > since that case needs to work for a kernel built with CONFIG_NUMA=n.
> > pcibus_to_node() will return NUMA_NO_NODE, making the default on x86/x64
> > be NUMA_NO_NODE as well.
> >
> > I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(), and
> > didn't see any obviously problems in an x86/x64 Azure VM with a MANA VF and
> > multiple NVMe pass-thru devices. The NUMA node reported in /sys for these PCI
> > devices is indeed NUMA_NO_NODE.
> > But maybe there's some other issue that I'm not aware of.
> 
> Extensive tests have shown defaulting NUMA node to 0 preserved the existing behavior
> on X64, while improving performance on ARM64, especially for MANA. This has been
> confirmed by the Hyper-V team, and Windows VM uses the same values for defaults.

Ah, OK.  That makes sense.  I'd suggest doing a new version of the patch with
the commit message and the code comment describing performance as the
main reason for the patch.  You somewhat said that in your current commit
message, but it got muddled with the compatibility discussion, and the code
comment just mentions compatibility. Compatibility between x86/x64 and
arm64 isn't really the issue. The idea is that hv_pci_assign_numa_node() should
always set the NUMA node to something, rather than depending on the default,
which might be NUMA_NO_NODE. If the Hyper-V host provides a NUMA node,
use that. But if not, use node 0 because that is usually where the underlying
hardware actually has the physical device attached. Node 0 might not be
right in certain situations, but if Hyper-V doesn't provide more information
to the guest, guessing node 0 is better than letting the Linux kernel do
something like load balancing across NUMA nodes, which could happen
with NUMA_NO_NODE.  (At least, that's what I think happens!)

Michael

> 
> Thanks,
> 
> Long
> 
> >
> > Michael
> >
> > >
> > > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
> > > Signed-off-by: Long Li <longli@microsoft.com>
> > > ---
> > >  drivers/pci/controller/pci-hyperv.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > b/drivers/pci/controller/pci-hyperv.c
> > > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > >  		if (!hv_dev)
> > >  			continue;
> > >
> > > +		/* Default to node 0 for consistent behavior across architectures */
> > > +		set_dev_node(&dev->dev, 0);
> > > +
> > >  		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> > >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> > >  			/*
> > > --
> > > 2.43.0
> > >
> 


^ permalink raw reply

* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 17:38 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415748A42DCBDD8AB635838DD440A@SN6PR02MB4157.namprd02.prod.outlook.com>

> Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices
> without affinity info
> 
> From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> >
> > When a Hyper-V PCI device does not have
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > virtual_numa_node, hv_pci_assign_numa_node() leaves the device NUMA
> > node unset. On x86_64, the default NUMA node happens to be 0, but on
> > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior across
> > architectures.
> >
> > In Azure, when no NUMA information is available from the host, devices
> > perform best when assigned to node 0. Set the device NUMA node to 0
> > unconditionally before the conditional NUMA affinity check, so that
> > devices always get a valid default and behavior is consistent on both
> > x86_64 and ARM64.
> 
> I'm wondering if this is the right overall approach to the inconsistency.
> Arguably, the arm64 value of NUMA_NO_NODE is more correct when the Hyper-
> V host has not provided any NUMA information to the guest. Maybe the x86/x64
> side should be changed to default to NUMA_NO_NODE when there's no NUMA
> information provided.

Tests have shown when Azure doesn't provide NUMA information for a PCI device, workloads runs best when the node defaults to 0. NUMA_NO_NODE results in performance degradation on ARM64. This affects most high-performance devices like MANA when tested to line limit.

> 
> The observed x86/x64 default of NUMA node 0 does not come from x86/x64
> architecture specific PCI code. It's a Hyper-V specific behavior due to how
> hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded struct
> pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
> __pcibus_to_node() function accesses when called from pci_device_add().
> If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at the
> same time that it sets the "domain" field, x86/x64 guests on Hyper-V would see
> the PCI device NUMA node default to NUMA_NO_NODE like on arm64. The
> current behavior of letting the sysdata "node" field stay zero as allocated might
> just be an historical oversight that no one noticed.

I agree this was an oversight in the original X64 code, in that it sets to numa node 0 by chance. But it turns out to be the ideal node configuration for Azure when affinity information is not available through the vPCI. (i.e. non isolated VM sizes). This results in X64 perform better than ARM64 on multiple NUMA non-isolated VM sizes.

> 
> Are there any observed problems on arm64 with the default being
> NUMA_NO_NODE? If there are such problems, they should be fixed separately
> since that case needs to work for a kernel built with CONFIG_NUMA=n.
> pcibus_to_node() will return NUMA_NO_NODE, making the default on x86/x64
> be NUMA_NO_NODE as well.
> 
> I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(), and
> didn't see any obviously problems in an x86/x64 Azure VM with a MANA VF and
> multiple NVMe pass-thru devices. The NUMA node reported in /sys for these PCI
> devices is indeed NUMA_NO_NODE.
> But maybe there's some other issue that I'm not aware of.

Extensive tests have shown defaulting NUMA node to 0 preserved the existing behavior on X64, while improving performance on ARM64, especially for MANA. This has been confirmed by the Hyper-V team, and Windows VM uses the same values for defaults.

Thanks,

Long

> 
> Michael
> 
> >
> > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and
> > support PCI_BUS_RELATIONS2")
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct
> hv_pcibus_device *hbus)
> >  		if (!hv_dev)
> >  			continue;
> >
> > +		/* Default to node 0 for consistent behavior across architectures
> */
> > +		set_dev_node(&dev->dev, 0);
> > +
> >  		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> >  			/*
> > --
> > 2.43.0
> >


^ permalink raw reply

* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Michael Kelley @ 2026-03-16 17:11 UTC (permalink / raw)
  To: Long Li, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260312223244.1006305-1-longli@microsoft.com>

From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> 
> When a Hyper-V PCI device does not have
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> virtual_numa_node, hv_pci_assign_numa_node() leaves the device
> NUMA node unset. On x86_64, the default NUMA node happens to be
> 0, but on ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent
> behavior across architectures.
> 
> In Azure, when no NUMA information is available from the host,
> devices perform best when assigned to node 0. Set the device NUMA
> node to 0 unconditionally before the conditional NUMA affinity
> check, so that devices always get a valid default and behavior is
> consistent on both x86_64 and ARM64.

I'm wondering if this is the right overall approach to the inconsistency.
Arguably, the arm64 value of NUMA_NO_NODE is more correct when the
Hyper-V host has not provided any NUMA information to the guest. Maybe
the x86/x64 side should be changed to default to NUMA_NO_NODE when
there's no NUMA information provided.

The observed x86/x64 default of NUMA node 0 does not come from x86/x64
architecture specific PCI code. It's a Hyper-V specific behavior due to how
hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded
struct pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
__pcibus_to_node() function accesses when called from pci_device_add().
If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at
the same time that it sets the "domain" field, x86/x64 guests on Hyper-V
would see the PCI device NUMA node default to NUMA_NO_NODE like on
arm64. The current behavior of letting the sysdata "node" field stay zero
as allocated might just be an historical oversight that no one noticed.

Are there any observed problems on arm64 with the default being
NUMA_NO_NODE? If there are such problems, they should be fixed
separately since that case needs to work for a kernel built with
CONFIG_NUMA=n. pcibus_to_node() will return NUMA_NO_NODE,
making the default on x86/x64 be NUMA_NO_NODE as well.

I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(),
and didn't see any obviously problems in an x86/x64 Azure VM with a
MANA VF and multiple NVMe pass-thru devices. The NUMA node
reported in /sys for these PCI devices is indeed NUMA_NO_NODE.
But maybe there's some other issue that I'm not aware of.

Michael

> 
> Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c7a406b4ba8..5c03b6e4cdab 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
>  		if (!hv_dev)
>  			continue;
> 
> +		/* Default to node 0 for consistent behavior across architectures */
> +		set_dev_node(&dev->dev, 0);
> +
>  		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
>  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
>  			/*
> --
> 2.43.0
> 

^ permalink raw reply

* Re: [PATCH 15/15] mm: add mmap_action_map_kernel_pages[_full]()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:54 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <4fd15134-ae1e-4233-8d5a-9d1e0b9f94dc@infradead.org>

On Thu, Mar 12, 2026 at 04:15:26PM -0700, Randy Dunlap wrote:
>
> On 3/12/26 1:27 PM, Lorenzo Stoakes (Oracle) wrote:
>
> > Finally, we update the VMA tests accordingly to reflect the changes.
>
> IMO we could omit the word "we" 5 times above.
> (but no change is required)
>
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 88f42faeb377..88ad5649c02d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
>
> > +/**
> > + * range_is_subset - Is the specified inner range a subset of the outer range?
> > + * @outer_start: The start of the outer range.
> > + * @outer_end: The exclusive end of the outer range.
> > + * @inner_start: The start of the inner range.
> > + * @inner_end: The exclusive end of the inner range.
> > + *
> > + * Returns %true if [inner_start, inner_end) is a subset of [outer_start,
>
>     * Returns:
> (for kernel-doc)

Ack

>
> > + * outer_end), otherwise %false.
> > + */
> > +static inline bool range_is_subset(unsigned long outer_start,
> > +				   unsigned long outer_end,
> > +				   unsigned long inner_start,
> > +				   unsigned long inner_end)
> > +{
> > +	return outer_start <= inner_start && inner_end <= outer_end;
> > +}
> > +
> > +/**
> > + * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
> > + * @vma: The VMA against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns %true if [@start, @end) is a subset of [@vma->vm_start,
>
>     * Returns:

Ack

>
> > + * @vma->vm_end), %false otherwise.
> > + */
> >  static inline bool range_in_vma(const struct vm_area_struct *vma,
> >  				unsigned long start, unsigned long end)
> >  {
> > -	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> > +	if (!vma)
> > +		return false;
> > +
> > +	return range_is_subset(vma->vm_start, vma->vm_end, start, end);
> > +}
> > +
> > +/**
> > + * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
> > + * described by @desc, a VMA descriptor?
> > + * @desc: The VMA descriptor against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns %true if [@start, @end) is a subset of [@desc->start, @desc->end),
>
>     * Returns:

Ack, I think in general I've seen (or believe I've seen :) other cases without
the colon, so was kinda imitating, but I may also be imagining that ;)

>
> > + * %false otherwise.
> > + */
> > +static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
> > +				     unsigned long start, unsigned long end)
> > +{
> > +	if (!desc)
> > +		return false;
> > +
> > +	return range_is_subset(desc->start, desc->end, start, end);
> >  }
>
> --
> ~Randy
>

Will also fold these changes into the respin!

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 02/15] mm: add documentation for the mmap_prepare file operation callback
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:51 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <f0e33b51-d465-462d-b0f6-98a1db66bb15@infradead.org>

On Thu, Mar 12, 2026 at 05:12:04PM -0700, Randy Dunlap wrote:
> (Andrew: patch attached)
>
>
> On 3/12/26 1:27 PM, Lorenzo Stoakes (Oracle) wrote:
>
> Documentation/filesystems/mmap_prepare.rst: WARNING: document isn't included in any toctree [toc.not_included]
>
> Should be in some index.rst file. In filesystems I suppose.

Ack thanks.

>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst | 131 +++++++++++++++++++++
> >  1 file changed, 131 insertions(+)
> >  create mode 100644 Documentation/filesystems/mmap_prepare.rst
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > new file mode 100644
> > index 000000000000..76908200f3a1
> > --- /dev/null
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -0,0 +1,131 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===========================
> > +mmap_prepare callback HOWTO
> > +===========================
> > +
> > +Introduction
> > +############
>
> Kernel style is "=============" above instead of "############".

Ack

>
> > +
> > +The `struct file->f_op->mmap()` callback has been deprecated as it is both a
> > +stability and security risk, and doesn't always permit the merging of adjacent
> > +mappings resulting in unnecessary memory fragmentation.
> > +
> > +It has been replaced with the `file->f_op->mmap_prepare()` callback which solves
> > +these problems.
> > +
> > +## How To Use
> > +
> > +In your driver's `struct file_operations` struct, specify an `mmap_prepare`
> > +callback rather than an `mmap` one, e.g. for ext4:
> > +
> > +
> > +.. code-block:: C
> > +
> > +    const struct file_operations ext4_file_operations = {
> > +        ...
> > +        .mmap_prepare    = ext4_file_mmap_prepare,
> > +    };
> > +
> > +This has a signature of `int (*mmap_prepare)(struct vm_area_desc *)`.
> > +
> > +Examining the `struct vm_area_desc` type:
> > +
> > +.. code-block:: C
> > +
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
> > +
> > +        /* Mutable fields. Populated with initial state. */
> > +        pgoff_t pgoff;
> > +        struct file *vm_file;
> > +        vma_flags_t vma_flags;
> > +        pgprot_t page_prot;
> > +
> > +        /* Write-only fields. */
> > +        const struct vm_operations_struct *vm_ops;
> > +        void *private_data;
> > +
> > +        /* Take further action? */
> > +        struct mmap_action action;
> > +    };
> > +
> > +This is straightforward - you have all the fields you need to set up the
> > +mapping, and you can update the mutable and writable fields, for instance:
> > +
> > +.. code-block:: Cw
>
>    .. code-block:: C
>
> Documentation/filesystems/mmap_prepare.rst:60: WARNING: Pygments lexer name 'Cw' is not known [misc.highlighting_failure]
>
> Maybe a typo?

Yeah is a typo thanks!

>
> > +
> > +    static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
> > +    {
> > +        int ret;
> > +        struct file *file = desc->file;
> > +        struct inode *inode = file->f_mapping->host;
> > +
> > +        ...
> > +
> > +        file_accessed(file);
> > +        if (IS_DAX(file_inode(file))) {
> > +            desc->vm_ops = &ext4_dax_vm_ops;
> > +            vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT);
> > +        } else {
> > +            desc->vm_ops = &ext4_file_vm_ops;
> > +        }
> > +        return 0;
> > +    }
> > +
> > +Importantly, you no longer have to dance around with reference counts or locks
> > +when updating these fields - __you can simply go ahead and change them__.
> > +
> > +Everything is taken care of by the mapping code.
> > +
> > +VMA Flags
> > +=========
>
> and then use "---------------" here instead of "==============".

Ack

>
> (from Documentation/doc-guide/sphinx.rst)
>
> > +
> > +Along with `mmap_prepare`, VMA flags have undergone an overhaul. Where before
> > +you would invoke one of `vm_flags_init()`, `vm_flags_reset()`, `vm_flags_set()`,
> > +`vm_flags_clear()`, and `vm_flags_mod()` to modify flags (and to have the
> > +locking done correctly for you, this is no longer necessary.
> > +
> > +Also, the legacy approach of specifying VMA flags via `VM_READ`, `VM_WRITE`,
> > +etc. - i.e. using a `VM_xxx` macro has changed too.
> > +
> > +When implementing `mmap_prepare()`, reference flags by their bit number, defined
> > +as a `VMA_xxx_BIT` macro, e.g. `VMA_READ_BIT`, `VMA_WRITE_BIT` etc., and use one
> > +of (where `desc` is a pointer to `struct vma_area_desc`):
> > +
> > +* `vma_desc_test_flags(desc, ...)` - Specify a comma-separated list of flags you
> > +  wish to test for (whether _any_ are set), e.g. - `vma_desc_test_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)` - returns `true` if either are set,
> > +  otherwise `false`.
> > +* `vma_desc_set_flags(desc, ...)` - Update the VMA descriptor flags to set
> > +  additional flags specified by a comma-separated list,
> > +  e.g. - `vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)`.
> > +* `vma_desc_clear_flags(desc, ...)` - Update the VMA descriptor flags to clear
> > +  flags specified by a comma-separated list, e.g. - `vma_desc_clear_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`.
> > +
> > +Actions
> > +=======
> > +
> > +You can now very easily have actions be performed upon a mapping once set up by
> > +utilising simple helper functions invoked upon the `struct vm_area_desc`
> > +pointer. These are:
> > +
> > +* `mmap_action_remap()` - Remaps a range consisting only of PFNs for a specific
> > +  range starting a virtual address and PFN number of a set size.
> > +
> > +* `mmap_action_remap_full()` - Same as `mmap_action_remap()`, only remaps the
> > +  entire mapping from `start_pfn` onward.
> > +
> > +* `mmap_action_ioremap()` - Same as `mmap_action_remap()`, only performs an I/O
> > +  remap.
> > +
> > +* `mmap_action_ioremap_full()` - Same as `mmap_action_ioremap()`, only remaps
> > +  the entire mapping from `start_pfn` onward.
> > +
> > +**NOTE:** The 'action' field should never normally be manipulated directly,
> > +rather you ought to use one of these helpers.
>
> I also see this warning, but I don't know what it is referring to:
>
> Documentation/filesystems/mmap_prepare.rst:132: ERROR: Anonymous hyperlink mismatch: 1 references but 0 targets.
> See "backrefs" attribute for IDs. [docutils]
>
> (OK, I found/fixed that also.)
>
> There are also lots of single ` marks which mean italics. I thought those were
> not what was intended, so I changed (most of) them to `` marks, which means
> "code block / monospace". I can fix those if needed.
>
> from the patch file:
> @Lorenzo: ISTR that you prefer explicit quoting on structs and
> functions. I didn't do that here since kernel automarkup does that,
> but if you prefer, I can redo the patch with those changes.

The issue was in another document it didn't seem to properly recognise the types
AFAICT (but I might have been mistaken anyway!) But I'm fine without.

>
> HTH.
> --
> ~Randy

Thanks for this, will fold the patch into the respin also!

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 01/15] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpHcjFU1r7ixiJM4b_a5HTesxBmW6DiCreaWpJ8DLM5haQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 04:06:48PM -0700, Suren Baghdasaryan wrote:
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -4116,10 +4116,10 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
> > >         mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
> > >  }
> > >
> > > -void mmap_action_prepare(struct mmap_action *action,
> > > -                        struct vm_area_desc *desc);
> > > -int mmap_action_complete(struct mmap_action *action,
> > > -                        struct vm_area_struct *vma);
> > > +int mmap_action_prepare(struct vm_area_desc *desc,
> > > +                       struct mmap_action *action);
> > > +int mmap_action_complete(struct vm_area_struct *vma,
> > > +                        struct mmap_action *action);
> > >
> > >  /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
> > >  static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 95b583e7e4f7..7bfa85b5e78b 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1775,26 +1775,32 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> > >  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> > >  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> > >
> > > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
> > > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > > -               unsigned long pfn, unsigned long size, pgprot_t pgprot);
> > > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > +                           struct mmap_action *action);
> > > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > > +                            struct mmap_action *action);
> > >
> > > -static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > -               unsigned long orig_pfn, unsigned long size)
> > > +static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > +                                            struct mmap_action *action)
> > >  {
> > > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > > +       const unsigned long size = action->remap.size;
> > >         const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > >
> > > -       return remap_pfn_range_prepare(desc, pfn);
> > > +       action->remap.start_pfn = pfn;
> > > +       return remap_pfn_range_prepare(desc, action);
> > >  }
> > >
> > >  static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
> > > -               unsigned long addr, unsigned long orig_pfn, unsigned long size,
> > > -               pgprot_t orig_prot)
> > > +                                             struct mmap_action *action)
> > >  {
> > > -       const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > > -       const pgprot_t prot = pgprot_decrypted(orig_prot);
> > > +       const unsigned long size = action->remap.size;
> > > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > > +       const pgprot_t orig_prot = vma->vm_page_prot;
> > >
> > > -       return remap_pfn_range_complete(vma, addr, pfn, size, prot);
> > > +       action->remap.pgprot = pgprot_decrypted(orig_prot);
>
> I'm guessing it doesn't really matter but after this change
> action->remap.pgprot will store the decrypted value while before this
> change it was kept the way mmap_prepare() originally set it. We pass
> the action structure later to mmap_actpion_finish() but it does not use
> action->remap.pgprot, so this probably doesn't matter.

Yeah it doesn't really matter either way.

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 01/15] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpEsCrFEYNkkTfRLGojGOYAAx1=WOojOhpBb_=WZBr6bnQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 03:56:54PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > Rather than passing arbitrary fields, pass an mmap_action field directly to
> > mmap prepare and complete helpers to put all the action-specific logic in
> > the function actually doing the work.
> >
> > Additionally, allow mmap prepare functions to return an error so we can
> > error out as soon as possible if there is something logically incorrect in
> > the input.
> >
> > Update remap_pfn_range_prepare() to properly check the input range for the
> > CoW case.
>
> By "properly check" do you mean the replacement of desc->start and
> desc->end with action->remap.start and action->remap.start +
> action->remap.size when calling get_remap_pgoff() from
> remap_pfn_range_prepare()?
>
> >
> > While we're here, make remap_pfn_range_prepare_vma() a little neater, and
> > pass mmap_action directly to call_action_complete().
> >
> > Then, update compat_vma_mmap() to perform its logic directly, as
> > __compat_vma_map() is not used by anything so we don't need to export it.
>
> Not directly related to this patch but while reviewing, I was also
> checking vma locking rules in this mmap_prepare() + mmap() sequence
> and I noticed that the new VMA flag modification functions like
> vma_set_flags_mask() do assert vma_assert_locked(vma). It would be

Do NOT? :)

I don't think it'd work, because in some cases you're setting flags for a
VMA that is not yet inserted in the tree, etc.

I don't think it's hugely useful to split out these functions in some way
in the way the vm_flags_*() stuff is split so we assert sometimes, not
others.

I'd rather keep this as clean an interface as possible.

In any case the majority of cases where flags are being set are not on the
VMA, so really only core code, that would likely otherwise assert when it
needs to, would already be asserting.

The cases where drivers will do it, all of them will be using
vma_desc_set_flags() etc.

> useful to add these but as a separate change. I will add it to my todo
> list.

So I don't think it'd be generally useful at this time.

>
> >
> > Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than calling
> > the mmap_prepare op directly.
> >
> > Finally, update the VMA userland tests to reflect the changes.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  include/linux/fs.h                |   2 -
> >  include/linux/mm.h                |   8 +--
> >  mm/internal.h                     |  28 +++++---
> >  mm/memory.c                       |  45 +++++++-----
> >  mm/util.c                         | 112 +++++++++++++-----------------
> >  mm/vma.c                          |  21 +++---
> >  tools/testing/vma/include/dup.h   |   9 ++-
> >  tools/testing/vma/include/stubs.h |   9 +--
> >  8 files changed, 123 insertions(+), 111 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b3dd145b25e..a2628a12bd2b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2058,8 +2058,6 @@ static inline bool can_mmap_file(struct file *file)
> >         return true;
> >  }
> >
> > -int __compat_vma_mmap(const struct file_operations *f_op,
> > -               struct file *file, struct vm_area_struct *vma);
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> >
> >  static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 4c4fd55fc823..cc5960a84382 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4116,10 +4116,10 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
> >         mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
> >  }
> >
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                        struct vm_area_desc *desc);
> > -int mmap_action_complete(struct mmap_action *action,
> > -                        struct vm_area_struct *vma);
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action);
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action);
> >
> >  /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
> >  static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 95b583e7e4f7..7bfa85b5e78b 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1775,26 +1775,32 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> >  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> >  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> >
> > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
> > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size, pgprot_t pgprot);
> > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                           struct mmap_action *action);
> > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > +                            struct mmap_action *action);
> >
> > -static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > -               unsigned long orig_pfn, unsigned long size)
> > +static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                                            struct mmap_action *action)
> >  {
> > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > +       const unsigned long size = action->remap.size;
> >         const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> >
> > -       return remap_pfn_range_prepare(desc, pfn);
> > +       action->remap.start_pfn = pfn;
> > +       return remap_pfn_range_prepare(desc, action);
> >  }
> >
> >  static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
> > -               unsigned long addr, unsigned long orig_pfn, unsigned long size,
> > -               pgprot_t orig_prot)
> > +                                             struct mmap_action *action)
> >  {
> > -       const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > -       const pgprot_t prot = pgprot_decrypted(orig_prot);
> > +       const unsigned long size = action->remap.size;
> > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > +       const pgprot_t orig_prot = vma->vm_page_prot;
> >
> > -       return remap_pfn_range_complete(vma, addr, pfn, size, prot);
> > +       action->remap.pgprot = pgprot_decrypted(orig_prot);
> > +       action->remap.start_pfn  = io_remap_pfn_range_pfn(orig_pfn, size);
> > +       return remap_pfn_range_complete(vma, action);
> >  }
> >
> >  #ifdef CONFIG_MMU_NOTIFIER
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6aa0ea4af1fc..364fa8a45360 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3099,26 +3099,34 @@ static int do_remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  #endif
> >
> > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
> > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                           struct mmap_action *action)
> >  {
> > -       /*
> > -        * We set addr=VMA start, end=VMA end here, so this won't fail, but we
> > -        * check it again on complete and will fail there if specified addr is
> > -        * invalid.
> > -        */
> > -       get_remap_pgoff(vma_desc_is_cow_mapping(desc), desc->start, desc->end,
> > -                       desc->start, desc->end, pfn, &desc->pgoff);
> > +       const unsigned long start = action->remap.start;
> > +       const unsigned long end = start + action->remap.size;
> > +       const unsigned long pfn = action->remap.start_pfn;
> > +       const bool is_cow = vma_desc_is_cow_mapping(desc);
>
> I was trying to figure out who sets action->remap.start and
> action->remap.size and if they somehow guaranteed to be always equal
> to desc->start and (desc->end - desc->start). My understanding is that
> action->remap.start and action->remap.size are set by
> f_op->mmap_prepare() but I'm not sure if they are always the same as
> desc->start and (desc->end - desc->start) and if so, how do we enforce
> that.

They are set, and they might not always be the same, because the existing
implementation does not set them the same.

Once I've completed the change, I can check to ensure that nobody is doing
anything crazy with this.

I also plan to add specific discontiguous range handlers to handle the
cases where drivers wish to map that way.

In fact, I already implemented it (and DMA coherent stuff) but stripped it
out the series for now for time (the original series was ~27 patches :) as
I want to test that more etc.

Users have access to mmap_action_remap_full() to specify that they want to
remap the full range.

>
> > +       int err;
> > +
> > +       err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
> > +                             &desc->pgoff);
> > +       if (err)
> > +               return err;
> > +
> >         vma_desc_set_flags_mask(desc, VMA_REMAP_FLAGS);
> > +       return 0;
> >  }
> >
> > -static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size)
> > +static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma,
> > +                                      unsigned long addr, unsigned long pfn,
> > +                                      unsigned long size)
> >  {
> > -       unsigned long end = addr + PAGE_ALIGN(size);
> > +       const unsigned long end = addr + PAGE_ALIGN(size);
> > +       const bool is_cow = is_cow_mapping(vma->vm_flags);
> >         int err;
> >
> > -       err = get_remap_pgoff(is_cow_mapping(vma->vm_flags), addr, end,
> > -                             vma->vm_start, vma->vm_end, pfn, &vma->vm_pgoff);
> > +       err = get_remap_pgoff(is_cow, addr, end, vma->vm_start, vma->vm_end,
> > +                             pfn, &vma->vm_pgoff);
> >         if (err)
> >                 return err;
> >
> > @@ -3151,10 +3159,15 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  EXPORT_SYMBOL(remap_pfn_range);
> >
> > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size, pgprot_t prot)
> > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > +                            struct mmap_action *action)
> >  {
> > -       return do_remap_pfn_range(vma, addr, pfn, size, prot);
> > +       const unsigned long start = action->remap.start;
> > +       const unsigned long pfn = action->remap.start_pfn;
> > +       const unsigned long size = action->remap.size;
> > +       const pgprot_t prot = action->remap.pgprot;
> > +
> > +       return do_remap_pfn_range(vma, start, pfn, size, prot);
> >  }
> >
> >  /**
> > diff --git a/mm/util.c b/mm/util.c
> > index ce7ae80047cf..dba1191725b6 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -1163,43 +1163,6 @@ void flush_dcache_folio(struct folio *folio)
> >  EXPORT_SYMBOL(flush_dcache_folio);
> >  #endif
> >
> > -/**
> > - * __compat_vma_mmap() - See description for compat_vma_mmap()
> > - * for details. This is the same operation, only with a specific file operations
> > - * struct which may or may not be the same as vma->vm_file->f_op.
> > - * @f_op: The file operations whose .mmap_prepare() hook is specified.
> > - * @file: The file which backs or will back the mapping.
> > - * @vma: The VMA to apply the .mmap_prepare() hook to.
> > - * Returns: 0 on success or error.
> > - */
> > -int __compat_vma_mmap(const struct file_operations *f_op,
> > -               struct file *file, struct vm_area_struct *vma)
> > -{
> > -       struct vm_area_desc desc = {
> > -               .mm = vma->vm_mm,
> > -               .file = file,
> > -               .start = vma->vm_start,
> > -               .end = vma->vm_end,
> > -
> > -               .pgoff = vma->vm_pgoff,
> > -               .vm_file = vma->vm_file,
> > -               .vma_flags = vma->flags,
> > -               .page_prot = vma->vm_page_prot,
> > -
> > -               .action.type = MMAP_NOTHING, /* Default */
> > -       };
> > -       int err;
> > -
> > -       err = f_op->mmap_prepare(&desc);
> > -       if (err)
> > -               return err;
> > -
> > -       mmap_action_prepare(&desc.action, &desc);
> > -       set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(&desc.action, vma);
> > -}
> > -EXPORT_SYMBOL(__compat_vma_mmap);
> > -
> >  /**
> >   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> >   * existing VMA and execute any requested actions.
> > @@ -1228,7 +1191,31 @@ EXPORT_SYMBOL(__compat_vma_mmap);
> >   */
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> >  {
> > -       return __compat_vma_mmap(file->f_op, file, vma);
> > +       struct vm_area_desc desc = {
> > +               .mm = vma->vm_mm,
> > +               .file = file,
> > +               .start = vma->vm_start,
> > +               .end = vma->vm_end,
> > +
> > +               .pgoff = vma->vm_pgoff,
> > +               .vm_file = vma->vm_file,
> > +               .vma_flags = vma->flags,
> > +               .page_prot = vma->vm_page_prot,
> > +
> > +               .action.type = MMAP_NOTHING, /* Default */
> > +       };
> > +       int err;
> > +
> > +       err = vfs_mmap_prepare(file, &desc);
> > +       if (err)
> > +               return err;
> > +
> > +       err = mmap_action_prepare(&desc, &desc.action);
> > +       if (err)
> > +               return err;
> > +
> > +       set_vma_from_desc(vma, &desc);
> > +       return mmap_action_complete(vma, &desc.action);
> >  }
> >  EXPORT_SYMBOL(compat_vma_mmap);
> >
> > @@ -1320,8 +1307,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
> >         }
> >  }
> >
> > -static int mmap_action_finish(struct mmap_action *action,
> > -               const struct vm_area_struct *vma, int err)
> > +static int mmap_action_finish(struct vm_area_struct *vma,
> > +                             struct mmap_action *action, int err)
> >  {
> >         /*
> >          * If an error occurs, unmap the VMA altogether and return an error. We
> > @@ -1355,35 +1342,36 @@ static int mmap_action_finish(struct mmap_action *action,
> >   * action which need to be performed.
> >   * @desc: The VMA descriptor to prepare for @action.
> >   * @action: The action to perform.
> > + *
> > + * Returns: 0 on success, otherwise error.
> >   */
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                        struct vm_area_desc *desc)
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action)
>
> Any reason you are swapping the arguments?

For consistency with other functions to be added.

> It also looks like we always call mmap_action_prepare() with action ==
> desc->action, like this: mmap_action_prepare(&desc.action, &desc). Why
> don't we eliminate the action parameter altogether and use desc.action
> from inside the function?

I think in previous iterations I thought about overriding one action with
another and wanted to keep that flexibility, but then have never done that
in practice.

So probably I can just drop that yes, will try it on respin.

>
> > +
>
> extra new line.

Ack will fix

>
> >  {
> >         switch (action->type) {
> >         case MMAP_NOTHING:
> > -               break;
> > +               return 0;
> >         case MMAP_REMAP_PFN:
> > -               remap_pfn_range_prepare(desc, action->remap.start_pfn);
> > -               break;
> > +               return remap_pfn_range_prepare(desc, action);
> >         case MMAP_IO_REMAP_PFN:
> > -               io_remap_pfn_range_prepare(desc, action->remap.start_pfn,
> > -                                          action->remap.size);
> > -               break;
> > +               return io_remap_pfn_range_prepare(desc, action);
> >         }
> >  }
> >  EXPORT_SYMBOL(mmap_action_prepare);
> >
> >  /**
> >   * mmap_action_complete - Execute VMA descriptor action.
> > - * @action: The action to perform.
> >   * @vma: The VMA to perform the action upon.
> > + * @action: The action to perform.
> >   *

> >   * Similar to mmap_action_prepare().
> >   *
> >   * Return: 0 on success, or error, at which point the VMA will be unmapped.
> >   */
> > -int mmap_action_complete(struct mmap_action *action,
> > -                        struct vm_area_struct *vma)
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action)
> > +
> >  {
> >         int err = 0;
> >
> > @@ -1391,23 +1379,19 @@ int mmap_action_complete(struct mmap_action *action,
> >         case MMAP_NOTHING:
> >                 break;
> >         case MMAP_REMAP_PFN:
> > -               err = remap_pfn_range_complete(vma, action->remap.start,
> > -                               action->remap.start_pfn, action->remap.size,
> > -                               action->remap.pgprot);
> > +               err = remap_pfn_range_complete(vma, action);
> >                 break;
> >         case MMAP_IO_REMAP_PFN:
> > -               err = io_remap_pfn_range_complete(vma, action->remap.start,
> > -                               action->remap.start_pfn, action->remap.size,
> > -                               action->remap.pgprot);
> > +               err = io_remap_pfn_range_complete(vma, action);
> >                 break;
> >         }
> >
> > -       return mmap_action_finish(action, vma, err);
> > +       return mmap_action_finish(vma, action, err);
> >  }
> >  EXPORT_SYMBOL(mmap_action_complete);
> >  #else
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                       struct vm_area_desc *desc)
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action)
> >  {
> >         switch (action->type) {
> >         case MMAP_NOTHING:
> > @@ -1417,11 +1401,13 @@ void mmap_action_prepare(struct mmap_action *action,
> >                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
> >                 break;
> >         }
> > +
> > +       return 0;
> >  }
> >  EXPORT_SYMBOL(mmap_action_prepare);
> >
> > -int mmap_action_complete(struct mmap_action *action,
> > -                       struct vm_area_struct *vma)
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action)
> >  {
> >         int err = 0;
> >
> > @@ -1436,7 +1422,7 @@ int mmap_action_complete(struct mmap_action *action,
> >                 break;
> >         }
> >
> > -       return mmap_action_finish(action, vma, err);
> > +       return mmap_action_finish(vma, action, err);
> >  }
> >  EXPORT_SYMBOL(mmap_action_complete);
> >  #endif
> > diff --git a/mm/vma.c b/mm/vma.c
> > index be64f781a3aa..054cf1d262fb 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -2613,15 +2613,19 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> >         vma_set_page_prot(vma);
> >  }
> >
> > -static void call_action_prepare(struct mmap_state *map,
> > -                               struct vm_area_desc *desc)
> > +static int call_action_prepare(struct mmap_state *map,
> > +                              struct vm_area_desc *desc)
> >  {
> >         struct mmap_action *action = &desc->action;
> > +       int err;
> >
> > -       mmap_action_prepare(action, desc);
> > +       err = mmap_action_prepare(desc, action);
> > +       if (err)
> > +               return err;
> >
> >         if (action->hide_from_rmap_until_complete)
> >                 map->hold_file_rmap_lock = true;
> > +       return 0;
> >  }
> >
> >  /*
> > @@ -2645,7 +2649,9 @@ static int call_mmap_prepare(struct mmap_state *map,
> >         if (err)
> >                 return err;
> >
> > -       call_action_prepare(map, desc);
> > +       err = call_action_prepare(map, desc);
> > +       if (err)
> > +               return err;
> >
> >         /* Update fields permitted to be changed. */
> >         map->pgoff = desc->pgoff;
> > @@ -2700,13 +2706,12 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> >  }
> >
> >  static int call_action_complete(struct mmap_state *map,
> > -                               struct vm_area_desc *desc,
> > +                               struct mmap_action *action,
> >                                 struct vm_area_struct *vma)
> >  {
> > -       struct mmap_action *action = &desc->action;
> >         int ret;
> >
> > -       ret = mmap_action_complete(action, vma);
> > +       ret = mmap_action_complete(vma, action);
> >
> >         /* If we held the file rmap we need to release it. */
> >         if (map->hold_file_rmap_lock) {
> > @@ -2768,7 +2773,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >         __mmap_complete(&map, vma);
> >
> >         if (have_mmap_prepare && allocated_new) {
> > -               error = call_action_complete(&map, &desc, vma);
> > +               error = call_action_complete(&map, &desc.action, vma);
> >
> >                 if (error)
> >                         return error;
> > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > index 5eb313beb43d..908beb263307 100644
> > --- a/tools/testing/vma/include/dup.h
> > +++ b/tools/testing/vma/include/dup.h
> > @@ -1106,7 +1106,7 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
> >
> >                 .pgoff = vma->vm_pgoff,
> >                 .vm_file = vma->vm_file,
> > -               .vm_flags = vma->vm_flags,
> > +               .vma_flags = vma->flags,
> >                 .page_prot = vma->vm_page_prot,
> >
> >                 .action.type = MMAP_NOTHING, /* Default */
> > @@ -1117,9 +1117,12 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
> >         if (err)
> >                 return err;
> >
> > -       mmap_action_prepare(&desc.action, &desc);
> > +       err = mmap_action_prepare(&desc, &desc.action);
> > +       if (err)
> > +               return err;
> > +
> >         set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(&desc.action, vma);
> > +       return mmap_action_complete(vma, &desc.action);
> >  }
> >
> >  static inline int compat_vma_mmap(struct file *file,
> > diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
> > index 947a3a0c2566..76c4b668bc62 100644
> > --- a/tools/testing/vma/include/stubs.h
> > +++ b/tools/testing/vma/include/stubs.h
> > @@ -81,13 +81,14 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> >  {
> >  }
> >
> > -static inline void mmap_action_prepare(struct mmap_action *action,
> > -                                          struct vm_area_desc *desc)
> > +static inline int mmap_action_prepare(struct vm_area_desc *desc,
> > +                                     struct mmap_action *action)
> >  {
> > +       return 0;
> >  }
> >
> > -static inline int mmap_action_complete(struct mmap_action *action,
> > -                                          struct vm_area_struct *vma)
> > +static inline int mmap_action_complete(struct vm_area_struct *vma,
> > +                                      struct mmap_action *action)
> >  {
> >         return 0;
> >  }
> > --
> > 2.53.0
> >

^ permalink raw reply

* Re: [PATCH 03/15] mm: document vm_operations_struct->open the same as close()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpHVN66abFrJgorXKBsjv7Ut=CP-E4NpLMC4SW613tJwtw@mail.gmail.com>

On Sun, Mar 15, 2026 at 05:43:41PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > Describe when the operation is invoked and the context in which it is
> > invoked, matching the description already added for vm_op->close().
> >
> > While we're here, update all outdated references to an 'area' field for
> > VMAs to the more consistent 'vma'.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  include/linux/mm.h | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index cc5960a84382..12a0b4c63736 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -748,15 +748,20 @@ struct vm_uffd_ops;
> >   * to the functions called when a no-page or a wp-page exception occurs.
> >   */
> >  struct vm_operations_struct {
> > -       void (*open)(struct vm_area_struct * area);
> > +       /**
> > +        * @open: Called when a VMA is remapped or split. Not called upon first
> > +        * mapping a VMA.
>
> It's also called from dup_mmap() which is part of forking.

Ah yup :) will update thanks!

>
> > +        * Context: User context.  May sleep.  Caller holds mmap_lock.
> > +        */
> > +       void (*open)(struct vm_area_struct *vma);
> >         /**
> >          * @close: Called when the VMA is being removed from the MM.
> >          * Context: User context.  May sleep.  Caller holds mmap_lock.
> >          */
> > -       void (*close)(struct vm_area_struct * area);
> > +       void (*close)(struct vm_area_struct *vma);
> >         /* Called any time before splitting to check if it's allowed */
> > -       int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> > -       int (*mremap)(struct vm_area_struct *area);
> > +       int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
> > +       int (*mremap)(struct vm_area_struct *vma);
> >         /*
> >          * Called by mprotect() to make driver-specific permission
> >          * checks before mprotect() is finalised.   The VMA must not
> > @@ -768,7 +773,7 @@ struct vm_operations_struct {
> >         vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order);
> >         vm_fault_t (*map_pages)(struct vm_fault *vmf,
> >                         pgoff_t start_pgoff, pgoff_t end_pgoff);
> > -       unsigned long (*pagesize)(struct vm_area_struct * area);
> > +       unsigned long (*pagesize)(struct vm_area_struct *vma);
> >
> >         /* notification that a previously read-only page is about to become
> >          * writable, if an error is returned it will cause a SIGBUS */
> > --
> > 2.53.0
> >

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 05/15] fs: afs: correctly drop reference count on mapping failure
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:29 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpFio6n-O-1NkPXrymV0o3UqvHYS8ZOyQtt=JXnZ5dTGhQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 07:32:54PM -0700, Suren Baghdasaryan wrote:
> On Fri, Mar 13, 2026 at 5:00 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Fri, Mar 13, 2026 at 04:07:43AM -0700, Usama Arif wrote:
> > > On Thu, 12 Mar 2026 20:27:20 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > >
> > > > Commit 9d5403b1036c ("fs: convert most other generic_file_*mmap() users to
> > > > .mmap_prepare()") updated AFS to use the mmap_prepare callback in favour of
> > > > the deprecated mmap callback.
> > > >
> > > > However, it did not account for the fact that mmap_prepare can fail to map
> > > > due to an out of memory error, and thus should not be incrementing a
> > > > reference count on mmap_prepare.
>
> This is a bit confusing. I see the current implementation does
> afs_add_open_mmap() and then if generic_file_mmap_prepare() fails it
> does afs_drop_open_mmap(), therefore refcounting seems to be balanced.
> Is there really a problem?

Firstly, mmap_prepare is invoked before we try to merge, so the VMA could in
theory get merged and then the refcounting will be wrong.

Secondly, mmap_prepare occurs at such at time where it is _possible_ that
allocation failures as described below could happen.

I'll update the commit message to reflect the merge aspect actually.

>
> > > >
> > > > With the newly added vm_ops->mapped callback available, we can simply defer
> > > > this operation to that callback which is only invoked once the mapping is
> > > > successfully in place (but not yet visible to userspace as the mmap and VMA
> > > > write locks are held).
> > > >
> > > > Therefore add afs_mapped() to implement this callback for AFS.
> > > >
> > > > In practice the mapping allocations are 'too small to fail' so this is
> > > > something that realistically should never happen in practice (or would do
> > > > so in a case where the process is about to die anyway), but we should still
> > > > handle this.
>
> nit: I would drop the above paragraph. If it's impossible why are you
> handling it? If it's unlikely, then handling it is even more
> important.

Sure I can drop it, but it's an ongoing thing with these small allocations.

I wish we could just move to a scenario where we can simpy assume allocations
will always succeed :)

Vlasta - thoughts?

Cheers, Lorenzo

>
> > > >
> > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > ---
> > > >  fs/afs/file.c | 20 ++++++++++++++++----
> > > >  1 file changed, 16 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/afs/file.c b/fs/afs/file.c
> > > > index f609366fd2ac..69ef86f5e274 100644
> > > > --- a/fs/afs/file.c
> > > > +++ b/fs/afs/file.c
> > > > @@ -28,6 +28,8 @@ static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
> > > >  static void afs_vm_open(struct vm_area_struct *area);
> > > >  static void afs_vm_close(struct vm_area_struct *area);
> > > >  static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
> > > > +static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >
> > > >  const struct file_operations afs_file_operations = {
> > > >     .open           = afs_open,
> > > > @@ -61,6 +63,7 @@ const struct address_space_operations afs_file_aops = {
> > > >  };
> > > >
> > > >  static const struct vm_operations_struct afs_vm_ops = {
> > > > +   .mapped         = afs_mapped,
> > > >     .open           = afs_vm_open,
> > > >     .close          = afs_vm_close,
> > > >     .fault          = filemap_fault,
> > > > @@ -500,13 +503,22 @@ static int afs_file_mmap_prepare(struct vm_area_desc *desc)
> > > >     afs_add_open_mmap(vnode);
> > >
> > > Is the above afs_add_open_mmap an additional one, which could cause a reference
> > > leak? Does the above one need to be removed and only the one in afs_mapped()
> > > needs to be kept?
> >
> > Ah yeah good spot, will fix thanks!
> >
> > >
> > > >
> > > >     ret = generic_file_mmap_prepare(desc);
> > > > -   if (ret == 0)
> > > > -           desc->vm_ops = &afs_vm_ops;
> > > > -   else
> > > > -           afs_drop_open_mmap(vnode);
> > > > +   if (ret)
> > > > +           return ret;
> > > > +
> > > > +   desc->vm_ops = &afs_vm_ops;
> > > >     return ret;
> > > >  }
> > > >
> > > > +static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data)
> > > > +{
> > > > +   struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
> > > > +
> > > > +   afs_add_open_mmap(vnode);
> > > > +   return 0;
> > > > +}
> > > > +
> > > >  static void afs_vm_open(struct vm_area_struct *vma)
> > > >  {
> > > >     afs_add_open_mmap(AFS_FS_I(file_inode(vma->vm_file)));
> > > > --
> > > > 2.53.0
> > > >
> > > >
> >
> > Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 04/15] mm: add vm_ops->mapped hook
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 13:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpH1gzi50aWni7rh9=2gM8WwCzm=fY14DCFbjweAq82i6Q@mail.gmail.com>

On Sun, Mar 15, 2026 at 07:18:38PM -0700, Suren Baghdasaryan wrote:
> On Fri, Mar 13, 2026 at 4:58 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Fri, Mar 13, 2026 at 04:02:36AM -0700, Usama Arif wrote:
> > > On Thu, 12 Mar 2026 20:27:19 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > >
> > > > Previously, when a driver needed to do something like establish a reference
> > > > count, it could do so in the mmap hook in the knowledge that the mapping
> > > > would succeed.
> > > >
> > > > With the introduction of f_op->mmap_prepare this is no longer the case, as
> > > > it is invoked prior to actually establishing the mapping.
> > > >
> > > > To take this into account, introduce a new vm_ops->mapped callback which is
> > > > invoked when the VMA is first mapped (though notably - not when it is
> > > > merged - which is correct and mirrors existing mmap/open/close behaviour).
> > > >
> > > > We do better that vm_ops->open() here, as this callback can return an
> > > > error, at which point the VMA will be unmapped.
> > > >
> > > > Note that vm_ops->mapped() is invoked after any mmap action is
> > > > complete (such as I/O remapping).
> > > >
> > > > We intentionally do not expose the VMA at this point, exposing only the
> > > > fields that could be used, and an output parameter in case the operation
> > > > needs to update the vma->vm_private_data field.
> > > >
> > > > In order to deal with stacked filesystems which invoke inner filesystem's
> > > > mmap() invocations, add __compat_vma_mapped() and invoke it on
> > > > vfs_mmap() (via compat_vma_mmap()) to ensure that the mapped callback is
> > > > handled when an mmap() caller invokes a nested filesystem's mmap_prepare()
> > > > callback.
> > > >
> > > > We can now also remove call_action_complete() and invoke
> > > > mmap_action_complete() directly, as we separate out the rmap lock logic to
> > > > be called in __mmap_region() instead via maybe_drop_file_rmap_lock().
> > > >
> > > > We also abstract unmapping of a VMA on mmap action completion into its own
> > > > helper function, unmap_vma_locked().
> > > >
> > > > Additionally, update VMA userland test headers to reflect the change.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > ---
> > > >  include/linux/fs.h              |  9 +++-
> > > >  include/linux/mm.h              | 17 +++++++
> > > >  mm/internal.h                   | 10 ++++
> > > >  mm/util.c                       | 86 ++++++++++++++++++++++++---------
> > > >  mm/vma.c                        | 41 +++++++++++-----
> > > >  tools/testing/vma/include/dup.h | 34 ++++++++++++-
> > > >  6 files changed, 158 insertions(+), 39 deletions(-)
> > > >
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index a2628a12bd2b..c390f5c667e3 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2059,13 +2059,20 @@ static inline bool can_mmap_file(struct file *file)
> > > >  }
> > > >
> > > >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma);
> > > >
> > > >  static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  {
> > > > +   int err;
> > > > +
> > > >     if (file->f_op->mmap_prepare)
> > > >             return compat_vma_mmap(file, vma);
> > > >
> > > > -   return file->f_op->mmap(file, vma);
> > > > +   err = file->f_op->mmap(file, vma);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   return __vma_check_mmap_hook(vma);
> > > >  }
> > > >
> > > >  static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 12a0b4c63736..7333d5db1221 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -759,6 +759,23 @@ struct vm_operations_struct {
> > > >      * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > >      */
> > > >     void (*close)(struct vm_area_struct *vma);
> > > > +   /**
> > > > +    * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > +    * the new VMA is merged with an adjacent VMA.
> > > > +    *
> > > > +    * The @vm_private_data field is an output field allowing the user to
> > > > +    * modify vma->vm_private_data as necessary.
> > > > +    *
> > > > +    * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > +    * set from f_op->mmap.
> > > > +    *
> > > > +    * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > +    * be unmapped.
> > > > +    *
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
> > > > +   int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >     /* Called any time before splitting to check if it's allowed */
> > > >     int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
> > > >     int (*mremap)(struct vm_area_struct *vma);
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 7bfa85b5e78b..f0f2cf1caa36 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -158,6 +158,8 @@ static inline void *folio_raw_mapping(const struct folio *folio)
> > > >   * mmap hook and safely handle error conditions. On error, VMA hooks will be
> > > >   * mutated.
> > > >   *
> > > > + * IMPORTANT: f_op->mmap() is deprecated, prefer f_op->mmap_prepare().
> > > > + *
>
> What exactly would one do to "prefer f_op->mmap_prepare()"?

I'm saying a person should implement f_op->mmap_prepare() rather than
f_op->mmap(), since the latter is deprecated :)

I think that's pretty clear no?

> Since you are adding this comment for mmap_file(), I think you need to
> describe more specifically what one should call instead.

I think it'd be a complete distraction, since if you're at the point of calling
mmap_file() you're already not implement mmap_prepare except as a compatbility
layer.

I mean maybe I'll just drop this as it seems to be causing confusion.

>
> > > >   * @file: File which backs the mapping.
> > > >   * @vma:  VMA which we are mapping.
> > > >   *
> > > > @@ -201,6 +203,14 @@ static inline void vma_close(struct vm_area_struct *vma)
> > > >  /* unmap_vmas is in mm/memory.c */
> > > >  void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap);
> > > >
> > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > +{
> > > > +   const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > +
> > > > +   mmap_assert_locked(vma->vm_mm);
>
> You must hold the mmap write lock when unmapping. Would be better to
> assert mmap_assert_write_locked() or even vma_assert_write_locked(),
> which implies mmap_assert_write_locked().

I'm not sure why we don't assert this in those paths.

I think I assumed we could only assert readonly because one of those paths
downgrades the mmap write lock to a read lock.

I don't think we can do a VMA write lock assert here, since at the point of
do_munmap() all callers can't possibly have the VMA write lock, since they are
_looking up_ the VMA at the specified address.

But I can convert this to an mmap_assert_write_locked()!

>
> > > > +   do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > +}
> > > > +
> > > >  #ifdef CONFIG_MMU
> > > >
> > > >  static inline void get_anon_vma(struct anon_vma *anon_vma)
> > > > diff --git a/mm/util.c b/mm/util.c
> > > > index dba1191725b6..2b0ed54008d6 100644
> > > > --- a/mm/util.c
> > > > +++ b/mm/util.c
> > > > @@ -1163,6 +1163,55 @@ void flush_dcache_folio(struct folio *folio)
> > > >  EXPORT_SYMBOL(flush_dcache_folio);
> > > >  #endif
> > > >
> > > > +static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > > +{
> > > > +   struct vm_area_desc desc = {
> > > > +           .mm = vma->vm_mm,
> > > > +           .file = file,
> > > > +           .start = vma->vm_start,
> > > > +           .end = vma->vm_end,
> > > > +
> > > > +           .pgoff = vma->vm_pgoff,
> > > > +           .vm_file = vma->vm_file,
> > > > +           .vma_flags = vma->flags,
> > > > +           .page_prot = vma->vm_page_prot,
> > > > +
> > > > +           .action.type = MMAP_NOTHING, /* Default */
> > > > +   };
> > > > +   int err;
> > > > +
> > > > +   err = vfs_mmap_prepare(file, &desc);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   err = mmap_action_prepare(&desc, &desc.action);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   set_vma_from_desc(vma, &desc);
> > > > +   return mmap_action_complete(vma, &desc.action);
> > > > +}
> > > > +
> > > > +static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> > > > +{
> > > > +   const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > +   void *vm_private_data = vma->vm_private_data;
> > > > +   int err;
> > > > +
> > > > +   if (!vm_ops->mapped)
> > > > +           return 0;
> > > > +
> > >
> > > Hello!
> > >
> > > Can vm_ops be NULL here?  __compat_vma_mapped() is called from
> > > compat_vma_mmap(), which is reached when a filesystem provides
> > > mmap_prepare.  If the mmap_prepare hook does not set desc->vm_ops,
> > > vma->vm_ops will be NULL and this dereferences a NULL pointer.
> >
> > I _think_ for this to ever be invoked, you would need to be dealing with a
> > file-backed VMA so vm_ops->fault would HAVE to be defined.
> >
> > But you're right anyway as a matter of principle we should check it! Will fix.
> >
> > >
> > > For e.g. drivers/char/mem.c, mmap_zero_prepare() would trigger
> > > a NULL pointer dereference here.
> > >
> > > Would need to do
> > >       if (!vm_ops || !vm_ops->mapped)
> > >               return 0;
> > >
> > > here
> >
> > Yes.
> >
> > >
> > >
> > > > +   err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
> > > > +                        &vm_private_data);
> > > > +   if (err)
> > > > +           unmap_vma_locked(vma);
> > >
> > > when mapped() returns an error, unmap_vma_locked(vma) is called
> > > but execution continues into the vm_private_data update below.  After
> > > unmap_vma_locked() the VMA may be freed (do_munmap can remove the VMA
> > > entirely), so accessing vma->vm_private_data after that is a
> > > use-after-free.
> >
> > Very good point :) will fix thanks!
> >
> > Probably:
> >
> >         if (err)
> >                 unmap_vma_locked(vma);
> >         else if (vm_private_data != vma->vm_private_data)
> >                 vma->vm_private_data = vm_private_data;
> >
> >         return err;
> >
> > Would be fine.
> >
> > >
> > > Probably need to do:
> > >       if (err) {
> > >               unmap_vma_locked(vma);
> > >               return err;
> > >       }
> > >
> > > > +   /* Update private data if changed. */
> > > > +   if (vm_private_data != vma->vm_private_data)
> > > > +           vma->vm_private_data = vm_private_data;
> > > > +
> > > > +   return err;
> > > > +}
> > > > +
> > > >  /**
> > > >   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> > > >   * existing VMA and execute any requested actions.
> > > > @@ -1191,34 +1240,26 @@ EXPORT_SYMBOL(flush_dcache_folio);
> > > >   */
> > > >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  {
> > > > -   struct vm_area_desc desc = {
> > > > -           .mm = vma->vm_mm,
> > > > -           .file = file,
> > > > -           .start = vma->vm_start,
> > > > -           .end = vma->vm_end,
> > > > -
> > > > -           .pgoff = vma->vm_pgoff,
> > > > -           .vm_file = vma->vm_file,
> > > > -           .vma_flags = vma->flags,
> > > > -           .page_prot = vma->vm_page_prot,
> > > > -
> > > > -           .action.type = MMAP_NOTHING, /* Default */
> > > > -   };
> > > >     int err;
> > > >
> > > > -   err = vfs_mmap_prepare(file, &desc);
> > > > -   if (err)
> > > > -           return err;
> > > > -
> > > > -   err = mmap_action_prepare(&desc, &desc.action);
> > > > +   err = __compat_vma_mmap(file, vma);
> > > >     if (err)
> > > >             return err;
> > > >
> > > > -   set_vma_from_desc(vma, &desc);
> > > > -   return mmap_action_complete(vma, &desc.action);
> > > > +   return __compat_vma_mapped(file, vma);
> > > >  }
> > > >  EXPORT_SYMBOL(compat_vma_mmap);
> > > >
> > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma)
> > > > +{
> > > > +   /* vm_ops->mapped is not valid if mmap() is specified. */
> > > > +   if (WARN_ON_ONCE(vma->vm_ops->mapped))
> > > > +           return -EINVAL;
> > >
> > > I think vma->vm_ops can be NULL here. Should be:
> > >
> > >       if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
> > >               return -EINVAL;
> >
> > I think again you'd probably only invoke this on file-backed so be ok, but again
> > as a matter of principle we should check it so will fix, thanks!
> >
> > >
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__vma_check_mmap_hook);
>
> nit: Any reason __vma_check_mmap_hook() is not inlined next to its
> user vfs_mmap()?

Headers fun, fs.h is a 'before mm.h' header, so vm_operations_struct is not
declared yet here, so we can't actually do the check there.

>
> > > > +
> > > >  static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
> > > >                      const struct page *page)
> > > >  {
> > > > @@ -1316,10 +1357,7 @@ static int mmap_action_finish(struct vm_area_struct *vma,
> > > >      * invoked if we do NOT merge, so we only clean up the VMA we created.
> > > >      */
> > > >     if (err) {
> > > > -           const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > -
> > > > -           do_munmap(current->mm, vma->vm_start, len, NULL);
> > > > -
> > > > +           unmap_vma_locked(vma);
> > > >             if (action->error_hook) {
> > > >                     /* We may want to filter the error. */
> > > >                     err = action->error_hook(err);
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index 054cf1d262fb..ef9f5a5365d1 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -2705,21 +2705,35 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> > > >     return false;
> > > >  }
> > > >
> > > > -static int call_action_complete(struct mmap_state *map,
> > > > -                           struct mmap_action *action,
> > > > -                           struct vm_area_struct *vma)
> > > > +static int call_mapped_hook(struct vm_area_struct *vma)
> > > >  {
> > > > -   int ret;
> > > > +   const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > +   void *vm_private_data = vma->vm_private_data;
> > > > +   int err;
> > > >
> > > > -   ret = mmap_action_complete(vma, action);
> > > > +   if (!vm_ops || !vm_ops->mapped)
> > > > +           return 0;
> > > > +   err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff,
> > > > +                        vma->vm_file, &vm_private_data);
> > > > +   if (err) {
> > > > +           unmap_vma_locked(vma);
> > > > +           return err;
> > > > +   }
> > > > +   /* Update private data if changed. */
> > > > +   if (vm_private_data != vma->vm_private_data)
> > > > +           vma->vm_private_data = vm_private_data;
> > > > +   return 0;
> > > > +}
> > > >
> > > > -   /* If we held the file rmap we need to release it. */
> > > > -   if (map->hold_file_rmap_lock) {
> > > > -           struct file *file = vma->vm_file;
> > > > +static void maybe_drop_file_rmap_lock(struct mmap_state *map,
> > > > +                                 struct vm_area_struct *vma)
> > > > +{
> > > > +   struct file *file;
> > > >
> > > > -           i_mmap_unlock_write(file->f_mapping);
> > > > -   }
> > > > -   return ret;
> > > > +   if (!map->hold_file_rmap_lock)
> > > > +           return;
> > > > +   file = vma->vm_file;
> > > > +   i_mmap_unlock_write(file->f_mapping);
> > > >  }
> > > >
> > > >  static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > > @@ -2773,8 +2787,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > >     __mmap_complete(&map, vma);
> > > >
> > > >     if (have_mmap_prepare && allocated_new) {
> > > > -           error = call_action_complete(&map, &desc.action, vma);
> > > > +           error = mmap_action_complete(vma, &desc.action);
> > > > +           if (!error)
> > > > +                   error = call_mapped_hook(vma);
> > > >
> > > > +           maybe_drop_file_rmap_lock(&map, vma);
> > > >             if (error)
> > > >                     return error;
> > > >     }
> > > > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > > > index 908beb263307..47d8db809f31 100644
> > > > --- a/tools/testing/vma/include/dup.h
> > > > +++ b/tools/testing/vma/include/dup.h
> > > > @@ -606,12 +606,34 @@ struct vm_area_struct {
> > > >  } __randomize_layout;
> > > >
> > > >  struct vm_operations_struct {
> > > > -   void (*open)(struct vm_area_struct * area);
> > > > +   /**
> > > > +    * @open: Called when a VMA is remapped or split. Not called upon first
> > > > +    * mapping a VMA.
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
>
> This comment should have been introduced in the previous patch.

It's the testing code, it's not really important. But if I respin I'll fix... :)

>
> > > > +   void (*open)(struct vm_area_struct *vma);
> > > >     /**
> > > >      * @close: Called when the VMA is being removed from the MM.
> > > >      * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > >      */
> > > > -   void (*close)(struct vm_area_struct * area);
> > > > +   void (*close)(struct vm_area_struct *vma);
> > > > +   /**
> > > > +    * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > +    * the new VMA is merged with an adjacent VMA.
> > > > +    *
> > > > +    * The @vm_private_data field is an output field allowing the user to
> > > > +    * modify vma->vm_private_data as necessary.
> > > > +    *
> > > > +    * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > +    * set from f_op->mmap.
> > > > +    *
> > > > +    * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > +    * be unmapped.
> > > > +    *
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
> > > > +   int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >     /* Called any time before splitting to check if it's allowed */
> > > >     int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> > > >     int (*mremap)(struct vm_area_struct *area);
> > > > @@ -1345,3 +1367,11 @@ static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
> > > >     swap(vma->vm_file, file);
> > > >     fput(file);
> > > >  }
> > > > +
> > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > +{
> > > > +   const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > +
> > > > +   mmap_assert_locked(vma->vm_mm);
> > > > +   do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > +}
> > > > --
> > > > 2.53.0
> > > >
> > > >
> >
> > Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 50/61] iommu: Prefer IS_ERR_OR_NULL over manual NULL check
From: Robin Murphy @ 2026-03-16 13:30 UTC (permalink / raw)
  To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
	dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
	linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
	linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
	linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
	linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
	linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
	linux-s390, linux-scsi, linux-sctp, linux-security-module,
	linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
	linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
	target-devel, tipc-discussion, v9fs
  Cc: Joerg Roedel, Will Deacon
In-Reply-To: <20260310-b4-is_err_or_null-v1-50-bd63b656022d@avm.de>

On 2026-03-10 11:49 am, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.

AFAICS it doesn't look possible for the argument to be anything other 
than valid at both callsites, so *both* conditions here seem in fact to 
be entirely redundant.

> Change generated with coccinelle.

Please use coccinelle responsibly. Mechanical changes are great for 
scripted API updates, but for cleanup, whilst it's ideal for *finding* 
areas of code that are worth looking at, the code then wants actually 
looking at, in its whole context, because meaningful cleanup often goes 
deeper than trivial replacement.

In particular, anywhere IS_ERR_OR_NULL() is genuinely relevant is 
usually a sign of bad interface design, so if you're looking at this 
then you really should be looking first and foremost to remove any 
checks that are already unnecessary, and for the remainder, to see if 
the thing being checked can be improved to not mix the two different 
styles. That would be constructive and (usually) welcome cleanup. Simply 
churning a bunch of code with this ugly macro that's arguably less 
readable than what it replaces, not so much.

Thanks,
Robin.

> To: Joerg Roedel <joro@8bytes.org>
> To: Will Deacon <will@kernel.org>
> To: Robin Murphy <robin.murphy@arm.com>
> Cc: iommu@lists.linux.dev
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
>   drivers/iommu/omap-iommu.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
> index 8231d7d6bb6a9202025643639a6b28e6faa84659..500a42b57a997696ff37c76f028a717ab71d01f9 100644
> --- a/drivers/iommu/omap-iommu.c
> +++ b/drivers/iommu/omap-iommu.c
> @@ -881,7 +881,7 @@ static int omap_iommu_attach(struct omap_iommu *obj, u32 *iopgd)
>    **/
>   static void omap_iommu_detach(struct omap_iommu *obj)
>   {
> -	if (!obj || IS_ERR(obj))
> +	if (IS_ERR_OR_NULL(obj))
>   		return;
>   
>   	spin_lock(&obj->iommu_lock);
> 

^ permalink raw reply

* [PATCH 11/11] Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Enable ARM64 support in MSHV_VTL Kconfig now that all the necessary
support is present.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 7937ac0cbd0f..393cef272590 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -87,7 +87,7 @@ config MSHV_ROOT
 
 config MSHV_VTL
 	tristate "Microsoft Hyper-V VTL driver"
-	depends on X86_64 && HYPERV_VTL_MODE
+	depends on (X86_64 || ARM64) && HYPERV_VTL_MODE
 	depends on HYPERV_VMBUS
 	# Mapping VTL0 memory to a userspace process in VTL2 is supported in OpenHCL.
 	# VTL2 for OpenHCL makes use of Huge Pages to improve performance on VMs,
-- 
2.43.0


^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox