Linux-HyperV List
* [PATCH v2 01/16] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <cover.1773695307.git.ljs@kernel.org>

Rather than passing arbitrary fields, pass a vm_area_desc pointer to the
mmap prepare functions, and a vma pointer and an action to the mmap
complete functions, in order to put all the action-specific logic in the
function actually doing the work.

Additionally, allow mmap prepare functions to return an error so we can
error out as soon as possible if there is something logically incorrect in
the input.

Update remap_pfn_range_prepare() to properly check the input range for the
CoW case.

While we're here, make remap_pfn_range_prepare_vma() a little neater, and
pass mmap_action directly to call_action_complete().

Then, update compat_vma_mmap() to perform its logic directly, as
__compat_vma_mmap() is not used by anything else, so we don't need to
export it.

Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than
calling the mmap_prepare op directly.

Finally, update the VMA userland tests to reflect the changes.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
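As a rough illustration of the calling convention described above, here is
a user-space sketch. The struct layouts are simplified stand-ins rather
than the kernel definitions, the model_*() names and the is_cow field are
hypothetical, and the CoW check assumes get_remap_pgoff() rejects a CoW
remap that does not span the whole mapping (its body is not shown in this
patch).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
enum mmap_action_type { MMAP_NOTHING, MMAP_REMAP_PFN };

struct mmap_action {
	enum mmap_action_type type;
	unsigned long start, start_pfn, size;
};

struct vm_area_desc {
	unsigned long start, end, pgoff;
	bool is_cow;	/* hypothetical stand-in for vma_desc_is_cow_mapping() */
	struct mmap_action action;
};

/* Assumed CoW rule: the remapped range must cover the whole mapping. */
static int model_remap_prepare(struct vm_area_desc *desc)
{
	const struct mmap_action *action = &desc->action;
	const unsigned long end = action->start + action->size;

	if (desc->is_cow) {
		if (action->start != desc->start || end != desc->end)
			return -EINVAL;
		desc->pgoff = action->start_pfn;
	}
	return 0;
}

/* The action is read from the descriptor and errors propagate upwards. */
static int model_action_prepare(struct vm_area_desc *desc)
{
	assert(desc != NULL);

	switch (desc->action.type) {
	case MMAP_NOTHING:
		return 0;
	case MMAP_REMAP_PFN:
		return model_remap_prepare(desc);
	}
	return -EINVAL;	/* unknown action type */
}
```

A caller in this model only has to check the return value of
model_action_prepare() and can bail out before any mapping state is
touched, which is the early-error behaviour the commit message describes.
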
 include/linux/fs.h                |   2 -
 include/linux/mm.h                |   7 +-
 mm/internal.h                     |  27 ++++---
 mm/memory.c                       |  45 +++++++----
 mm/util.c                         | 119 +++++++++++++-----------------
 mm/vma.c                          |  24 +++---
 tools/testing/vma/include/dup.h   |   7 +-
 tools/testing/vma/include/stubs.h |   8 +-
 8 files changed, 123 insertions(+), 116 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8b3dd145b25e..a2628a12bd2b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2058,8 +2058,6 @@ static inline bool can_mmap_file(struct file *file)
 	return true;
 }
 
-int __compat_vma_mmap(const struct file_operations *f_op,
-		struct file *file, struct vm_area_struct *vma);
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
 
 static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 42cc40aa63d9..1e63b3a44a47 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4320,10 +4320,9 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
 	mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
 }
 
-void mmap_action_prepare(struct mmap_action *action,
-			 struct vm_area_desc *desc);
-int mmap_action_complete(struct mmap_action *action,
-			 struct vm_area_struct *vma);
+int mmap_action_prepare(struct vm_area_desc *desc);
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action);
 
 /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
diff --git a/mm/internal.h b/mm/internal.h
index 708d240b4198..9e42a57e8a12 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1793,26 +1793,31 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
 void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
 int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
 
-void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
-int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t pgprot);
+int remap_pfn_range_prepare(struct vm_area_desc *desc);
+int remap_pfn_range_complete(struct vm_area_struct *vma,
+			     struct mmap_action *action);
 
-static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
-		unsigned long orig_pfn, unsigned long size)
+static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc)
 {
+	struct mmap_action *action = &desc->action;
+	const unsigned long orig_pfn = action->remap.start_pfn;
+	const unsigned long size = action->remap.size;
 	const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
 
-	return remap_pfn_range_prepare(desc, pfn);
+	action->remap.start_pfn = pfn;
+	return remap_pfn_range_prepare(desc);
 }
 
 static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
-		unsigned long addr, unsigned long orig_pfn, unsigned long size,
-		pgprot_t orig_prot)
+					      struct mmap_action *action)
 {
-	const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
-	const pgprot_t prot = pgprot_decrypted(orig_prot);
+	const unsigned long size = action->remap.size;
+	const unsigned long orig_pfn = action->remap.start_pfn;
+	const pgprot_t orig_prot = vma->vm_page_prot;
 
-	return remap_pfn_range_complete(vma, addr, pfn, size, prot);
+	action->remap.pgprot = pgprot_decrypted(orig_prot);
+	action->remap.start_pfn  = io_remap_pfn_range_pfn(orig_pfn, size);
+	return remap_pfn_range_complete(vma, action);
 }
 
 #ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memory.c b/mm/memory.c
index 219b9bf6cae0..9dec67a18116 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3099,26 +3099,34 @@ static int do_remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
-void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
+int remap_pfn_range_prepare(struct vm_area_desc *desc)
 {
-	/*
-	 * We set addr=VMA start, end=VMA end here, so this won't fail, but we
-	 * check it again on complete and will fail there if specified addr is
-	 * invalid.
-	 */
-	get_remap_pgoff(vma_desc_is_cow_mapping(desc), desc->start, desc->end,
-			desc->start, desc->end, pfn, &desc->pgoff);
+	const struct mmap_action *action = &desc->action;
+	const unsigned long start = action->remap.start;
+	const unsigned long end = start + action->remap.size;
+	const unsigned long pfn = action->remap.start_pfn;
+	const bool is_cow = vma_desc_is_cow_mapping(desc);
+	int err;
+
+	err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
+			      &desc->pgoff);
+	if (err)
+		return err;
+
 	vma_desc_set_flags_mask(desc, VMA_REMAP_FLAGS);
+	return 0;
 }
 
-static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size)
+static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma,
+				       unsigned long addr, unsigned long pfn,
+				       unsigned long size)
 {
-	unsigned long end = addr + PAGE_ALIGN(size);
+	const unsigned long end = addr + PAGE_ALIGN(size);
+	const bool is_cow = is_cow_mapping(vma->vm_flags);
 	int err;
 
-	err = get_remap_pgoff(is_cow_mapping(vma->vm_flags), addr, end,
-			      vma->vm_start, vma->vm_end, pfn, &vma->vm_pgoff);
+	err = get_remap_pgoff(is_cow, addr, end, vma->vm_start, vma->vm_end,
+			      pfn, &vma->vm_pgoff);
 	if (err)
 		return err;
 
@@ -3151,10 +3159,15 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
-int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
-		unsigned long pfn, unsigned long size, pgprot_t prot)
+int remap_pfn_range_complete(struct vm_area_struct *vma,
+			     struct mmap_action *action)
 {
-	return do_remap_pfn_range(vma, addr, pfn, size, prot);
+	const unsigned long start = action->remap.start;
+	const unsigned long pfn = action->remap.start_pfn;
+	const unsigned long size = action->remap.size;
+	const pgprot_t prot = action->remap.pgprot;
+
+	return do_remap_pfn_range(vma, start, pfn, size, prot);
 }
 
 /**
diff --git a/mm/util.c b/mm/util.c
index ce7ae80047cf..ac9dd6490523 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -1163,43 +1163,6 @@ void flush_dcache_folio(struct folio *folio)
 EXPORT_SYMBOL(flush_dcache_folio);
 #endif
 
-/**
- * __compat_vma_mmap() - See description for compat_vma_mmap()
- * for details. This is the same operation, only with a specific file operations
- * struct which may or may not be the same as vma->vm_file->f_op.
- * @f_op: The file operations whose .mmap_prepare() hook is specified.
- * @file: The file which backs or will back the mapping.
- * @vma: The VMA to apply the .mmap_prepare() hook to.
- * Returns: 0 on success or error.
- */
-int __compat_vma_mmap(const struct file_operations *f_op,
-		struct file *file, struct vm_area_struct *vma)
-{
-	struct vm_area_desc desc = {
-		.mm = vma->vm_mm,
-		.file = file,
-		.start = vma->vm_start,
-		.end = vma->vm_end,
-
-		.pgoff = vma->vm_pgoff,
-		.vm_file = vma->vm_file,
-		.vma_flags = vma->flags,
-		.page_prot = vma->vm_page_prot,
-
-		.action.type = MMAP_NOTHING, /* Default */
-	};
-	int err;
-
-	err = f_op->mmap_prepare(&desc);
-	if (err)
-		return err;
-
-	mmap_action_prepare(&desc.action, &desc);
-	set_vma_from_desc(vma, &desc);
-	return mmap_action_complete(&desc.action, vma);
-}
-EXPORT_SYMBOL(__compat_vma_mmap);
-
 /**
  * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
  * existing VMA and execute any requested actions.
@@ -1228,7 +1191,31 @@ EXPORT_SYMBOL(__compat_vma_mmap);
  */
 int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	return __compat_vma_mmap(file->f_op, file, vma);
+	struct vm_area_desc desc = {
+		.mm = vma->vm_mm,
+		.file = file,
+		.start = vma->vm_start,
+		.end = vma->vm_end,
+
+		.pgoff = vma->vm_pgoff,
+		.vm_file = vma->vm_file,
+		.vma_flags = vma->flags,
+		.page_prot = vma->vm_page_prot,
+
+		.action.type = MMAP_NOTHING, /* Default */
+	};
+	int err;
+
+	err = vfs_mmap_prepare(file, &desc);
+	if (err)
+		return err;
+
+	err = mmap_action_prepare(&desc);
+	if (err)
+		return err;
+
+	set_vma_from_desc(vma, &desc);
+	return mmap_action_complete(vma, &desc.action);
 }
 EXPORT_SYMBOL(compat_vma_mmap);
 
@@ -1320,8 +1307,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
 	}
 }
 
-static int mmap_action_finish(struct mmap_action *action,
-		const struct vm_area_struct *vma, int err)
+static int mmap_action_finish(struct vm_area_struct *vma,
+			      struct mmap_action *action, int err)
 {
 	/*
 	 * If an error occurs, unmap the VMA altogether and return an error. We
@@ -1353,37 +1340,38 @@ static int mmap_action_finish(struct mmap_action *action,
 /**
  * mmap_action_prepare - Perform preparatory setup for an VMA descriptor
  * action which need to be performed.
- * @desc: The VMA descriptor to prepare for @action.
- * @action: The action to perform.
+ * @desc: The VMA descriptor to prepare for its @desc->action.
+ *
+ * Returns: %0 on success, otherwise error.
  */
-void mmap_action_prepare(struct mmap_action *action,
-			 struct vm_area_desc *desc)
+int mmap_action_prepare(struct vm_area_desc *desc)
 {
-	switch (action->type) {
+	switch (desc->action.type) {
 	case MMAP_NOTHING:
-		break;
+		return 0;
 	case MMAP_REMAP_PFN:
-		remap_pfn_range_prepare(desc, action->remap.start_pfn);
-		break;
+		return remap_pfn_range_prepare(desc);
 	case MMAP_IO_REMAP_PFN:
-		io_remap_pfn_range_prepare(desc, action->remap.start_pfn,
-					   action->remap.size);
-		break;
+		return io_remap_pfn_range_prepare(desc);
 	}
+
+	WARN_ON_ONCE(1);
+	return -EINVAL;
 }
 EXPORT_SYMBOL(mmap_action_prepare);
 
 /**
  * mmap_action_complete - Execute VMA descriptor action.
- * @action: The action to perform.
  * @vma: The VMA to perform the action upon.
+ * @action: The action to perform.
  *
  * Similar to mmap_action_prepare().
  *
  * Return: 0 on success, or error, at which point the VMA will be unmapped.
  */
-int mmap_action_complete(struct mmap_action *action,
-			 struct vm_area_struct *vma)
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action)
+
 {
 	int err = 0;
 
@@ -1391,25 +1379,20 @@ int mmap_action_complete(struct mmap_action *action,
 	case MMAP_NOTHING:
 		break;
 	case MMAP_REMAP_PFN:
-		err = remap_pfn_range_complete(vma, action->remap.start,
-				action->remap.start_pfn, action->remap.size,
-				action->remap.pgprot);
+		err = remap_pfn_range_complete(vma, action);
 		break;
 	case MMAP_IO_REMAP_PFN:
-		err = io_remap_pfn_range_complete(vma, action->remap.start,
-				action->remap.start_pfn, action->remap.size,
-				action->remap.pgprot);
+		err = io_remap_pfn_range_complete(vma, action);
 		break;
 	}
 
-	return mmap_action_finish(action, vma, err);
+	return mmap_action_finish(vma, action, err);
 }
 EXPORT_SYMBOL(mmap_action_complete);
 #else
-void mmap_action_prepare(struct mmap_action *action,
-			struct vm_area_desc *desc)
+int mmap_action_prepare(struct vm_area_desc *desc)
 {
-	switch (action->type) {
+	switch (desc->action.type) {
 	case MMAP_NOTHING:
 		break;
 	case MMAP_REMAP_PFN:
@@ -1417,11 +1400,13 @@ void mmap_action_prepare(struct mmap_action *action,
 		WARN_ON_ONCE(1); /* nommu cannot handle these. */
 		break;
 	}
+
+	return 0;
 }
 EXPORT_SYMBOL(mmap_action_prepare);
 
-int mmap_action_complete(struct mmap_action *action,
-			struct vm_area_struct *vma)
+int mmap_action_complete(struct vm_area_struct *vma,
+			 struct mmap_action *action)
 {
 	int err = 0;
 
@@ -1436,7 +1421,7 @@ int mmap_action_complete(struct mmap_action *action,
 		break;
 	}
 
-	return mmap_action_finish(action, vma, err);
+	return mmap_action_finish(vma, action, err);
 }
 EXPORT_SYMBOL(mmap_action_complete);
 #endif
diff --git a/mm/vma.c b/mm/vma.c
index c1f183235756..2a86c7575000 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2640,15 +2640,18 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
 	vma_set_page_prot(vma);
 }
 
-static void call_action_prepare(struct mmap_state *map,
-				struct vm_area_desc *desc)
+static int call_action_prepare(struct mmap_state *map,
+			       struct vm_area_desc *desc)
 {
-	struct mmap_action *action = &desc->action;
+	int err;
 
-	mmap_action_prepare(action, desc);
+	err = mmap_action_prepare(desc);
+	if (err)
+		return err;
 
-	if (action->hide_from_rmap_until_complete)
+	if (desc->action.hide_from_rmap_until_complete)
 		map->hold_file_rmap_lock = true;
+	return 0;
 }
 
 /*
@@ -2672,7 +2675,9 @@ static int call_mmap_prepare(struct mmap_state *map,
 	if (err)
 		return err;
 
-	call_action_prepare(map, desc);
+	err = call_action_prepare(map, desc);
+	if (err)
+		return err;
 
 	/* Update fields permitted to be changed. */
 	map->pgoff = desc->pgoff;
@@ -2727,13 +2732,12 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
 }
 
 static int call_action_complete(struct mmap_state *map,
-				struct vm_area_desc *desc,
+				struct mmap_action *action,
 				struct vm_area_struct *vma)
 {
-	struct mmap_action *action = &desc->action;
 	int ret;
 
-	ret = mmap_action_complete(action, vma);
+	ret = mmap_action_complete(vma, action);
 
 	/* If we held the file rmap we need to release it. */
 	if (map->hold_file_rmap_lock) {
@@ -2795,7 +2799,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
 	__mmap_complete(&map, vma);
 
 	if (have_mmap_prepare && allocated_new) {
-		error = call_action_complete(&map, &desc, vma);
+		error = call_action_complete(&map, &desc.action, vma);
 
 		if (error)
 			return error;
diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
index 999357e18eb0..9eada1e0949c 100644
--- a/tools/testing/vma/include/dup.h
+++ b/tools/testing/vma/include/dup.h
@@ -1271,9 +1271,12 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
 	if (err)
 		return err;
 
-	mmap_action_prepare(&desc.action, &desc);
+	err = mmap_action_prepare(&desc);
+	if (err)
+		return err;
+
 	set_vma_from_desc(vma, &desc);
-	return mmap_action_complete(&desc.action, vma);
+	return mmap_action_complete(vma, &desc.action);
 }
 
 static inline int compat_vma_mmap(struct file *file,
diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
index 5afb0afe2d48..a30b8bc84955 100644
--- a/tools/testing/vma/include/stubs.h
+++ b/tools/testing/vma/include/stubs.h
@@ -81,13 +81,13 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
 {
 }
 
-static inline void mmap_action_prepare(struct mmap_action *action,
-					   struct vm_area_desc *desc)
+static inline int mmap_action_prepare(struct vm_area_desc *desc)
 {
+	return 0;
 }
 
-static inline int mmap_action_complete(struct mmap_action *action,
-					   struct vm_area_struct *vma)
+static inline int mmap_action_complete(struct vm_area_struct *vma,
+				       struct mmap_action *action)
 {
 	return 0;
 }
-- 
2.53.0



* [PATCH v2 00/16] mm: expand mmap_prepare functionality and usage
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 21:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts

This series expands the mmap_prepare functionality, which is intended to
replace the deprecated f_op->mmap hook which has been the source of bugs
and security issues for some time.

This series starts with some cleanup of existing mmap_prepare logic, then
adds documentation for the mmap_prepare call to make it easier for
filesystem and driver writers to understand how it works.

It then importantly adds a vm_ops->mapped hook, a key feature that was
missing from mmap_prepare previously - this is invoked when a driver which
specifies mmap_prepare has successfully been mapped but not merged with
another VMA.

mmap_prepare is invoked prior to a merge being attempted, so you cannot
manipulate state such as reference counts as if it were a new mapping.

The vm_ops->mapped hook allows a driver to perform tasks required at this
stage, and provides symmetry with subsequent vm_ops->open and
vm_ops->close calls.

The series uses this to correct the afs implementation, which wrongly
manipulated a reference count at mmap_prepare time.
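
To make the ordering concern concrete, here is a small user-space model.
Everything in it is hypothetical (the demo_* names and types are not
kernel API, and the real vm_ops->mapped signature is not shown in this
cover letter); it only illustrates why a reference should be taken once a
new, unmerged mapping exists rather than at prepare time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical driver object with a reference count. */
struct demo_obj { int refs; };

/* Hypothetical stand-in for a mapping; not the kernel's vm_area_struct. */
struct demo_mapping { struct demo_obj *obj; bool live; };

/* Prepare step: describe the mapping only, take no references. */
static void demo_prepare(struct demo_mapping *m, struct demo_obj *obj)
{
	m->obj = obj;
	m->live = false;
}

/* Models the "mapped" step: runs only for a new, unmerged mapping. */
static void demo_mapped(struct demo_mapping *m)
{
	m->obj->refs++;
	m->live = true;
}

/* Models close: drops the reference taken in demo_mapped(). */
static void demo_close(struct demo_mapping *m)
{
	assert(m->live);
	m->obj->refs--;
	m->live = false;
}

/* Returns true if a new mapping was established, false if it was merged. */
static bool demo_mmap(struct demo_mapping *m, struct demo_obj *obj, bool merged)
{
	demo_prepare(m, obj);
	if (merged)
		return false;	/* absorbed into a neighbour: no mapped call */
	demo_mapped(m);
	return true;
}
```

In this model a merged request never reaches demo_mapped(), so the count
stays balanced against demo_close(); taking the reference in
demo_prepare() instead would leak one reference per merge.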

It then adds an mmap_prepare equivalent of vm_iomap_memory() -
mmap_action_simple_ioremap(), then uses this to update a number of drivers.

It then splits out the mmap_prepare compatibility layer (which allows for
invocation of mmap_prepare hooks in an mmap() hook) in such a way as to
allow for more incremental implementation of mmap_prepare hooks.

It then uses this to extend mmap_prepare usage in drivers.

Finally it adds an mmap_prepare equivalent of vm_map_pages(), which lays
the foundation for future work which will extend mmap_prepare to DMA
coherent mappings.

v2:
* Rebased on
  https://lore.kernel.org/all/cover.1773665966.git.ljs@kernel.org/ to make
  Andrew's life easier :)
* Folded all interim fixes into series (thanks Randy for many doc fixes!)
* As per Suren, removed a comment about allocations too small to fail.
* As per Randy, fixed up typo in documentation for vm_area_desc.
* Fixed mmap_action_prepare() not returning if invalid action->type
  specified, as updated from Andrew's interim fix (thanks!) and also
  reported by kernel test bot.
* Updated mmap_action_prepare() and specific prepare functions to only
  pass vm_area_desc parameter as per Suren.
* Fixed up whitespace as per Suren.
* Updated vm_op->open comment in vm_operations_struct to reference forking
  as per Suren.
* Added a commit to check that input range is within VMA on remap as per
  Suren (this also covers I/O remap and all other cases already asserted).
* Updated AFS to not incorrectly reference count on mmap prepare as per
  Usama.
* Also updated various static AFS functions to be consistent with each
  other.
* Updated AFS commit message to reflect mmap_prepare being before any VMA
  merging as per Suren.
* Updated __compat_vma_mapped() to check for NULL vm_ops as per Usama.
* Updated __compat_vma_mapped() to not reference an unmapped VMA's fields
  as per Usama.
* Updated __vma_check_mmap_hook() to check for NULL vm_ops as per Usama.
* Dropped comment about preferring mmap_prepare as seems overly confusing,
  as per Suren.
* Updated the mmap lock assert in unmap_vma_locked() to a write lock assert
  as per Suren.
* Copied vm_ops->open comment over to VMA tests in appropriate patch as per
  Suren.
* Updated mmap_prepare documentation to reflect the fact that no resources
  should be allocated upon mmap_prepare.
* Updated mmap_prepare documentation to reference the vm_ops->mapped
  callback.
* Fixed stray markdown '## How to use' in documentation.
* Fixed bug reported by kernel test bot re: overlooked
  vma_desc_test_flags() -> vma_desc_test() in MTD driver for nommu.

v1:
https://lore.kernel.org/linux-mm/cover.1773346620.git.ljs@kernel.org/

Lorenzo Stoakes (Oracle) (16):
  mm: various small mmap_prepare cleanups
  mm: add documentation for the mmap_prepare file operation callback
  mm: document vm_operations_struct->open the same as close()
  mm: add vm_ops->mapped hook
  fs: afs: correctly drop reference count on mapping failure
  mm: add mmap_action_simple_ioremap()
  misc: open-dice: replace deprecated mmap hook with mmap_prepare
  hpet: replace deprecated mmap hook with mmap_prepare
  mtdchar: replace deprecated mmap hook with mmap_prepare, clean up
  stm: replace deprecated mmap hook with mmap_prepare
  staging: vme_user: replace deprecated mmap hook with mmap_prepare
  mm: allow handling of stacked mmap_prepare hooks in more drivers
  drivers: hv: vmbus: replace deprecated mmap hook with mmap_prepare
  uio: replace deprecated mmap hook with mmap_prepare in uio_info
  mm: add mmap_action_map_kernel_pages[_full]()
  mm: on remap assert that input range within the proposed VMA

 Documentation/driver-api/vme.rst           |   2 +-
 Documentation/filesystems/index.rst        |   1 +
 Documentation/filesystems/mmap_prepare.rst | 168 ++++++++++++++++
 drivers/char/hpet.c                        |  12 +-
 drivers/hv/hyperv_vmbus.h                  |   4 +-
 drivers/hv/vmbus_drv.c                     |  27 ++-
 drivers/hwtracing/stm/core.c               |  31 ++-
 drivers/misc/open-dice.c                   |  19 +-
 drivers/mtd/mtdchar.c                      |  21 +-
 drivers/staging/vme_user/vme.c             |  20 +-
 drivers/staging/vme_user/vme.h             |   2 +-
 drivers/staging/vme_user/vme_user.c        |  51 +++--
 drivers/target/target_core_user.c          |  26 ++-
 drivers/uio/uio.c                          |  10 +-
 drivers/uio/uio_hv_generic.c               |  11 +-
 fs/afs/file.c                              |  36 +++-
 include/linux/fs.h                         |  14 +-
 include/linux/hyperv.h                     |   4 +-
 include/linux/mm.h                         | 158 +++++++++++++--
 include/linux/mm_types.h                   |  17 +-
 include/linux/uio_driver.h                 |   4 +-
 mm/internal.h                              |  37 ++--
 mm/memory.c                                | 177 ++++++++++++-----
 mm/util.c                                  | 219 +++++++++++++++------
 mm/vma.c                                   |  59 ++++--
 mm/vma.h                                   |   2 +-
 tools/testing/vma/include/dup.h            | 141 +++++++++----
 tools/testing/vma/include/stubs.h          |   8 +-
 28 files changed, 970 insertions(+), 311 deletions(-)
 create mode 100644 Documentation/filesystems/mmap_prepare.rst

--
2.53.0


* [PATCH v2] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 21:07 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Long Li, Rob Herring, Michael Kelley, linux-hyperv, linux-pci,
	linux-kernel

When hv_pci_assign_numa_node() processes a device that does not have
HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
virtual_numa_node, the device NUMA node is left unset. On x86_64,
the uninitialized default happens to be 0, but on ARM64 it is
NUMA_NO_NODE (-1).

Tests show that when no NUMA information is available from the Hyper-V
host, devices perform best when assigned to node 0. With NUMA_NO_NODE
the kernel may spread work across NUMA nodes, which degrades
performance on Hyper-V, particularly for high-throughput devices like
MANA.

Always set the device NUMA node to 0 before the conditional NUMA
affinity check, so that devices get a performant default when the host
provides no NUMA information, and behavior is consistent on both
x86_64 and ARM64.

Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
Signed-off-by: Long Li <longli@microsoft.com>
---
Changes in v2:
- Rewrite commit message to focus on performance as the primary
  motivation: NUMA_NO_NODE causes the kernel to spread work across
  NUMA nodes, degrading performance on Hyper-V
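
For reference, the resulting selection logic can be modelled in user space
as below. This is only a sketch: struct demo_dev_desc and demo_pick_node()
are hypothetical names, and the boolean stands in for testing
HV_PCI_DEVICE_FLAG_NUMA_AFFINITY in the device description.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of what the host reports for a device. */
struct demo_dev_desc {
	bool has_numa_affinity;	/* models HV_PCI_DEVICE_FLAG_NUMA_AFFINITY */
	unsigned int virtual_numa_node;
};

/*
 * Start from node 0 and only override it when the host supplied an
 * in-range node, mirroring "set the default first, then apply the
 * conditional affinity check".
 */
static int demo_pick_node(const struct demo_dev_desc *desc,
			  unsigned int nr_possible_nodes)
{
	int node = 0;

	assert(nr_possible_nodes > 0);

	if (desc->has_numa_affinity &&
	    desc->virtual_numa_node < nr_possible_nodes)
		node = (int)desc->virtual_numa_node;

	return node;
}
```

With two possible nodes, a device without affinity info or with an
out-of-range node ends up on node 0 in this model, while a device
reporting node 1 keeps node 1.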

 drivers/pci/controller/pci-hyperv.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index 2c7a406b4ba8..38a790f642a1 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -2485,6 +2485,14 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
 		if (!hv_dev)
 			continue;
 
+		/*
+		 * If the Hyper-V host doesn't provide a NUMA node for the
+		 * device, default to node 0. With NUMA_NO_NODE the kernel
+		 * may spread work across NUMA nodes, which degrades
+		 * performance on Hyper-V.
+		 */
+		set_dev_node(&dev->dev, 0);
+
 		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
 		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
 			/*
-- 
2.43.0



* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 20:53 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157BBDC4D3A535D55B7E31BD440A@SN6PR02MB4157.namprd02.prod.outlook.com>



> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Monday, March 16, 2026 12:16 PM
> To: Long Li <longli@microsoft.com>; Michael Kelley <mhklinux@outlook.com>;
> KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Lorenzo Pieralisi <lpieralisi@kernel.org>; Krzysztof
> Wilczyński <kwilczynski@kernel.org>; Manivannan Sadhasivam
> <mani@kernel.org>; Bjorn Helgaas <bhelgaas@google.com>
> Cc: Rob Herring <robh@kernel.org>; Michael Kelley <mikelley@microsoft.com>;
> linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org; linux-
> kernel@vger.kernel.org
> Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for
> devices without affinity info
> 
> From: Long Li <longli@microsoft.com> Sent: Monday, March 16, 2026 10:38
> AM
> >
> > > Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0
> > > for devices without affinity info
> > >
> > > From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026
> > > 3:33 PM
> > > >
> > > > When a Hyper-V PCI device does not have
> > > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > > > virtual_numa_node, hv_pci_assign_numa_node() leaves the device
> > > > NUMA node unset. On x86_64, the default NUMA node happens to be 0,
> > > > but on
> > > > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior
> > > > across architectures.
> > > >
> > > > In Azure, when no NUMA information is available from the host,
> > > > devices perform best when assigned to node 0. Set the device NUMA
> > > > node to 0 unconditionally before the conditional NUMA affinity
> > > > check, so that devices always get a valid default and behavior is
> > > > consistent on both
> > > > x86_64 and ARM64.
> > >
> > > I'm wondering if this is the right overall approach to the inconsistency.
> > > Arguably, the arm64 value of NUMA_NO_NODE is more correct when the
> > > Hyper- V host has not provided any NUMA information to the guest.
> > > Maybe the x86/x64 side should be changed to default to NUMA_NO_NODE
> > > when there's no NUMA information provided.
> >
> > Tests have shown when Azure doesn't provide NUMA information for a PCI
> > device, workloads runs best when the node defaults to 0. NUMA_NO_NODE
> > results in performance degradation on ARM64. This affects most
> > high-performance devices like MANA when tested to line limit.
> >
> > >
> > > The observed x86/x64 default of NUMA node 0 does not come from
> > > x86/x64 architecture specific PCI code. It's a Hyper-V specific
> > > behavior due to how
> > > hv_pci_probe() allocates the struct hv_pcibus_device, with its
> > > embedded struct pci_sysdata. That struct pci_sysdata has a "node"
> > > field that the x86/x64
> > > __pcibus_to_node() function accesses when called from pci_device_add().
> > > If hv_pci_probe() were to initialize that "node" field to
> > > NUMA_NO_NODE at the same time that it sets the "domain" field,
> > > x86/x64 guests on Hyper-V would see the PCI device NUMA node default
> > > to NUMA_NO_NODE like on arm64. The current behavior of letting the
> > > sysdata "node" field stay zero as allocated might just be an historical
> > > oversight that no one noticed.
> >
> > I agree this was an oversight in the original X64 code, in that it
> > sets to numa node 0 by chance. But it turns out to be the ideal node
> > configuration for Azure when affinity information is not available
> > through the vPCI. (i.e. non isolated VM sizes). This results in
> > X64 perform better than ARM64 on multiple NUMA non-isolated VM sizes.
> >
> > >
> > > Are there any observed problems on arm64 with the default being
> > > NUMA_NO_NODE? If there are such problems, they should be fixed
> > > separately since that case needs to work for a kernel built with
> > > CONFIG_NUMA=n.
> > > pcibus_to_node() will return NUMA_NO_NODE, making the default on
> > > x86/x64 be NUMA_NO_NODE as well.
> > >
> > > I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(),
> > > and didn't see any obviously problems in an x86/x64 Azure VM with a
> > > MANA VF and multiple NVMe pass-thru devices. The NUMA node reported
> > > in /sys for these PCI devices is indeed NUMA_NO_NODE.
> > > But maybe there's some other issue that I'm not aware of.
> >
> > Extensive tests have shown defaulting NUMA node to 0 preserved the
> > existing behavior on X64, while improving performance on ARM64,
> > especially for MANA. This has been confirmed by the Hyper-V team, and
> > Windows VM uses the same values for defaults.
> 
> Ah, OK.  That makes sense.  I'd suggest doing a new version of the patch with
> the commit message and the code comment describing performance as the
> main reason for the patch.  You somewhat said that in your current commit
> message, but it got muddled with the compatibility discussion, and the code
> comment just mentions compatibility. Compatibility between x86/x64 and
> arm64 isn't really the issue. The idea is that hv_pci_assign_numa_node() should
> always set the NUMA node to something, rather than depending on the default,
> which might be NUMA_NO_NODE. If the Hyper-V host provides a NUMA node,
> use that. But if not, use node 0 because that is usually where the underlying
> hardware actually has the physical device attached. Node 0 might not be right
> in certain situations, but if Hyper-V doesn't provide more information to the
> guest, guessing node 0 is better than letting the Linux kernel do something like
> load balancing across NUMA nodes, which could happen with
> NUMA_NO_NODE.  (At least, that's what I think happens!)
> 
> Michael

I'm adding the performance part to the commit message in v2. The compatibility part is still valid in that we want consistent kernel behavior on X64 and on ARM64.

Long

> 
> >
> > Thanks,
> >
> > Long
> >
> > >
> > > Michael
> > >
> > > >
> > > > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and
> > > > support PCI_BUS_RELATIONS2")
> > > > Signed-off-by: Long Li <longli@microsoft.com>
> > > > ---
> > > >  drivers/pci/controller/pci-hyperv.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > > b/drivers/pci/controller/pci-hyperv.c
> > > > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct
> hv_pcibus_device *hbus)
> > > >  		if (!hv_dev)
> > > >  			continue;
> > > >
> > > > +		/* Default to node 0 for consistent behavior across
> architectures */
> > > > +		set_dev_node(&dev->dev, 0);
> > > > +
> > > >  		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> > > >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> > > >  			/*
> > > > --
> > > > 2.43.0
> > > >
> >



* RE: [EXTERNAL] Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Long Li @ 2026-03-16 20:50 UTC (permalink / raw)
  To: Leon Romanovsky, Erni Sri Satya Vennela
  Cc: Konstantin Taranov, Jason Gunthorpe, linux-rdma@vger.kernel.org,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260316194929.GI61385@unreal>

> On Thu, Mar 12, 2026 at 11:16:41AM -0700, Erni Sri Satya Vennela wrote:
> > As part of MANA hardening for CVM, clamp hardware-reported adapter
> > capability values from the MANA_IB_GET_ADAPTER_CAP response before
> > they are used by the IB subsystem.
> >
> > The response fields (max_qp_count, max_cq_count, max_mr_count,
> > max_pd_count, max_inbound_read_limit, max_outbound_read_limit,
> > max_qp_wr, max_send_sge_count, max_recv_sge_count) are u32 but are
> > assigned to signed int members in struct ib_device_attr. If hardware
> > returns a value exceeding INT_MAX, the implicit u32-to-int conversion
> > produces a negative value, which can cause incorrect behavior in the
> > IB core and userspace applications.
> 
> This sentence does not make sense in the context of the Linux kernel.
> The fundamental assumption is that the underlying hardware behaves correctly,
> and driver code should not attempt to guard against purely hypothetical
> failures. The kernel only implements such self‑protection when there is a
> documented hardware issue accompanied by official errata.
> 
> Thanks

The idea is that malicious hardware must not be able to corrupt or steal other data from the kernel.

The assumption is that in a public cloud environment, you can't trust the hardware 100%.
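
As a rough userspace sketch of the clamp under discussion (cap_to_int() is a made-up name for illustration, not the helper the patch uses; in-kernel code would presumably use something like min_t() instead):

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patch: clamp a device-reported
 * u32 capability so it fits a signed int field without turning negative.
 */
static int cap_to_int(uint32_t reported)
{
	int out = reported > (uint32_t)INT_MAX ? INT_MAX : (int)reported;

	/* After the clamp the value can no longer be negative. */
	assert(out >= 0);
	return out;
}
```

Without the clamp, a reported value such as 0xFFFFFFFF would be stored in the int member as a negative number, which is the case the commit message describes.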


* Re: [PATCH rdma-next 0/8] RDMA/mana_ib: Handle service reset for RDMA resources
From: Leon Romanovsky @ 2026-03-16 20:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
	Paolo Abeni, Eric Dumazet, Andrew Lunn, Haiyang Zhang,
	K . Y . Srinivasan, Wei Liu, Dexuan Cui, Simon Horman, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260313165928.GH1704121@ziepe.ca>

On Fri, Mar 13, 2026 at 01:59:28PM -0300, Jason Gunthorpe wrote:
> On Sat, Mar 07, 2026 at 07:38:14PM +0200, Leon Romanovsky wrote:
> > On Fri, Mar 06, 2026 at 05:47:14PM -0800, Long Li wrote:
> > > When the MANA hardware undergoes a service reset, the ETH auxiliary device
> > > (mana.eth) used by DPDK persists across the reset cycle — it is not removed
> > > and re-added like RC/UD/GSI QPs. This means userspace RDMA consumers such
> > > as DPDK have no way of knowing that firmware handles for their PD, CQ, WQ,
> > > QP and MR resources have become stale.
> > 
> > NAK to any of this.
> > 
> > In case of hardware reset, mana_ib AUX device needs to be destroyed and
> > recreated later.
> 
> Yeah, that is our general model for any serious RAS event where the
> driver's view of resources becomes out of sync with the HW.
> 
> > You have to tear down the ib_device by removing the aux and then bring
> back a new one.
> 
> There is an IB_EVENT_DEVICE_FATAL, but the purpose of that event is to
> tell userspace to close and re-open their uverbs FD.
> 
> We don't have a model where a uverbs FD in userspace can continue to
> > work after the device has a catastrophic RAS event.
> 
> There may be room to have a model where the ib device doesn't fully
> unplug/replug so it retains its name and things, but that is core code
> not driver stuff.

Good luck with that model. It is going to break RDMA-CM hotplug support.

Thanks

> 
> Jason
> 


* Re: [PATCH rdma-next v2] RDMA/mana_ib: hardening: Clamp adapter capability values from MANA_IB_GET_ADAPTER_CAP
From: Leon Romanovsky @ 2026-03-16 19:49 UTC (permalink / raw)
  To: Erni Sri Satya Vennela
  Cc: longli, kotaranov, Jason Gunthorpe, linux-rdma, linux-hyperv,
	linux-kernel
In-Reply-To: <20260312181642.989735-1-ernis@linux.microsoft.com>

On Thu, Mar 12, 2026 at 11:16:41AM -0700, Erni Sri Satya Vennela wrote:
> As part of MANA hardening for CVM, clamp hardware-reported adapter
> capability values from the MANA_IB_GET_ADAPTER_CAP response before
> they are used by the IB subsystem.
> 
> The response fields (max_qp_count, max_cq_count, max_mr_count,
> max_pd_count, max_inbound_read_limit, max_outbound_read_limit,
> max_qp_wr, max_send_sge_count, max_recv_sge_count) are u32 but are
> assigned to signed int members in struct ib_device_attr. If hardware
> returns a value exceeding INT_MAX, the implicit u32-to-int conversion
> produces a negative value, which can cause incorrect behavior in the
> IB core and userspace applications.

This sentence does not make sense in the context of the Linux kernel.
The fundamental assumption is that the underlying hardware behaves
correctly, and driver code should not attempt to guard against purely
hypothetical failures. The kernel only implements such self-protection
when there is a documented hardware issue accompanied by official
errata.

Thanks


* Re: [PATCH 02/15] mm: add documentation for the mmap_prepare file operation callback
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 19:16 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpGd702=Xop3X5Aop9rrScdiAOQEEooTu1gcJqR9pmO5GA@mail.gmail.com>

On Sun, Mar 15, 2026 at 04:23:14PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > This documentation makes it easier for a driver/file system implementer to
> > correctly use this callback.
> >
> > It covers the fundamentals, whilst intentionally leaving the less lovely
> > possible actions one might take undocumented (for instance - the
> > success_hook, error_hook fields in mmap_action).
> >
> > The document also covers the new VMA flags implementation which is the only
> > one which will work correctly with mmap_prepare.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst | 131 +++++++++++++++++++++
> >  1 file changed, 131 insertions(+)
> >  create mode 100644 Documentation/filesystems/mmap_prepare.rst
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > new file mode 100644
> > index 000000000000..76908200f3a1
> > --- /dev/null
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -0,0 +1,131 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===========================
> > +mmap_prepare callback HOWTO
> > +===========================
> > +
> > +Introduction
> > +############
> > +
> > +The `struct file->f_op->mmap()` callback has been deprecated as it is both a
> > +stability and security risk, and doesn't always permit the merging of adjacent
> > +mappings resulting in unnecessary memory fragmentation.
> > +
> > +It has been replaced with the `file->f_op->mmap_prepare()` callback which solves
> > +these problems.
> > +
> > > +How To Use
> > > +==========
> > +
> > +In your driver's `struct file_operations` struct, specify an `mmap_prepare`
> > +callback rather than an `mmap` one, e.g. for ext4:
> > +
> > +
> > +.. code-block:: C
> > +
> > +    const struct file_operations ext4_file_operations = {
> > +        ...
> > +        .mmap_prepare    = ext4_file_mmap_prepare,
> > +    };
> > +
> > +This has a signature of `int (*mmap_prepare)(struct vm_area_desc *)`.
> > +
> > +Examining the `struct vm_area_desc` type:
> > +
> > +.. code-block:: C
> > +
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
> > +
> > +        /* Mutable fields. Populated with initial state. */
> > +        pgoff_t pgoff;
> > +        struct file *vm_file;
> > +        vma_flags_t vma_flags;
> > +        pgprot_t page_prot;
> > +
> > +        /* Write-only fields. */
> > +        const struct vm_operations_struct *vm_ops;
> > +        void *private_data;
> > +
> > +        /* Take further action? */
> > +        struct mmap_action action;
>
> So, action still belongs to /* Write-only fields. */ section? This is
> nitpicky, but it might be better to have this as:
>
>         /* Write-only fields. */
>         const struct vm_operations_struct *vm_ops;
>         void *private_data;
>         struct mmap_action action; /* Take further action? */

Absolutely not. This field is not to be written to by the user.

We sadly have to allow hugetlb to do some hacks, but these are things we don't
want to point out.

Users should use mmap_action_xxx() functions.

>
> > +    };
> > +
> > +This is straightforward - you have all the fields you need to set up the
> > +mapping, and you can update the mutable and writable fields, for instance:
> > +
> > > +.. code-block:: C
> > +
> > +    static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
> > +    {
> > +        int ret;
> > +        struct file *file = desc->file;
> > +        struct inode *inode = file->f_mapping->host;
> > +
> > +        ...
> > +
> > +        file_accessed(file);
> > +        if (IS_DAX(file_inode(file))) {
> > +            desc->vm_ops = &ext4_dax_vm_ops;
> > +            vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT);
> > +        } else {
> > +            desc->vm_ops = &ext4_file_vm_ops;
> > +        }
> > +        return 0;
> > +    }
> > +
> > +Importantly, you no longer have to dance around with reference counts or locks
> > > +when updating these fields - **you can simply go ahead and change them**.
> > +
> > +Everything is taken care of by the mapping code.
> > +
> > +VMA Flags
> > +=========
> > +
> > +Along with `mmap_prepare`, VMA flags have undergone an overhaul. Where before
> > +you would invoke one of `vm_flags_init()`, `vm_flags_reset()`, `vm_flags_set()`,
> > +`vm_flags_clear()`, and `vm_flags_mod()` to modify flags (and to have the
> > > +locking done correctly for you), this is no longer necessary.
> > +
> > +Also, the legacy approach of specifying VMA flags via `VM_READ`, `VM_WRITE`,
> > +etc. - i.e. using a `VM_xxx` macro has changed too.
> > +
> > +When implementing `mmap_prepare()`, reference flags by their bit number, defined
> > +as a `VMA_xxx_BIT` macro, e.g. `VMA_READ_BIT`, `VMA_WRITE_BIT` etc., and use one
> > > +of (where `desc` is a pointer to `struct vm_area_desc`):
> > +
> > +* `vma_desc_test_flags(desc, ...)` - Specify a comma-separated list of flags you
> > +  wish to test for (whether _any_ are set), e.g. - `vma_desc_test_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)` - returns `true` if either are set,
> > +  otherwise `false`.
> > +* `vma_desc_set_flags(desc, ...)` - Update the VMA descriptor flags to set
> > +  additional flags specified by a comma-separated list,
> > +  e.g. - `vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)`.
> > +* `vma_desc_clear_flags(desc, ...)` - Update the VMA descriptor flags to clear
> > +  flags specified by a comma-separated list, e.g. - `vma_desc_clear_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`.
> > +
> > +Actions
> > +=======
> > +
> > +You can now very easily have actions be performed upon a mapping once set up by
> > +utilising simple helper functions invoked upon the `struct vm_area_desc`
> > +pointer. These are:
> > +
> > +* `mmap_action_remap()` - Remaps a range consisting only of PFNs for a specific
> > > +  range starting at a virtual address and PFN number of a set size.
> > +
> > +* `mmap_action_remap_full()` - Same as `mmap_action_remap()`, only remaps the
> > +  entire mapping from `start_pfn` onward.
> > +
> > +* `mmap_action_ioremap()` - Same as `mmap_action_remap()`, only performs an I/O
> > +  remap.
> > +
> > +* `mmap_action_ioremap_full()` - Same as `mmap_action_ioremap()`, only remaps
> > +  the entire mapping from `start_pfn` onward.
> > +
> > +**NOTE:** The 'action' field should never normally be manipulated directly,
> > +rather you ought to use one of these helpers.
>
> I'm guessing the start and size parameters passed to
> mmap_action_remap() and such are restricted by vm_area_desc.start
> vm_area_desc.end. If so, should we document those restrictions and
> enforce them in the code?

I mean it's the same restrictions that all of the functions already apply if
you were to use them with a VMA descriptor.

I think implicitly a remap will fail if you try it out of the VMA range at the
point of applying the change.

But it might be worth adding range_in_vma_desc() checks at prepare time, will
see if I can do that for the respin.

I think it's pretty obvious that you shouldn't be trying to remap totally
unrelated memory, so I'm not sure that's at a level of granularity that's suited
to this document though.
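
For illustration only, a prepare-time check along those lines might look roughly like the userspace sketch below. struct desc_range and remap_range_ok() are hypothetical stand-ins, not the real struct vm_area_desc or range_in_vma_desc(), and the wrap-around test is an assumption about what such a check would want rather than something from the series:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two descriptor fields used here. */
struct desc_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* True if [start, start + size) lies within [desc->start, desc->end). */
static bool remap_range_ok(const struct desc_range *desc,
			   unsigned long start, unsigned long size)
{
	unsigned long end = start + size;

	assert(desc->start <= desc->end);

	/* Reject a range whose end wrapped around past zero. */
	if (end < start)
		return false;

	return desc->start <= start && end <= desc->end;
}
```

Both ranges are half-open, so a remap ending exactly at the descriptor's end is accepted, while one that starts before it or runs past it is rejected.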

>
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
>
>
> > --
> > 2.53.0
> >


* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Michael Kelley @ 2026-03-16 19:16 UTC (permalink / raw)
  To: Long Li, Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SA1PR21MB66837DDAF5F203E832DA5339CE40A@SA1PR21MB6683.namprd21.prod.outlook.com>

From: Long Li <longli@microsoft.com> Sent: Monday, March 16, 2026 10:38 AM
> 
> > Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices
> > without affinity info
> >
> > From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> > >
> > > When a Hyper-V PCI device does not have
> > > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > > virtual_numa_node, hv_pci_assign_numa_node() leaves the device NUMA
> > > node unset. On x86_64, the default NUMA node happens to be 0, but on
> > > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior across
> > > architectures.
> > >
> > > In Azure, when no NUMA information is available from the host, devices
> > > perform best when assigned to node 0. Set the device NUMA node to 0
> > > unconditionally before the conditional NUMA affinity check, so that
> > > devices always get a valid default and behavior is consistent on both
> > > x86_64 and ARM64.
> >
> > I'm wondering if this is the right overall approach to the inconsistency.
> > Arguably, the arm64 value of NUMA_NO_NODE is more correct when the Hyper-
> > V host has not provided any NUMA information to the guest. Maybe the x86/x64
> > side should be changed to default to NUMA_NO_NODE when there's no NUMA
> > information provided.
> 
> Tests have shown when Azure doesn't provide NUMA information for a PCI device,
> workloads run best when the node defaults to 0. NUMA_NO_NODE results in
> performance degradation on ARM64. This affects most high-performance devices like
> MANA when tested to line limit.
> 
> >
> > The observed x86/x64 default of NUMA node 0 does not come from x86/x64
> > architecture specific PCI code. It's a Hyper-V specific behavior due to how
> > hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded struct
> > pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
> > __pcibus_to_node() function accesses when called from pci_device_add().
> > If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at the
> > same time that it sets the "domain" field, x86/x64 guests on Hyper-V would see
> > the PCI device NUMA node default to NUMA_NO_NODE like on arm64. The
> > current behavior of letting the sysdata "node" field stay zero as allocated might
> > just be an historical oversight that no one noticed.
> 
> I agree this was an oversight in the original X64 code, in that it sets to numa node 0 by
> chance. But it turns out to be the ideal node configuration for Azure when affinity
> information is not available through the vPCI. (i.e. non isolated VM sizes). This results in
> X64 performing better than ARM64 on multiple NUMA non-isolated VM sizes.
> 
> >
> > Are there any observed problems on arm64 with the default being
> > NUMA_NO_NODE? If there are such problems, they should be fixed separately
> > since that case needs to work for a kernel built with CONFIG_NUMA=n.
> > pcibus_to_node() will return NUMA_NO_NODE, making the default on x86/x64
> > be NUMA_NO_NODE as well.
> >
> > I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(), and
> > didn't see any obvious problems in an x86/x64 Azure VM with a MANA VF and
> > multiple NVMe pass-thru devices. The NUMA node reported in /sys for these PCI
> > devices is indeed NUMA_NO_NODE.
> > But maybe there's some other issue that I'm not aware of.
> 
> Extensive tests have shown defaulting NUMA node to 0 preserved the existing behavior
> on X64, while improving performance on ARM64, especially for MANA. This has been
> confirmed by the Hyper-V team, and Windows VM uses the same values for defaults.

Ah, OK.  That makes sense.  I'd suggest doing a new version of the patch with
the commit message and the code comment describing performance as the
main reason for the patch.  You somewhat said that in your current commit
message, but it got muddled with the compatibility discussion, and the code
comment just mentions compatibility. Compatibility between x86/x64 and
arm64 isn't really the issue. The idea is that hv_pci_assign_numa_node() should
always set the NUMA node to something, rather than depending on the default,
which might be NUMA_NO_NODE. If the Hyper-V host provides a NUMA node,
use that. But if not, use node 0 because that is usually where the underlying
hardware actually has the physical device attached. Node 0 might not be
right in certain situations, but if Hyper-V doesn't provide more information
to the guest, guessing node 0 is better than letting the Linux kernel do
something like load balancing across NUMA nodes, which could happen
with NUMA_NO_NODE.  (At least, that's what I think happens!)
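
Roughly, the selection I have in mind looks like the userspace sketch below. pick_numa_node() and DEMO_FLAG_NUMA_AFFINITY are made-up names and values for illustration; the real code works on hv_dev->desc and calls set_dev_node():

```c
#include <assert.h>
#include <stdint.h>

/* Made-up stand-in value for the host's "NUMA affinity is valid" flag. */
#define DEMO_FLAG_NUMA_AFFINITY 0x1u

/*
 * Always returns a concrete node: the host-provided one when it is
 * flagged as valid and in range, node 0 otherwise.
 */
static int pick_numa_node(uint32_t flags, uint32_t host_node,
			  uint32_t nr_possible_nodes)
{
	assert(nr_possible_nodes > 0);

	if ((flags & DEMO_FLAG_NUMA_AFFINITY) && host_node < nr_possible_nodes)
		return (int)host_node;

	/* No usable information from the host: fall back to node 0. */
	return 0;
}
```

The point is that the function never leaves the result to whatever default the architecture or the allocation happened to provide.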

Michael

> 
> Thanks,
> 
> Long
> 
> >
> > Michael
> >
> > >
> > > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
> > > Signed-off-by: Long Li <longli@microsoft.com>
> > > ---
> > >  drivers/pci/controller/pci-hyperv.c | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/drivers/pci/controller/pci-hyperv.c
> > > b/drivers/pci/controller/pci-hyperv.c
> > > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > > --- a/drivers/pci/controller/pci-hyperv.c
> > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
> > >  		if (!hv_dev)
> > >  			continue;
> > >
> > > +		/* Default to node 0 for consistent behavior across architectures */
> > > +		set_dev_node(&dev->dev, 0);
> > > +
> > >  		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> > >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> > >  			/*
> > > --
> > > 2.43.0
> > >
> 



* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Long Li @ 2026-03-16 17:38 UTC (permalink / raw)
  To: Michael Kelley, KY Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415748A42DCBDD8AB635838DD440A@SN6PR02MB4157.namprd02.prod.outlook.com>

> Subject: [EXTERNAL] RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices
> without affinity info
> 
> From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> >
> > When a Hyper-V PCI device does not have
> > HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> > virtual_numa_node, hv_pci_assign_numa_node() leaves the device NUMA
> > node unset. On x86_64, the default NUMA node happens to be 0, but on
> > ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent behavior across
> > architectures.
> >
> > In Azure, when no NUMA information is available from the host, devices
> > perform best when assigned to node 0. Set the device NUMA node to 0
> > unconditionally before the conditional NUMA affinity check, so that
> > devices always get a valid default and behavior is consistent on both
> > x86_64 and ARM64.
> 
> I'm wondering if this is the right overall approach to the inconsistency.
> Arguably, the arm64 value of NUMA_NO_NODE is more correct when the Hyper-
> V host has not provided any NUMA information to the guest. Maybe the x86/x64
> side should be changed to default to NUMA_NO_NODE when there's no NUMA
> information provided.

Tests have shown that when Azure doesn't provide NUMA information for a PCI device, workloads run best when the node defaults to 0. NUMA_NO_NODE results in performance degradation on ARM64. This affects most high-performance devices like MANA when tested to line limit.

> 
> The observed x86/x64 default of NUMA node 0 does not come from x86/x64
> architecture specific PCI code. It's a Hyper-V specific behavior due to how
> hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded struct
> pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
> __pcibus_to_node() function accesses when called from pci_device_add().
> If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at the
> same time that it sets the "domain" field, x86/x64 guests on Hyper-V would see
> the PCI device NUMA node default to NUMA_NO_NODE like on arm64. The
> current behavior of letting the sysdata "node" field stay zero as allocated might
> just be an historical oversight that no one noticed.

I agree this was an oversight in the original X64 code, in that it sets to numa node 0 by chance. But it turns out to be the ideal node configuration for Azure when affinity information is not available through the vPCI. (i.e. non isolated VM sizes). This results in X64 performing better than ARM64 on multiple NUMA non-isolated VM sizes.

> 
> Are there any observed problems on arm64 with the default being
> NUMA_NO_NODE? If there are such problems, they should be fixed separately
> since that case needs to work for a kernel built with CONFIG_NUMA=n.
> pcibus_to_node() will return NUMA_NO_NODE, making the default on x86/x64
> be NUMA_NO_NODE as well.
> 
> I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(), and
> didn't see any obvious problems in an x86/x64 Azure VM with a MANA VF and
> multiple NVMe pass-thru devices. The NUMA node reported in /sys for these PCI
> devices is indeed NUMA_NO_NODE.
> But maybe there's some other issue that I'm not aware of.

Extensive tests have shown defaulting NUMA node to 0 preserved the existing behavior on X64, while improving performance on ARM64, especially for MANA. This has been confirmed by the Hyper-V team, and Windows VM uses the same values for defaults.

Thanks,

Long

> 
> Michael
> 
> >
> > Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and
> > support PCI_BUS_RELATIONS2")
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> >  drivers/pci/controller/pci-hyperv.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/pci/controller/pci-hyperv.c
> > b/drivers/pci/controller/pci-hyperv.c
> > index 2c7a406b4ba8..5c03b6e4cdab 100644
> > --- a/drivers/pci/controller/pci-hyperv.c
> > +++ b/drivers/pci/controller/pci-hyperv.c
> > @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct
> hv_pcibus_device *hbus)
> >  		if (!hv_dev)
> >  			continue;
> >
> > +		/* Default to node 0 for consistent behavior across architectures
> */
> > +		set_dev_node(&dev->dev, 0);
> > +
> >  		if (hv_dev->desc.flags &
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
> >  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
> >  			/*
> > --
> > 2.43.0
> >



* RE: [PATCH] PCI: hv: Set default NUMA node to 0 for devices without affinity info
From: Michael Kelley @ 2026-03-16 17:11 UTC (permalink / raw)
  To: Long Li, K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Lorenzo Pieralisi, Krzysztof Wilczyński,
	Manivannan Sadhasivam, Bjorn Helgaas
  Cc: Rob Herring, Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260312223244.1006305-1-longli@microsoft.com>

From: Long Li <longli@microsoft.com> Sent: Thursday, March 12, 2026 3:33 PM
> 
> When a Hyper-V PCI device does not have
> HV_PCI_DEVICE_FLAG_NUMA_AFFINITY set or has an out-of-range
> virtual_numa_node, hv_pci_assign_numa_node() leaves the device
> NUMA node unset. On x86_64, the default NUMA node happens to be
> 0, but on ARM64 it is NUMA_NO_NODE (-1), leading to inconsistent
> behavior across architectures.
> 
> In Azure, when no NUMA information is available from the host,
> devices perform best when assigned to node 0. Set the device NUMA
> node to 0 unconditionally before the conditional NUMA affinity
> check, so that devices always get a valid default and behavior is
> consistent on both x86_64 and ARM64.

I'm wondering if this is the right overall approach to the inconsistency.
Arguably, the arm64 value of NUMA_NO_NODE is more correct when the
Hyper-V host has not provided any NUMA information to the guest. Maybe
the x86/x64 side should be changed to default to NUMA_NO_NODE when
there's no NUMA information provided.

The observed x86/x64 default of NUMA node 0 does not come from x86/x64
architecture specific PCI code. It's a Hyper-V specific behavior due to how
hv_pci_probe() allocates the struct hv_pcibus_device, with its embedded
struct pci_sysdata. That struct pci_sysdata has a "node" field that the x86/x64
__pcibus_to_node() function accesses when called from pci_device_add().
If hv_pci_probe() were to initialize that "node" field to NUMA_NO_NODE at
the same time that it sets the "domain" field, x86/x64 guests on Hyper-V
would see the PCI device NUMA node default to NUMA_NO_NODE like on
arm64. The current behavior of letting the sysdata "node" field stay zero
as allocated might just be an historical oversight that no one noticed.
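
To make the mechanism concrete, here is a small userspace sketch. struct demo_sysdata and both functions are hypothetical, cut-down stand-ins for illustration, not the actual pci_sysdata or hv_pci_probe() code:

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_NO_NODE (-1)	/* same value as NUMA_NO_NODE */

/* Hypothetical cut-down stand-in for the sysdata embedded in the bus struct. */
struct demo_sysdata {
	int domain;
	int node;
};

/* Zeroed allocation; only "domain" is set, so "node" silently stays 0. */
static struct demo_sysdata *alloc_as_today(int domain)
{
	struct demo_sysdata *sd = calloc(1, sizeof(*sd));

	if (sd)
		sd->domain = domain;

	/* Nothing wrote "node", so it still holds the zero from calloc(). */
	assert(!sd || sd->node == 0);
	return sd;
}

/* Same allocation, but "node" is initialized alongside "domain". */
static struct demo_sysdata *alloc_with_no_node(int domain)
{
	struct demo_sysdata *sd = alloc_as_today(domain);

	if (sd)
		sd->node = DEMO_NO_NODE;
	return sd;
}
```

With the first variant, anything that later reads "node" sees 0 and treats it as a real NUMA node; with the second it sees the "no node" value, matching what arm64 reports.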

Are there any observed problems on arm64 with the default being
NUMA_NO_NODE? If there are such problems, they should be fixed
separately since that case needs to work for a kernel built with
CONFIG_NUMA=n. pcibus_to_node() will return NUMA_NO_NODE,
making the default on x86/x64 be NUMA_NO_NODE as well.

I've tested setting sysdata->node to NUMA_NO_NODE in hv_pci_probe(),
and didn't see any obvious problems in an x86/x64 Azure VM with a
MANA VF and multiple NVMe pass-thru devices. The NUMA node
reported in /sys for these PCI devices is indeed NUMA_NO_NODE.
But maybe there's some other issue that I'm not aware of.

Michael

> 
> Fixes: 999dd956d838 ("PCI: hv: Add support for protocol 1.3 and support PCI_BUS_RELATIONS2")
> Signed-off-by: Long Li <longli@microsoft.com>
> ---
>  drivers/pci/controller/pci-hyperv.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c7a406b4ba8..5c03b6e4cdab 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -2485,6 +2485,9 @@ static void hv_pci_assign_numa_node(struct hv_pcibus_device *hbus)
>  		if (!hv_dev)
>  			continue;
> 
> +		/* Default to node 0 for consistent behavior across architectures */
> +		set_dev_node(&dev->dev, 0);
> +
>  		if (hv_dev->desc.flags & HV_PCI_DEVICE_FLAG_NUMA_AFFINITY &&
>  		    hv_dev->desc.virtual_numa_node < num_possible_nodes())
>  			/*
> --
> 2.43.0
> 



* Re: [PATCH 15/15] mm: add mmap_action_map_kernel_pages[_full]()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:54 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <4fd15134-ae1e-4233-8d5a-9d1e0b9f94dc@infradead.org>

On Thu, Mar 12, 2026 at 04:15:26PM -0700, Randy Dunlap wrote:
>
> On 3/12/26 1:27 PM, Lorenzo Stoakes (Oracle) wrote:
>
> > Finally, we update the VMA tests accordingly to reflect the changes.
>
> IMO we could omit the word "we" 5 times above.
> (but no change is required)
>
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 88f42faeb377..88ad5649c02d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
>
> > +/**
> > + * range_is_subset - Is the specified inner range a subset of the outer range?
> > + * @outer_start: The start of the outer range.
> > + * @outer_end: The exclusive end of the outer range.
> > + * @inner_start: The start of the inner range.
> > + * @inner_end: The exclusive end of the inner range.
> > + *
> > + * Returns %true if [inner_start, inner_end) is a subset of [outer_start,
>
>     * Returns:
> (for kernel-doc)

Ack

>
> > + * outer_end), otherwise %false.
> > + */
> > +static inline bool range_is_subset(unsigned long outer_start,
> > +				   unsigned long outer_end,
> > +				   unsigned long inner_start,
> > +				   unsigned long inner_end)
> > +{
> > +	return outer_start <= inner_start && inner_end <= outer_end;
> > +}
> > +
> > +/**
> > + * range_in_vma - is the specified [@start, @end) range a subset of the VMA?
> > + * @vma: The VMA against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns %true if [@start, @end) is a subset of [@vma->vm_start,
>
>     * Returns:

Ack

>
> > + * @vma->vm_end), %false otherwise.
> > + */
> >  static inline bool range_in_vma(const struct vm_area_struct *vma,
> >  				unsigned long start, unsigned long end)
> >  {
> > -	return (vma && vma->vm_start <= start && end <= vma->vm_end);
> > +	if (!vma)
> > +		return false;
> > +
> > +	return range_is_subset(vma->vm_start, vma->vm_end, start, end);
> > +}
> > +
> > +/**
> > + * range_in_vma_desc - is the specified [@start, @end) range a subset of the VMA
> > + * described by @desc, a VMA descriptor?
> > + * @desc: The VMA descriptor against which we want to check [@start, @end).
> > + * @start: The start of the range we wish to check.
> > + * @end: The exclusive end of the range we wish to check.
> > + *
> > + * Returns %true if [@start, @end) is a subset of [@desc->start, @desc->end),
>
>     * Returns:

Ack, I think in general I've seen (or believe I've seen :) other cases without
the colon, so was kinda imitating, but I may also be imagining that ;)

>
> > + * %false otherwise.
> > + */
> > +static inline bool range_in_vma_desc(const struct vm_area_desc *desc,
> > +				     unsigned long start, unsigned long end)
> > +{
> > +	if (!desc)
> > +		return false;
> > +
> > +	return range_is_subset(desc->start, desc->end, start, end);
> >  }
>
> --
> ~Randy
>

Will also fold these changes into the respin!

Cheers, Lorenzo
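
For readers following the half-open range semantics discussed above, here is a
minimal userspace sketch. range_is_subset() is copied from the quoted patch;
struct mock_range_desc and mock_range_in_desc() are hypothetical stand-ins for
struct vm_area_desc and range_in_vma_desc(), introduced only so the example
builds outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Copy of the quoted helper; both ranges are half-open: [start, end). */
static inline bool range_is_subset(unsigned long outer_start,
				   unsigned long outer_end,
				   unsigned long inner_start,
				   unsigned long inner_end)
{
	return outer_start <= inner_start && inner_end <= outer_end;
}

/* Hypothetical stand-in for the two struct vm_area_desc fields used here. */
struct mock_range_desc {
	unsigned long start;
	unsigned long end;
};

/* Mirrors the shape of range_in_vma_desc(): NULL is never a match. */
static inline bool mock_range_in_desc(const struct mock_range_desc *desc,
				      unsigned long start, unsigned long end)
{
	if (!desc)
		return false;

	return range_is_subset(desc->start, desc->end, start, end);
}
```

Because the end is exclusive, a range ending exactly at desc->end is inside the
descriptor, while one ending a byte later is not. Note that the helper as
written does not itself check start <= end; callers are presumably expected to
pass a well-formed range.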

^ permalink raw reply

* Re: [PATCH 02/15] mm: add documentation for the mmap_prepare file operation callback
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:51 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko, Jann Horn,
	Pedro Falcato, linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <f0e33b51-d465-462d-b0f6-98a1db66bb15@infradead.org>

On Thu, Mar 12, 2026 at 05:12:04PM -0700, Randy Dunlap wrote:
> (Andrew: patch attached)
>
>
> On 3/12/26 1:27 PM, Lorenzo Stoakes (Oracle) wrote:
>
> Documentation/filesystems/mmap_prepare.rst: WARNING: document isn't included in any toctree [toc.not_included]
>
> Should be in some index.rst file. In filesystems I suppose.

Ack thanks.

>
> > ---
> >  Documentation/filesystems/mmap_prepare.rst | 131 +++++++++++++++++++++
> >  1 file changed, 131 insertions(+)
> >  create mode 100644 Documentation/filesystems/mmap_prepare.rst
> >
> > diff --git a/Documentation/filesystems/mmap_prepare.rst b/Documentation/filesystems/mmap_prepare.rst
> > new file mode 100644
> > index 000000000000..76908200f3a1
> > --- /dev/null
> > +++ b/Documentation/filesystems/mmap_prepare.rst
> > @@ -0,0 +1,131 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +===========================
> > +mmap_prepare callback HOWTO
> > +===========================
> > +
> > +Introduction
> > +############
>
> Kernel style is "=============" above instead of "############".

Ack

>
> > +
> > +The `struct file->f_op->mmap()` callback has been deprecated as it is both a
> > +stability and security risk, and doesn't always permit the merging of adjacent
> > +mappings resulting in unnecessary memory fragmentation.
> > +
> > +It has been replaced with the `file->f_op->mmap_prepare()` callback which solves
> > +these problems.
> > +
> > +## How To Use
> > +
> > +In your driver's `struct file_operations` struct, specify an `mmap_prepare`
> > +callback rather than an `mmap` one, e.g. for ext4:
> > +
> > +
> > +.. code-block:: C
> > +
> > +    const struct file_operations ext4_file_operations = {
> > +        ...
> > +        .mmap_prepare    = ext4_file_mmap_prepare,
> > +    };
> > +
> > +This has a signature of `int (*mmap_prepare)(struct vm_area_desc *)`.
> > +
> > +Examining the `struct vm_area_desc` type:
> > +
> > +.. code-block:: C
> > +
> > +    struct vm_area_desc {
> > +        /* Immutable state. */
> > +        const struct mm_struct *const mm;
> > +        struct file *const file; /* May vary from vm_file in stacked callers. */
> > +        unsigned long start;
> > +        unsigned long end;
> > +
> > +        /* Mutable fields. Populated with initial state. */
> > +        pgoff_t pgoff;
> > +        struct file *vm_file;
> > +        vma_flags_t vma_flags;
> > +        pgprot_t page_prot;
> > +
> > +        /* Write-only fields. */
> > +        const struct vm_operations_struct *vm_ops;
> > +        void *private_data;
> > +
> > +        /* Take further action? */
> > +        struct mmap_action action;
> > +    };
> > +
> > +This is straightforward - you have all the fields you need to set up the
> > +mapping, and you can update the mutable and writable fields, for instance:
> > +
> > +.. code-block:: Cw
>
>    .. code-block:: C
>
> Documentation/filesystems/mmap_prepare.rst:60: WARNING: Pygments lexer name 'Cw' is not known [misc.highlighting_failure]
>
> Maybe a typo?

Yeah is a typo thanks!

>
> > +
> > +    static int ext4_file_mmap_prepare(struct vm_area_desc *desc)
> > +    {
> > +        int ret;
> > +        struct file *file = desc->file;
> > +        struct inode *inode = file->f_mapping->host;
> > +
> > +        ...
> > +
> > +        file_accessed(file);
> > +        if (IS_DAX(file_inode(file))) {
> > +            desc->vm_ops = &ext4_dax_vm_ops;
> > +            vma_desc_set_flags(desc, VMA_HUGEPAGE_BIT);
> > +        } else {
> > +            desc->vm_ops = &ext4_file_vm_ops;
> > +        }
> > +        return 0;
> > +    }
> > +
> > +Importantly, you no longer have to dance around with reference counts or locks
> > +when updating these fields - __you can simply go ahead and change them__.
> > +
> > +Everything is taken care of by the mapping code.
> > +
> > +VMA Flags
> > +=========
>
> and then use "---------------" here instead of "==============".

Ack

>
> (from Documentation/doc-guide/sphinx.rst)
>
> > +
> > +Along with `mmap_prepare`, VMA flags have undergone an overhaul. Where before
> > +you would invoke one of `vm_flags_init()`, `vm_flags_reset()`, `vm_flags_set()`,
> > +`vm_flags_clear()`, and `vm_flags_mod()` to modify flags (and to have the
> > +locking done correctly for you), this is no longer necessary.
> > +
> > +Also, the legacy approach of specifying VMA flags via `VM_READ`, `VM_WRITE`,
> > +etc. - i.e. using a `VM_xxx` macro has changed too.
> > +
> > +When implementing `mmap_prepare()`, reference flags by their bit number, defined
> > +as a `VMA_xxx_BIT` macro, e.g. `VMA_READ_BIT`, `VMA_WRITE_BIT` etc., and use one
> > +of (where `desc` is a pointer to `struct vm_area_desc`):
> > +
> > +* `vma_desc_test_flags(desc, ...)` - Specify a comma-separated list of flags you
> > +  wish to test for (whether _any_ are set), e.g. - `vma_desc_test_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)` - returns `true` if either are set,
> > +  otherwise `false`.
> > +* `vma_desc_set_flags(desc, ...)` - Update the VMA descriptor flags to set
> > +  additional flags specified by a comma-separated list,
> > +  e.g. - `vma_desc_set_flags(desc, VMA_PFNMAP_BIT, VMA_IO_BIT)`.
> > +* `vma_desc_clear_flags(desc, ...)` - Update the VMA descriptor flags to clear
> > +  flags specified by a comma-separated list, e.g. - `vma_desc_clear_flags(desc,
> > +  VMA_WRITE_BIT, VMA_MAYWRITE_BIT)`.
> > +
> > +Actions
> > +=======
> > +
> > +You can now very easily have actions be performed upon a mapping once set up by
> > +utilising simple helper functions invoked upon the `struct vm_area_desc`
> > +pointer. These are:
> > +
> > +* `mmap_action_remap()` - Remaps a range consisting only of PFNs for a specific
> > +  range starting at a virtual address and PFN number of a set size.
> > +
> > +* `mmap_action_remap_full()` - Same as `mmap_action_remap()`, only remaps the
> > +  entire mapping from `start_pfn` onward.
> > +
> > +* `mmap_action_ioremap()` - Same as `mmap_action_remap()`, only performs an I/O
> > +  remap.
> > +
> > +* `mmap_action_ioremap_full()` - Same as `mmap_action_ioremap()`, only remaps
> > +  the entire mapping from `start_pfn` onward.
> > +
> > +**NOTE:** The 'action' field should never normally be manipulated directly,
> > +rather you ought to use one of these helpers.
>
> I also see this warning, but I don't know what it is referring to:
>
> Documentation/filesystems/mmap_prepare.rst:132: ERROR: Anonymous hyperlink mismatch: 1 references but 0 targets.
> See "backrefs" attribute for IDs. [docutils]
>
> (OK, I found/fixed that also.)
>
> There are also lots of single ` marks which mean italics. I thought those were
> not what was intended, so I changed (most of) them to `` marks, which means
> "code block / monospace". I can fix those if needed.
>
> from the patch file:
> @Lorenzo: ISTR that you prefer explicit quoting on structs and
> functions. I didn't do that here since kernel automarkup does that,
> but if you prefer, I can redo the patch with those changes.

The issue was that in another document it didn't seem to properly recognise the
types AFAICT (but I might have been mistaken anyway!). But I'm fine without.

>
> HTH.
> --
> ~Randy

Thanks for this, will fold the patch into the respin also!

Cheers, Lorenzo
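
The "test any / set / clear" semantics of the flag helpers described in the
quoted document can be sketched in userspace as follows. This is only a mock
under stated assumptions: the MOCK_*_BIT values, struct mock_flags_desc and the
mock_desc_*_flags() macros are hypothetical and say nothing about how the
kernel actually implements vma_desc_test_flags() and friends.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit numbers; the kernel's VMA_xxx_BIT values may differ. */
enum mock_vma_bit {
	MOCK_READ_BIT,
	MOCK_WRITE_BIT,
	MOCK_MAYWRITE_BIT,
	MOCK_IO_BIT,
};

/* Hypothetical stand-in for the flags part of struct vm_area_desc. */
struct mock_flags_desc {
	unsigned long flags;
};

/* Fold a list of bit numbers into a single mask. */
static inline unsigned long mock_mask(const int *bits, size_t count)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < count; i++)
		mask |= 1UL << bits[i];
	return mask;
}

/* Build a mask from a comma-separated list of bit numbers. */
#define MOCK_MASK(...)							\
	mock_mask((const int[]){ __VA_ARGS__ },				\
		  sizeof((const int[]){ __VA_ARGS__ }) / sizeof(int))

/* True if ANY of the listed bits is set. */
#define mock_desc_test_flags(desc, ...)					\
	(((desc)->flags & MOCK_MASK(__VA_ARGS__)) != 0)
/* Set every listed bit, leaving the others untouched. */
#define mock_desc_set_flags(desc, ...)					\
	((void)((desc)->flags |= MOCK_MASK(__VA_ARGS__)))
/* Clear every listed bit, leaving the others untouched. */
#define mock_desc_clear_flags(desc, ...)				\
	((void)((desc)->flags &= ~MOCK_MASK(__VA_ARGS__)))
```

The point of the sketch is the calling convention: callers name bits rather
than masks, and the test helper answers "is any of these set?" rather than
"are all of these set?".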

^ permalink raw reply

* Re: [PATCH 01/15] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpHcjFU1r7ixiJM4b_a5HTesxBmW6DiCreaWpJ8DLM5haQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 04:06:48PM -0700, Suren Baghdasaryan wrote:
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -4116,10 +4116,10 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
> > >         mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
> > >  }
> > >
> > > -void mmap_action_prepare(struct mmap_action *action,
> > > -                        struct vm_area_desc *desc);
> > > -int mmap_action_complete(struct mmap_action *action,
> > > -                        struct vm_area_struct *vma);
> > > +int mmap_action_prepare(struct vm_area_desc *desc,
> > > +                       struct mmap_action *action);
> > > +int mmap_action_complete(struct vm_area_struct *vma,
> > > +                        struct mmap_action *action);
> > >
> > >  /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
> > >  static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 95b583e7e4f7..7bfa85b5e78b 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -1775,26 +1775,32 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> > >  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> > >  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> > >
> > > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
> > > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > > -               unsigned long pfn, unsigned long size, pgprot_t pgprot);
> > > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > +                           struct mmap_action *action);
> > > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > > +                            struct mmap_action *action);
> > >
> > > -static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > -               unsigned long orig_pfn, unsigned long size)
> > > +static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > > +                                            struct mmap_action *action)
> > >  {
> > > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > > +       const unsigned long size = action->remap.size;
> > >         const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > >
> > > -       return remap_pfn_range_prepare(desc, pfn);
> > > +       action->remap.start_pfn = pfn;
> > > +       return remap_pfn_range_prepare(desc, action);
> > >  }
> > >
> > >  static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
> > > -               unsigned long addr, unsigned long orig_pfn, unsigned long size,
> > > -               pgprot_t orig_prot)
> > > +                                             struct mmap_action *action)
> > >  {
> > > -       const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > > -       const pgprot_t prot = pgprot_decrypted(orig_prot);
> > > +       const unsigned long size = action->remap.size;
> > > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > > +       const pgprot_t orig_prot = vma->vm_page_prot;
> > >
> > > -       return remap_pfn_range_complete(vma, addr, pfn, size, prot);
> > > +       action->remap.pgprot = pgprot_decrypted(orig_prot);
>
> I'm guessing it doesn't really matter but after this change
> action->remap.pgprot will store the decrypted value while before this
> change it was kept the way mmap_prepare() originally set it. We pass
> the action structure later to mmap_action_finish() but it does not use
> action->remap.pgprot, so this probably doesn't matter.

Yeah it doesn't really matter either way.

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH 01/15] mm: various small mmap_prepare cleanups
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpEsCrFEYNkkTfRLGojGOYAAx1=WOojOhpBb_=WZBr6bnQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 03:56:54PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > Rather than passing arbitrary fields, pass an mmap_action field directly to
> > mmap prepare and complete helpers to put all the action-specific logic in
> > the function actually doing the work.
> >
> > Additionally, allow mmap prepare functions to return an error so we can
> > error out as soon as possible if there is something logically incorrect in
> > the input.
> >
> > Update remap_pfn_range_prepare() to properly check the input range for the
> > CoW case.
>
> By "properly check" do you mean the replacement of desc->start and
> desc->end with action->remap.start and action->remap.start +
> action->remap.size when calling get_remap_pgoff() from
> remap_pfn_range_prepare()?
>
> >
> > While we're here, make remap_pfn_range_prepare_vma() a little neater, and
> > pass mmap_action directly to call_action_complete().
> >
> > Then, update compat_vma_mmap() to perform its logic directly, as
> > __compat_vma_map() is not used by anything so we don't need to export it.
>
> Not directly related to this patch but while reviewing, I was also
> checking vma locking rules in this mmap_prepare() + mmap() sequence
> and I noticed that the new VMA flag modification functions like
> vma_set_flags_mask() do assert vma_assert_locked(vma). It would be

Do NOT? :)

I don't think it'd work, because in some cases you're setting flags for a
VMA that is not yet inserted in the tree, etc.

I don't think it's hugely useful to split these functions up the way the
vm_flags_*() stuff is split, so that we assert in some cases but not in
others.

I'd rather keep this as clean an interface as possible.

In any case, most places that set flags do not operate on the VMA itself, so
this would really only affect core code, which likely already asserts where
it needs to.

Where drivers set flags, they will all be using vma_desc_set_flags() etc.

> useful to add these but as a separate change. I will add it to my todo
> list.

So I don't think it'd be generally useful at this time.

>
> >
> > Also update compat_vma_mmap() to use vfs_mmap_prepare() rather than calling
> > the mmap_prepare op directly.
> >
> > Finally, update the VMA userland tests to reflect the changes.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  include/linux/fs.h                |   2 -
> >  include/linux/mm.h                |   8 +--
> >  mm/internal.h                     |  28 +++++---
> >  mm/memory.c                       |  45 +++++++-----
> >  mm/util.c                         | 112 +++++++++++++-----------------
> >  mm/vma.c                          |  21 +++---
> >  tools/testing/vma/include/dup.h   |   9 ++-
> >  tools/testing/vma/include/stubs.h |   9 +--
> >  8 files changed, 123 insertions(+), 111 deletions(-)
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 8b3dd145b25e..a2628a12bd2b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2058,8 +2058,6 @@ static inline bool can_mmap_file(struct file *file)
> >         return true;
> >  }
> >
> > -int __compat_vma_mmap(const struct file_operations *f_op,
> > -               struct file *file, struct vm_area_struct *vma);
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> >
> >  static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 4c4fd55fc823..cc5960a84382 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -4116,10 +4116,10 @@ static inline void mmap_action_ioremap_full(struct vm_area_desc *desc,
> >         mmap_action_ioremap(desc, desc->start, start_pfn, vma_desc_size(desc));
> >  }
> >
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                        struct vm_area_desc *desc);
> > -int mmap_action_complete(struct mmap_action *action,
> > -                        struct vm_area_struct *vma);
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action);
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action);
> >
> >  /* Look up the first VMA which exactly match the interval vm_start ... vm_end */
> >  static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 95b583e7e4f7..7bfa85b5e78b 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -1775,26 +1775,32 @@ int walk_page_range_debug(struct mm_struct *mm, unsigned long start,
> >  void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm);
> >  int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm);
> >
> > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn);
> > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size, pgprot_t pgprot);
> > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                           struct mmap_action *action);
> > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > +                            struct mmap_action *action);
> >
> > -static inline void io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > -               unsigned long orig_pfn, unsigned long size)
> > +static inline int io_remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                                            struct mmap_action *action)
> >  {
> > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > +       const unsigned long size = action->remap.size;
> >         const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> >
> > -       return remap_pfn_range_prepare(desc, pfn);
> > +       action->remap.start_pfn = pfn;
> > +       return remap_pfn_range_prepare(desc, action);
> >  }
> >
> >  static inline int io_remap_pfn_range_complete(struct vm_area_struct *vma,
> > -               unsigned long addr, unsigned long orig_pfn, unsigned long size,
> > -               pgprot_t orig_prot)
> > +                                             struct mmap_action *action)
> >  {
> > -       const unsigned long pfn = io_remap_pfn_range_pfn(orig_pfn, size);
> > -       const pgprot_t prot = pgprot_decrypted(orig_prot);
> > +       const unsigned long size = action->remap.size;
> > +       const unsigned long orig_pfn = action->remap.start_pfn;
> > +       const pgprot_t orig_prot = vma->vm_page_prot;
> >
> > -       return remap_pfn_range_complete(vma, addr, pfn, size, prot);
> > +       action->remap.pgprot = pgprot_decrypted(orig_prot);
> > +       action->remap.start_pfn  = io_remap_pfn_range_pfn(orig_pfn, size);
> > +       return remap_pfn_range_complete(vma, action);
> >  }
> >
> >  #ifdef CONFIG_MMU_NOTIFIER
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6aa0ea4af1fc..364fa8a45360 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3099,26 +3099,34 @@ static int do_remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  #endif
> >
> > -void remap_pfn_range_prepare(struct vm_area_desc *desc, unsigned long pfn)
> > +int remap_pfn_range_prepare(struct vm_area_desc *desc,
> > +                           struct mmap_action *action)
> >  {
> > -       /*
> > -        * We set addr=VMA start, end=VMA end here, so this won't fail, but we
> > -        * check it again on complete and will fail there if specified addr is
> > -        * invalid.
> > -        */
> > -       get_remap_pgoff(vma_desc_is_cow_mapping(desc), desc->start, desc->end,
> > -                       desc->start, desc->end, pfn, &desc->pgoff);
> > +       const unsigned long start = action->remap.start;
> > +       const unsigned long end = start + action->remap.size;
> > +       const unsigned long pfn = action->remap.start_pfn;
> > +       const bool is_cow = vma_desc_is_cow_mapping(desc);
>
> I was trying to figure out who sets action->remap.start and
> action->remap.size and if they are somehow guaranteed to always be equal
> to desc->start and (desc->end - desc->start). My understanding is that
> action->remap.start and action->remap.size are set by
> f_op->mmap_prepare() but I'm not sure if they are always the same as
> desc->start and (desc->end - desc->start) and if so, how do we enforce
> that.

They are set, and they might not always be the same, because the existing
implementation does not set them the same.

Once I've completed the change, I can check to ensure that nobody is doing
anything crazy with this.

I also plan to add specific discontiguous range handlers to handle the
cases where drivers wish to map that way.

In fact, I already implemented it (and the DMA coherent stuff) but stripped it
out of the series for now in the interest of time (the original series was
~27 patches :) as I want to test it more.

Users have access to mmap_action_remap_full() to specify that they want to
remap the full range.
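
To make the distinction concrete, here is a rough userspace sketch of a partial
remap versus a full remap at prepare time. Everything prefixed mock_ is
hypothetical. The CoW rule shown (a CoW mapping must be remapped in full, and
pgoff then records the start PFN) is an assumption about what
get_remap_pgoff() enforces, not something confirmed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for action->remap. */
struct mock_remap {
	unsigned long start;		/* virtual address to remap from */
	unsigned long start_pfn;	/* first PFN to map */
	unsigned long size;		/* size of the remap in bytes */
};

/* Hypothetical, cut-down stand-in for struct vm_area_desc. */
struct mock_remap_desc {
	unsigned long start;
	unsigned long end;
	unsigned long pgoff;
	bool is_cow;
	struct mock_remap remap;
};

/* Request a remap of an arbitrary sub-range chosen by the driver. */
static inline void mock_action_remap(struct mock_remap_desc *desc,
				     unsigned long start,
				     unsigned long start_pfn,
				     unsigned long size)
{
	desc->remap.start = start;
	desc->remap.start_pfn = start_pfn;
	desc->remap.size = size;
}

/* Request a remap of the whole descriptor range. */
static inline void mock_action_remap_full(struct mock_remap_desc *desc,
					  unsigned long start_pfn)
{
	mock_action_remap(desc, desc->start, start_pfn,
			  desc->end - desc->start);
}

/* Validate the requested range up front, returning an error early. */
static inline int mock_remap_prepare(struct mock_remap_desc *desc)
{
	const unsigned long start = desc->remap.start;
	const unsigned long end = start + desc->remap.size;

	if (desc->is_cow) {
		/* Assumed rule: CoW mappings must cover the full range. */
		if (start != desc->start || end != desc->end)
			return -EINVAL;
		desc->pgoff = desc->remap.start_pfn;
	}
	return 0;
}
```

With the requested range (rather than the descriptor range) fed into the
check, a partial remap of a CoW mapping can be rejected at prepare time
instead of only failing later on complete.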

>
> > +       int err;
> > +
> > +       err = get_remap_pgoff(is_cow, start, end, desc->start, desc->end, pfn,
> > +                             &desc->pgoff);
> > +       if (err)
> > +               return err;
> > +
> >         vma_desc_set_flags_mask(desc, VMA_REMAP_FLAGS);
> > +       return 0;
> >  }
> >
> > -static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size)
> > +static int remap_pfn_range_prepare_vma(struct vm_area_struct *vma,
> > +                                      unsigned long addr, unsigned long pfn,
> > +                                      unsigned long size)
> >  {
> > -       unsigned long end = addr + PAGE_ALIGN(size);
> > +       const unsigned long end = addr + PAGE_ALIGN(size);
> > +       const bool is_cow = is_cow_mapping(vma->vm_flags);
> >         int err;
> >
> > -       err = get_remap_pgoff(is_cow_mapping(vma->vm_flags), addr, end,
> > -                             vma->vm_start, vma->vm_end, pfn, &vma->vm_pgoff);
> > +       err = get_remap_pgoff(is_cow, addr, end, vma->vm_start, vma->vm_end,
> > +                             pfn, &vma->vm_pgoff);
> >         if (err)
> >                 return err;
> >
> > @@ -3151,10 +3159,15 @@ int remap_pfn_range(struct vm_area_struct *vma, unsigned long addr,
> >  }
> >  EXPORT_SYMBOL(remap_pfn_range);
> >
> > -int remap_pfn_range_complete(struct vm_area_struct *vma, unsigned long addr,
> > -               unsigned long pfn, unsigned long size, pgprot_t prot)
> > +int remap_pfn_range_complete(struct vm_area_struct *vma,
> > +                            struct mmap_action *action)
> >  {
> > -       return do_remap_pfn_range(vma, addr, pfn, size, prot);
> > +       const unsigned long start = action->remap.start;
> > +       const unsigned long pfn = action->remap.start_pfn;
> > +       const unsigned long size = action->remap.size;
> > +       const pgprot_t prot = action->remap.pgprot;
> > +
> > +       return do_remap_pfn_range(vma, start, pfn, size, prot);
> >  }
> >
> >  /**
> > diff --git a/mm/util.c b/mm/util.c
> > index ce7ae80047cf..dba1191725b6 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -1163,43 +1163,6 @@ void flush_dcache_folio(struct folio *folio)
> >  EXPORT_SYMBOL(flush_dcache_folio);
> >  #endif
> >
> > -/**
> > - * __compat_vma_mmap() - See description for compat_vma_mmap()
> > - * for details. This is the same operation, only with a specific file operations
> > - * struct which may or may not be the same as vma->vm_file->f_op.
> > - * @f_op: The file operations whose .mmap_prepare() hook is specified.
> > - * @file: The file which backs or will back the mapping.
> > - * @vma: The VMA to apply the .mmap_prepare() hook to.
> > - * Returns: 0 on success or error.
> > - */
> > -int __compat_vma_mmap(const struct file_operations *f_op,
> > -               struct file *file, struct vm_area_struct *vma)
> > -{
> > -       struct vm_area_desc desc = {
> > -               .mm = vma->vm_mm,
> > -               .file = file,
> > -               .start = vma->vm_start,
> > -               .end = vma->vm_end,
> > -
> > -               .pgoff = vma->vm_pgoff,
> > -               .vm_file = vma->vm_file,
> > -               .vma_flags = vma->flags,
> > -               .page_prot = vma->vm_page_prot,
> > -
> > -               .action.type = MMAP_NOTHING, /* Default */
> > -       };
> > -       int err;
> > -
> > -       err = f_op->mmap_prepare(&desc);
> > -       if (err)
> > -               return err;
> > -
> > -       mmap_action_prepare(&desc.action, &desc);
> > -       set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(&desc.action, vma);
> > -}
> > -EXPORT_SYMBOL(__compat_vma_mmap);
> > -
> >  /**
> >   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> >   * existing VMA and execute any requested actions.
> > @@ -1228,7 +1191,31 @@ EXPORT_SYMBOL(__compat_vma_mmap);
> >   */
> >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> >  {
> > -       return __compat_vma_mmap(file->f_op, file, vma);
> > +       struct vm_area_desc desc = {
> > +               .mm = vma->vm_mm,
> > +               .file = file,
> > +               .start = vma->vm_start,
> > +               .end = vma->vm_end,
> > +
> > +               .pgoff = vma->vm_pgoff,
> > +               .vm_file = vma->vm_file,
> > +               .vma_flags = vma->flags,
> > +               .page_prot = vma->vm_page_prot,
> > +
> > +               .action.type = MMAP_NOTHING, /* Default */
> > +       };
> > +       int err;
> > +
> > +       err = vfs_mmap_prepare(file, &desc);
> > +       if (err)
> > +               return err;
> > +
> > +       err = mmap_action_prepare(&desc, &desc.action);
> > +       if (err)
> > +               return err;
> > +
> > +       set_vma_from_desc(vma, &desc);
> > +       return mmap_action_complete(vma, &desc.action);
> >  }
> >  EXPORT_SYMBOL(compat_vma_mmap);
> >
> > @@ -1320,8 +1307,8 @@ void snapshot_page(struct page_snapshot *ps, const struct page *page)
> >         }
> >  }
> >
> > -static int mmap_action_finish(struct mmap_action *action,
> > -               const struct vm_area_struct *vma, int err)
> > +static int mmap_action_finish(struct vm_area_struct *vma,
> > +                             struct mmap_action *action, int err)
> >  {
> >         /*
> >          * If an error occurs, unmap the VMA altogether and return an error. We
> > @@ -1355,35 +1342,36 @@ static int mmap_action_finish(struct mmap_action *action,
> >   * action which need to be performed.
> >   * @desc: The VMA descriptor to prepare for @action.
> >   * @action: The action to perform.
> > + *
> > + * Returns: 0 on success, otherwise error.
> >   */
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                        struct vm_area_desc *desc)
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action)
>
> Any reason you are swapping the arguments?

For consistency with other functions to be added.

> It also looks like we always call mmap_action_prepare() with action ==
> desc->action, like this: mmap_action_prepare(&desc.action, &desc). Why
> don't we eliminate the action parameter altogether and use desc.action
> from inside the function?

I think in previous iterations I thought about overriding one action with
another and wanted to keep that flexibility, but then have never done that
in practice.

So probably I can just drop that yes, will try it on respin.
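
As a rough illustration of that suggestion, a prepare helper that takes only
the descriptor and reads the embedded action might look like the userspace
sketch below. All mock_ names are hypothetical, and mock_remap_check() is a
placeholder validation invented for the example rather than the kernel's
actual check.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical mirror of the action types in the quoted switch. */
enum mock_action_type {
	MOCK_NOTHING,
	MOCK_REMAP_PFN,
	MOCK_IO_REMAP_PFN,
};

struct mock_action {
	enum mock_action_type type;
	unsigned long size;
};

/* Hypothetical descriptor with the action embedded, as in vm_area_desc. */
struct mock_action_desc {
	unsigned long start;
	unsigned long end;
	struct mock_action action;
};

/* Placeholder validation: the remap must be non-empty and fit the range. */
static inline int mock_remap_check(const struct mock_action_desc *desc)
{
	if (desc->action.size == 0 ||
	    desc->action.size > desc->end - desc->start)
		return -EINVAL;
	return 0;
}

/* Single-argument prepare: the action is always desc->action. */
static inline int mock_action_prepare(struct mock_action_desc *desc)
{
	const struct mock_action *action = &desc->action;

	switch (action->type) {
	case MOCK_NOTHING:
		return 0;
	case MOCK_REMAP_PFN:
	case MOCK_IO_REMAP_PFN:
		return mock_remap_check(desc);
	}

	/* Unknown action type: fail rather than fall off the end. */
	return -EINVAL;
}
```

Dropping the second parameter removes the possibility of passing an action
that does not belong to the descriptor, at the cost of the flexibility to
override one action with another.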

>
> > +
>
> extra new line.

Ack will fix

>
> >  {
> >         switch (action->type) {
> >         case MMAP_NOTHING:
> > -               break;
> > +               return 0;
> >         case MMAP_REMAP_PFN:
> > -               remap_pfn_range_prepare(desc, action->remap.start_pfn);
> > -               break;
> > +               return remap_pfn_range_prepare(desc, action);
> >         case MMAP_IO_REMAP_PFN:
> > -               io_remap_pfn_range_prepare(desc, action->remap.start_pfn,
> > -                                          action->remap.size);
> > -               break;
> > +               return io_remap_pfn_range_prepare(desc, action);
> >         }
> >  }
> >  EXPORT_SYMBOL(mmap_action_prepare);
> >
> >  /**
> >   * mmap_action_complete - Execute VMA descriptor action.
> > - * @action: The action to perform.
> >   * @vma: The VMA to perform the action upon.
> > + * @action: The action to perform.
> >   *

> >   * Similar to mmap_action_prepare().
> >   *
> >   * Return: 0 on success, or error, at which point the VMA will be unmapped.
> >   */
> > -int mmap_action_complete(struct mmap_action *action,
> > -                        struct vm_area_struct *vma)
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action)
> > +
> >  {
> >         int err = 0;
> >
> > @@ -1391,23 +1379,19 @@ int mmap_action_complete(struct mmap_action *action,
> >         case MMAP_NOTHING:
> >                 break;
> >         case MMAP_REMAP_PFN:
> > -               err = remap_pfn_range_complete(vma, action->remap.start,
> > -                               action->remap.start_pfn, action->remap.size,
> > -                               action->remap.pgprot);
> > +               err = remap_pfn_range_complete(vma, action);
> >                 break;
> >         case MMAP_IO_REMAP_PFN:
> > -               err = io_remap_pfn_range_complete(vma, action->remap.start,
> > -                               action->remap.start_pfn, action->remap.size,
> > -                               action->remap.pgprot);
> > +               err = io_remap_pfn_range_complete(vma, action);
> >                 break;
> >         }
> >
> > -       return mmap_action_finish(action, vma, err);
> > +       return mmap_action_finish(vma, action, err);
> >  }
> >  EXPORT_SYMBOL(mmap_action_complete);
> >  #else
> > -void mmap_action_prepare(struct mmap_action *action,
> > -                       struct vm_area_desc *desc)
> > +int mmap_action_prepare(struct vm_area_desc *desc,
> > +                       struct mmap_action *action)
> >  {
> >         switch (action->type) {
> >         case MMAP_NOTHING:
> > @@ -1417,11 +1401,13 @@ void mmap_action_prepare(struct mmap_action *action,
> >                 WARN_ON_ONCE(1); /* nommu cannot handle these. */
> >                 break;
> >         }
> > +
> > +       return 0;
> >  }
> >  EXPORT_SYMBOL(mmap_action_prepare);
> >
> > -int mmap_action_complete(struct mmap_action *action,
> > -                       struct vm_area_struct *vma)
> > +int mmap_action_complete(struct vm_area_struct *vma,
> > +                        struct mmap_action *action)
> >  {
> >         int err = 0;
> >
> > @@ -1436,7 +1422,7 @@ int mmap_action_complete(struct mmap_action *action,
> >                 break;
> >         }
> >
> > -       return mmap_action_finish(action, vma, err);
> > +       return mmap_action_finish(vma, action, err);
> >  }
> >  EXPORT_SYMBOL(mmap_action_complete);
> >  #endif
> > diff --git a/mm/vma.c b/mm/vma.c
> > index be64f781a3aa..054cf1d262fb 100644
> > --- a/mm/vma.c
> > +++ b/mm/vma.c
> > @@ -2613,15 +2613,19 @@ static void __mmap_complete(struct mmap_state *map, struct vm_area_struct *vma)
> >         vma_set_page_prot(vma);
> >  }
> >
> > -static void call_action_prepare(struct mmap_state *map,
> > -                               struct vm_area_desc *desc)
> > +static int call_action_prepare(struct mmap_state *map,
> > +                              struct vm_area_desc *desc)
> >  {
> >         struct mmap_action *action = &desc->action;
> > +       int err;
> >
> > -       mmap_action_prepare(action, desc);
> > +       err = mmap_action_prepare(desc, action);
> > +       if (err)
> > +               return err;
> >
> >         if (action->hide_from_rmap_until_complete)
> >                 map->hold_file_rmap_lock = true;
> > +       return 0;
> >  }
> >
> >  /*
> > @@ -2645,7 +2649,9 @@ static int call_mmap_prepare(struct mmap_state *map,
> >         if (err)
> >                 return err;
> >
> > -       call_action_prepare(map, desc);
> > +       err = call_action_prepare(map, desc);
> > +       if (err)
> > +               return err;
> >
> >         /* Update fields permitted to be changed. */
> >         map->pgoff = desc->pgoff;
> > @@ -2700,13 +2706,12 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> >  }
> >
> >  static int call_action_complete(struct mmap_state *map,
> > -                               struct vm_area_desc *desc,
> > +                               struct mmap_action *action,
> >                                 struct vm_area_struct *vma)
> >  {
> > -       struct mmap_action *action = &desc->action;
> >         int ret;
> >
> > -       ret = mmap_action_complete(action, vma);
> > +       ret = mmap_action_complete(vma, action);
> >
> >         /* If we held the file rmap we need to release it. */
> >         if (map->hold_file_rmap_lock) {
> > @@ -2768,7 +2773,7 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> >         __mmap_complete(&map, vma);
> >
> >         if (have_mmap_prepare && allocated_new) {
> > -               error = call_action_complete(&map, &desc, vma);
> > +               error = call_action_complete(&map, &desc.action, vma);
> >
> >                 if (error)
> >                         return error;
> > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > index 5eb313beb43d..908beb263307 100644
> > --- a/tools/testing/vma/include/dup.h
> > +++ b/tools/testing/vma/include/dup.h
> > @@ -1106,7 +1106,7 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
> >
> >                 .pgoff = vma->vm_pgoff,
> >                 .vm_file = vma->vm_file,
> > -               .vm_flags = vma->vm_flags,
> > +               .vma_flags = vma->flags,
> >                 .page_prot = vma->vm_page_prot,
> >
> >                 .action.type = MMAP_NOTHING, /* Default */
> > @@ -1117,9 +1117,12 @@ static inline int __compat_vma_mmap(const struct file_operations *f_op,
> >         if (err)
> >                 return err;
> >
> > -       mmap_action_prepare(&desc.action, &desc);
> > +       err = mmap_action_prepare(&desc, &desc.action);
> > +       if (err)
> > +               return err;
> > +
> >         set_vma_from_desc(vma, &desc);
> > -       return mmap_action_complete(&desc.action, vma);
> > +       return mmap_action_complete(vma, &desc.action);
> >  }
> >
> >  static inline int compat_vma_mmap(struct file *file,
> > diff --git a/tools/testing/vma/include/stubs.h b/tools/testing/vma/include/stubs.h
> > index 947a3a0c2566..76c4b668bc62 100644
> > --- a/tools/testing/vma/include/stubs.h
> > +++ b/tools/testing/vma/include/stubs.h
> > @@ -81,13 +81,14 @@ static inline void free_anon_vma_name(struct vm_area_struct *vma)
> >  {
> >  }
> >
> > -static inline void mmap_action_prepare(struct mmap_action *action,
> > -                                          struct vm_area_desc *desc)
> > +static inline int mmap_action_prepare(struct vm_area_desc *desc,
> > +                                     struct mmap_action *action)
> >  {
> > +       return 0;
> >  }
> >
> > -static inline int mmap_action_complete(struct mmap_action *action,
> > -                                          struct vm_area_struct *vma)
> > +static inline int mmap_action_complete(struct vm_area_struct *vma,
> > +                                      struct mmap_action *action)
> >  {
> >         return 0;
> >  }
> > --
> > 2.53.0
> >

* Re: [PATCH 03/15] mm: document vm_operations_struct->open the same as close()
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:31 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Andrew Morton, Jonathan Corbet, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpHVN66abFrJgorXKBsjv7Ut=CP-E4NpLMC4SW613tJwtw@mail.gmail.com>

On Sun, Mar 15, 2026 at 05:43:41PM -0700, Suren Baghdasaryan wrote:
> On Thu, Mar 12, 2026 at 1:27 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > Describe when the operation is invoked and the context in which it is
> > invoked, matching the description already added for vm_op->close().
> >
> > While we're here, update all outdated references to an 'area' field for
> > VMAs to the more consistent 'vma'.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  include/linux/mm.h | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index cc5960a84382..12a0b4c63736 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -748,15 +748,20 @@ struct vm_uffd_ops;
> >   * to the functions called when a no-page or a wp-page exception occurs.
> >   */
> >  struct vm_operations_struct {
> > -       void (*open)(struct vm_area_struct * area);
> > +       /**
> > +        * @open: Called when a VMA is remapped or split. Not called upon first
> > +        * mapping a VMA.
>
> It's also called from dup_mmap() which is part of forking.

Ah yup :) will update thanks!

>
> > +        * Context: User context.  May sleep.  Caller holds mmap_lock.
> > +        */
> > +       void (*open)(struct vm_area_struct *vma);
> >         /**
> >          * @close: Called when the VMA is being removed from the MM.
> >          * Context: User context.  May sleep.  Caller holds mmap_lock.
> >          */
> > -       void (*close)(struct vm_area_struct * area);
> > +       void (*close)(struct vm_area_struct *vma);
> >         /* Called any time before splitting to check if it's allowed */
> > -       int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> > -       int (*mremap)(struct vm_area_struct *area);
> > +       int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
> > +       int (*mremap)(struct vm_area_struct *vma);
> >         /*
> >          * Called by mprotect() to make driver-specific permission
> >          * checks before mprotect() is finalised.   The VMA must not
> > @@ -768,7 +773,7 @@ struct vm_operations_struct {
> >         vm_fault_t (*huge_fault)(struct vm_fault *vmf, unsigned int order);
> >         vm_fault_t (*map_pages)(struct vm_fault *vmf,
> >                         pgoff_t start_pgoff, pgoff_t end_pgoff);
> > -       unsigned long (*pagesize)(struct vm_area_struct * area);
> > +       unsigned long (*pagesize)(struct vm_area_struct *vma);
> >
> >         /* notification that a previously read-only page is about to become
> >          * writable, if an error is returned it will cause a SIGBUS */
> > --
> > 2.53.0
> >

Cheers, Lorenzo

* Re: [PATCH 05/15] fs: afs: correctly drop reference count on mapping failure
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 14:29 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpFio6n-O-1NkPXrymV0o3UqvHYS8ZOyQtt=JXnZ5dTGhQ@mail.gmail.com>

On Sun, Mar 15, 2026 at 07:32:54PM -0700, Suren Baghdasaryan wrote:
> On Fri, Mar 13, 2026 at 5:00 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Fri, Mar 13, 2026 at 04:07:43AM -0700, Usama Arif wrote:
> > > On Thu, 12 Mar 2026 20:27:20 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > >
> > > > Commit 9d5403b1036c ("fs: convert most other generic_file_*mmap() users to
> > > > .mmap_prepare()") updated AFS to use the mmap_prepare callback in favour of
> > > > the deprecated mmap callback.
> > > >
> > > > However, it did not account for the fact that mmap_prepare can fail to map
> > > > due to an out of memory error, and thus should not be incrementing a
> > > > reference count on mmap_prepare.
>
> This is a bit confusing. I see the current implementation does
> afs_add_open_mmap() and then if generic_file_mmap_prepare() fails it
> does afs_drop_open_mmap(), therefore refcounting seems to be balanced.
> Is there really a problem?

Firstly, mmap_prepare is invoked before we try to merge, so the VMA could in
theory get merged and then the refcounting will be wrong.

Secondly, mmap_prepare occurs at such a time that it is _possible_ that
allocation failures as described below could happen.

I'll update the commit message to reflect the merge aspect actually.
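
As a rough illustration of the merge case, here is a userspace toy model (not
kernel code). struct fake_vnode, model_mmap() and model_close() are all
hypothetical names, and the model assumes close() runs once per VMA removed:

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-vnode count of open mmaps. */
struct fake_vnode {
	int open_mmaps;
};

/*
 * Toy model of one mmap() call. The prepare step always runs, but a new VMA
 * (and hence a mapped() call, and later a close() call) only exists if the
 * mapping was not merged into a neighbour.
 *
 * Returns true if a new VMA was created.
 */
bool model_mmap(struct fake_vnode *vnode, bool merged, bool ref_in_prepare)
{
	if (ref_in_prepare)
		vnode->open_mmaps++;	/* Taken before the merge decision. */

	if (merged)
		return false;		/* No new VMA, so no mapped() call. */

	if (!ref_in_prepare)
		vnode->open_mmaps++;	/* Taken from the mapped() hook. */

	return true;
}

/* close() runs once per VMA removed, however many mmap() calls built it. */
void model_close(struct fake_vnode *vnode)
{
	vnode->open_mmaps--;
}
```

Mapping twice with the second call merged, then closing the single resulting
VMA, leaves the count at 1 when the reference is taken in prepare and at 0
when it is taken from mapped().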

>
> > > >
> > > > With the newly added vm_ops->mapped callback available, we can simply defer
> > > > this operation to that callback which is only invoked once the mapping is
> > > > successfully in place (but not yet visible to userspace as the mmap and VMA
> > > > write locks are held).
> > > >
> > > > Therefore add afs_mapped() to implement this callback for AFS.
> > > >
> > > > In practice the mapping allocations are 'too small to fail' so this is
> > > > something that realistically should never happen in practice (or would do
> > > > so in a case where the process is about to die anyway), but we should still
> > > > handle this.
>
> nit: I would drop the above paragraph. If it's impossible why are you
> handling it? If it's unlikely, then handling it is even more
> important.

Sure I can drop it, but it's an ongoing thing with these small allocations.

I wish we could just move to a scenario where we can simply assume allocations
will always succeed :)

Vlasta - thoughts?

Cheers, Lorenzo

>
> > > >
> > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > ---
> > > >  fs/afs/file.c | 20 ++++++++++++++++----
> > > >  1 file changed, 16 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/afs/file.c b/fs/afs/file.c
> > > > index f609366fd2ac..69ef86f5e274 100644
> > > > --- a/fs/afs/file.c
> > > > +++ b/fs/afs/file.c
> > > > @@ -28,6 +28,8 @@ static ssize_t afs_file_splice_read(struct file *in, loff_t *ppos,
> > > >  static void afs_vm_open(struct vm_area_struct *area);
> > > >  static void afs_vm_close(struct vm_area_struct *area);
> > > >  static vm_fault_t afs_vm_map_pages(struct vm_fault *vmf, pgoff_t start_pgoff, pgoff_t end_pgoff);
> > > > +static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >
> > > >  const struct file_operations afs_file_operations = {
> > > >     .open           = afs_open,
> > > > @@ -61,6 +63,7 @@ const struct address_space_operations afs_file_aops = {
> > > >  };
> > > >
> > > >  static const struct vm_operations_struct afs_vm_ops = {
> > > > +   .mapped         = afs_mapped,
> > > >     .open           = afs_vm_open,
> > > >     .close          = afs_vm_close,
> > > >     .fault          = filemap_fault,
> > > > @@ -500,13 +503,22 @@ static int afs_file_mmap_prepare(struct vm_area_desc *desc)
> > > >     afs_add_open_mmap(vnode);
> > >
> > > Is the above afs_add_open_mmap an additional one, which could cause a reference
> > > leak? Does the above one need to be removed and only the one in afs_mapped()
> > > needs to be kept?
> >
> > Ah yeah good spot, will fix thanks!
> >
> > >
> > > >
> > > >     ret = generic_file_mmap_prepare(desc);
> > > > -   if (ret == 0)
> > > > -           desc->vm_ops = &afs_vm_ops;
> > > > -   else
> > > > -           afs_drop_open_mmap(vnode);
> > > > +   if (ret)
> > > > +           return ret;
> > > > +
> > > > +   desc->vm_ops = &afs_vm_ops;
> > > >     return ret;
> > > >  }
> > > >
> > > > +static int afs_mapped(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data)
> > > > +{
> > > > +   struct afs_vnode *vnode = AFS_FS_I(file_inode(file));
> > > > +
> > > > +   afs_add_open_mmap(vnode);
> > > > +   return 0;
> > > > +}
> > > > +
> > > >  static void afs_vm_open(struct vm_area_struct *vma)
> > > >  {
> > > >     afs_add_open_mmap(AFS_FS_I(file_inode(vma->vm_file)));
> > > > --
> > > > 2.53.0
> > > >
> > > >
> >
> > Cheers, Lorenzo

* Re: [PATCH 04/15] mm: add vm_ops->mapped hook
From: Lorenzo Stoakes (Oracle) @ 2026-03-16 13:39 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Usama Arif, Andrew Morton, Clemens Ladisch, Arnd Bergmann,
	Greg Kroah-Hartman, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Alexander Shishkin, Maxime Coquelin,
	Alexandre Torgue, Miquel Raynal, Richard Weinberger,
	Vignesh Raghavendra, Bodo Stroesser, Martin K . Petersen,
	David Howells, Marc Dionne, Alexander Viro, Christian Brauner,
	Jan Kara, David Hildenbrand, Liam R . Howlett, Vlastimil Babka,
	Mike Rapoport, Michal Hocko, Jann Horn, Pedro Falcato,
	linux-kernel, linux-doc, linux-hyperv, linux-stm32,
	linux-arm-kernel, linux-mtd, linux-staging, linux-scsi,
	target-devel, linux-afs, linux-fsdevel, linux-mm, Ryan Roberts
In-Reply-To: <CAJuCfpH1gzi50aWni7rh9=2gM8WwCzm=fY14DCFbjweAq82i6Q@mail.gmail.com>

On Sun, Mar 15, 2026 at 07:18:38PM -0700, Suren Baghdasaryan wrote:
> On Fri, Mar 13, 2026 at 4:58 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Fri, Mar 13, 2026 at 04:02:36AM -0700, Usama Arif wrote:
> > > On Thu, 12 Mar 2026 20:27:19 +0000 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:
> > >
> > > > Previously, when a driver needed to do something like establish a reference
> > > > count, it could do so in the mmap hook in the knowledge that the mapping
> > > > would succeed.
> > > >
> > > > With the introduction of f_op->mmap_prepare this is no longer the case, as
> > > > it is invoked prior to actually establishing the mapping.
> > > >
> > > > To take this into account, introduce a new vm_ops->mapped callback which is
> > > > invoked when the VMA is first mapped (though notably - not when it is
> > > > merged - which is correct and mirrors existing mmap/open/close behaviour).
> > > >
> > > > > We do better than vm_ops->open() here, as this callback can return an
> > > > error, at which point the VMA will be unmapped.
> > > >
> > > > Note that vm_ops->mapped() is invoked after any mmap action is
> > > > complete (such as I/O remapping).
> > > >
> > > > We intentionally do not expose the VMA at this point, exposing only the
> > > > fields that could be used, and an output parameter in case the operation
> > > > needs to update the vma->vm_private_data field.
> > > >
> > > > In order to deal with stacked filesystems which invoke inner filesystem's
> > > > mmap() invocations, add __compat_vma_mapped() and invoke it on
> > > > vfs_mmap() (via compat_vma_mmap()) to ensure that the mapped callback is
> > > > handled when an mmap() caller invokes a nested filesystem's mmap_prepare()
> > > > callback.
> > > >
> > > > We can now also remove call_action_complete() and invoke
> > > > mmap_action_complete() directly, as we separate out the rmap lock logic to
> > > > be called in __mmap_region() instead via maybe_drop_file_rmap_lock().
> > > >
> > > > We also abstract unmapping of a VMA on mmap action completion into its own
> > > > helper function, unmap_vma_locked().
> > > >
> > > > Additionally, update VMA userland test headers to reflect the change.
> > > >
> > > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > > ---
> > > >  include/linux/fs.h              |  9 +++-
> > > >  include/linux/mm.h              | 17 +++++++
> > > >  mm/internal.h                   | 10 ++++
> > > >  mm/util.c                       | 86 ++++++++++++++++++++++++---------
> > > >  mm/vma.c                        | 41 +++++++++++-----
> > > >  tools/testing/vma/include/dup.h | 34 ++++++++++++-
> > > >  6 files changed, 158 insertions(+), 39 deletions(-)
> > > >
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index a2628a12bd2b..c390f5c667e3 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -2059,13 +2059,20 @@ static inline bool can_mmap_file(struct file *file)
> > > >  }
> > > >
> > > >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma);
> > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma);
> > > >
> > > >  static inline int vfs_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  {
> > > > +   int err;
> > > > +
> > > >     if (file->f_op->mmap_prepare)
> > > >             return compat_vma_mmap(file, vma);
> > > >
> > > > -   return file->f_op->mmap(file, vma);
> > > > +   err = file->f_op->mmap(file, vma);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   return __vma_check_mmap_hook(vma);
> > > >  }
> > > >
> > > >  static inline int vfs_mmap_prepare(struct file *file, struct vm_area_desc *desc)
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index 12a0b4c63736..7333d5db1221 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -759,6 +759,23 @@ struct vm_operations_struct {
> > > >      * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > >      */
> > > >     void (*close)(struct vm_area_struct *vma);
> > > > +   /**
> > > > +    * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > +    * the new VMA is merged with an adjacent VMA.
> > > > +    *
> > > > +    * The @vm_private_data field is an output field allowing the user to
> > > > +    * modify vma->vm_private_data as necessary.
> > > > +    *
> > > > +    * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > +    * set from f_op->mmap.
> > > > +    *
> > > > +    * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > +    * be unmapped.
> > > > +    *
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
> > > > +   int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >     /* Called any time before splitting to check if it's allowed */
> > > >     int (*may_split)(struct vm_area_struct *vma, unsigned long addr);
> > > >     int (*mremap)(struct vm_area_struct *vma);
> > > > diff --git a/mm/internal.h b/mm/internal.h
> > > > index 7bfa85b5e78b..f0f2cf1caa36 100644
> > > > --- a/mm/internal.h
> > > > +++ b/mm/internal.h
> > > > @@ -158,6 +158,8 @@ static inline void *folio_raw_mapping(const struct folio *folio)
> > > >   * mmap hook and safely handle error conditions. On error, VMA hooks will be
> > > >   * mutated.
> > > >   *
> > > > + * IMPORTANT: f_op->mmap() is deprecated, prefer f_op->mmap_prepare().
> > > > + *
>
> What exactly would one do to "prefer f_op->mmap_prepare()"?

I'm saying a person should implement f_op->mmap_prepare() rather than
f_op->mmap(), since the latter is deprecated :)

I think that's pretty clear no?

> Since you are adding this comment for mmap_file(), I think you need to
> describe more specifically what one should call instead.

I think it'd be a complete distraction, since if you're at the point of calling
mmap_file() you're already not implementing mmap_prepare except as a
compatibility layer.

I mean maybe I'll just drop this as it seems to be causing confusion.

>
> > > >   * @file: File which backs the mapping.
> > > >   * @vma:  VMA which we are mapping.
> > > >   *
> > > > @@ -201,6 +203,14 @@ static inline void vma_close(struct vm_area_struct *vma)
> > > >  /* unmap_vmas is in mm/memory.c */
> > > >  void unmap_vmas(struct mmu_gather *tlb, struct unmap_desc *unmap);
> > > >
> > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > +{
> > > > +   const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > +
> > > > +   mmap_assert_locked(vma->vm_mm);
>
> You must hold the mmap write lock when unmapping. Would be better to
> assert mmap_assert_write_locked() or even vma_assert_write_locked(),
> which implies mmap_assert_write_locked().

I'm not sure why we don't assert this in those paths.

I think I assumed we could only assert readonly because one of those paths
downgrades the mmap write lock to a read lock.

I don't think we can do a VMA write lock assert here, since at the point of
do_munmap() all callers can't possibly have the VMA write lock, since they are
_looking up_ the VMA at the specified address.

But I can convert this to an mmap_assert_write_locked()!

>
> > > > +   do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > +}
> > > > +
> > > >  #ifdef CONFIG_MMU
> > > >
> > > >  static inline void get_anon_vma(struct anon_vma *anon_vma)
> > > > diff --git a/mm/util.c b/mm/util.c
> > > > index dba1191725b6..2b0ed54008d6 100644
> > > > --- a/mm/util.c
> > > > +++ b/mm/util.c
> > > > @@ -1163,6 +1163,55 @@ void flush_dcache_folio(struct folio *folio)
> > > >  EXPORT_SYMBOL(flush_dcache_folio);
> > > >  #endif
> > > >
> > > > +static int __compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > > +{
> > > > +   struct vm_area_desc desc = {
> > > > +           .mm = vma->vm_mm,
> > > > +           .file = file,
> > > > +           .start = vma->vm_start,
> > > > +           .end = vma->vm_end,
> > > > +
> > > > +           .pgoff = vma->vm_pgoff,
> > > > +           .vm_file = vma->vm_file,
> > > > +           .vma_flags = vma->flags,
> > > > +           .page_prot = vma->vm_page_prot,
> > > > +
> > > > +           .action.type = MMAP_NOTHING, /* Default */
> > > > +   };
> > > > +   int err;
> > > > +
> > > > +   err = vfs_mmap_prepare(file, &desc);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   err = mmap_action_prepare(&desc, &desc.action);
> > > > +   if (err)
> > > > +           return err;
> > > > +
> > > > +   set_vma_from_desc(vma, &desc);
> > > > +   return mmap_action_complete(vma, &desc.action);
> > > > +}
> > > > +
> > > > +static int __compat_vma_mapped(struct file *file, struct vm_area_struct *vma)
> > > > +{
> > > > +   const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > +   void *vm_private_data = vma->vm_private_data;
> > > > +   int err;
> > > > +
> > > > +   if (!vm_ops->mapped)
> > > > +           return 0;
> > > > +
> > >
> > > Hello!
> > >
> > > Can vm_ops be NULL here?  __compat_vma_mapped() is called from
> > > compat_vma_mmap(), which is reached when a filesystem provides
> > > mmap_prepare.  If the mmap_prepare hook does not set desc->vm_ops,
> > > vma->vm_ops will be NULL and this dereferences a NULL pointer.
> >
> > I _think_ for this to ever be invoked, you would need to be dealing with a
> > file-backed VMA so vm_ops->fault would HAVE to be defined.
> >
> > But you're right anyway as a matter of principle we should check it! Will fix.
> >
> > >
> > > For e.g. drivers/char/mem.c, mmap_zero_prepare() would trigger
> > > a NULL pointer dereference here.
> > >
> > > Would need to do
> > >       if (!vm_ops || !vm_ops->mapped)
> > >               return 0;
> > >
> > > here
> >
> > Yes.
> >
> > >
> > >
> > > > +   err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff, file,
> > > > +                        &vm_private_data);
> > > > +   if (err)
> > > > +           unmap_vma_locked(vma);
> > >
> > > when mapped() returns an error, unmap_vma_locked(vma) is called
> > > but execution continues into the vm_private_data update below.  After
> > > unmap_vma_locked() the VMA may be freed (do_munmap can remove the VMA
> > > entirely), so accessing vma->vm_private_data after that is a
> > > use-after-free.
> >
> > Very good point :) will fix thanks!
> >
> > Probably:
> >
> >         if (err)
> >                 unmap_vma_locked(vma);
> >         else if (vm_private_data != vma->vm_private_data)
> >                 vma->vm_private_data = vm_private_data;
> >
> >         return err;
> >
> > Would be fine.
> >
> > >
> > > Probably need to do:
> > >       if (err) {
> > >               unmap_vma_locked(vma);
> > >               return err;
> > >       }
> > >
> > > > +   /* Update private data if changed. */
> > > > +   if (vm_private_data != vma->vm_private_data)
> > > > +           vma->vm_private_data = vm_private_data;
> > > > +
> > > > +   return err;
> > > > +}
> > > > +
> > > >  /**
> > > >   * compat_vma_mmap() - Apply the file's .mmap_prepare() hook to an
> > > >   * existing VMA and execute any requested actions.
> > > > @@ -1191,34 +1240,26 @@ EXPORT_SYMBOL(flush_dcache_folio);
> > > >   */
> > > >  int compat_vma_mmap(struct file *file, struct vm_area_struct *vma)
> > > >  {
> > > > -   struct vm_area_desc desc = {
> > > > -           .mm = vma->vm_mm,
> > > > -           .file = file,
> > > > -           .start = vma->vm_start,
> > > > -           .end = vma->vm_end,
> > > > -
> > > > -           .pgoff = vma->vm_pgoff,
> > > > -           .vm_file = vma->vm_file,
> > > > -           .vma_flags = vma->flags,
> > > > -           .page_prot = vma->vm_page_prot,
> > > > -
> > > > -           .action.type = MMAP_NOTHING, /* Default */
> > > > -   };
> > > >     int err;
> > > >
> > > > -   err = vfs_mmap_prepare(file, &desc);
> > > > -   if (err)
> > > > -           return err;
> > > > -
> > > > -   err = mmap_action_prepare(&desc, &desc.action);
> > > > +   err = __compat_vma_mmap(file, vma);
> > > >     if (err)
> > > >             return err;
> > > >
> > > > -   set_vma_from_desc(vma, &desc);
> > > > -   return mmap_action_complete(vma, &desc.action);
> > > > +   return __compat_vma_mapped(file, vma);
> > > >  }
> > > >  EXPORT_SYMBOL(compat_vma_mmap);
> > > >
> > > > +int __vma_check_mmap_hook(struct vm_area_struct *vma)
> > > > +{
> > > > +   /* vm_ops->mapped is not valid if mmap() is specified. */
> > > > +   if (WARN_ON_ONCE(vma->vm_ops->mapped))
> > > > +           return -EINVAL;
> > >
> > > I think vma->vm_ops can be NULL here. Should be:
> > >
> > >       if (vma->vm_ops && WARN_ON_ONCE(vma->vm_ops->mapped))
> > >               return -EINVAL;
> >
> > I think again you'd probably only invoke this on file-backed so be ok, but again
> > as a matter of principle we should check it so will fix, thanks!
> >
> > >
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(__vma_check_mmap_hook);
>
> nit: Any reason __vma_check_mmap_hook() is not inlined next to its
> user vfs_mmap()?

Headers fun, fs.h is a 'before mm.h' header, so vm_operations_struct is not
declared yet here, so we can't actually do the check there.
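
A minimal single-file sketch of the constraint, with the two headers collapsed
into one translation unit. vfs_mmap_tail(), check_mmap_hook() (standing in for
__vma_check_mmap_hook()) and example_mapped() are hypothetical names, and the
mapped() signature is simplified:

```c
#include <errno.h>

/* --- Early header (fs.h in this thread): forward declarations only. --- */
struct vm_area_struct;
struct vm_operations_struct;

/* Declared here, defined later where the types are complete. */
int check_mmap_hook(struct vm_area_struct *vma);

static inline int vfs_mmap_tail(struct vm_area_struct *vma)
{
	/*
	 * Dereferencing vma->vm_ops->mapped here would not compile, as both
	 * structs are still incomplete types. Passing the pointer on is fine.
	 */
	return check_mmap_hook(vma);
}

/* --- Later header (mm.h in this thread): full definitions. --- */
struct vm_operations_struct {
	int (*mapped)(void);	/* Signature simplified for the sketch. */
};

struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
};

/* Out-of-line check, in a .c file that sees the complete types. */
int check_mmap_hook(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->mapped)
		return -EINVAL;

	return 0;
}

/* Hypothetical hook used only to exercise the check. */
int example_mapped(void)
{
	return 0;
}
```

The inline caller only ever passes the pointer along, which is all an
incomplete type permits, so the actual field access has to live out of line.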

>
> > > > +
> > > >  static void set_ps_flags(struct page_snapshot *ps, const struct folio *folio,
> > > >                      const struct page *page)
> > > >  {
> > > > @@ -1316,10 +1357,7 @@ static int mmap_action_finish(struct vm_area_struct *vma,
> > > >      * invoked if we do NOT merge, so we only clean up the VMA we created.
> > > >      */
> > > >     if (err) {
> > > > -           const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > -
> > > > -           do_munmap(current->mm, vma->vm_start, len, NULL);
> > > > -
> > > > +           unmap_vma_locked(vma);
> > > >             if (action->error_hook) {
> > > >                     /* We may want to filter the error. */
> > > >                     err = action->error_hook(err);
> > > > diff --git a/mm/vma.c b/mm/vma.c
> > > > index 054cf1d262fb..ef9f5a5365d1 100644
> > > > --- a/mm/vma.c
> > > > +++ b/mm/vma.c
> > > > @@ -2705,21 +2705,35 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
> > > >     return false;
> > > >  }
> > > >
> > > > -static int call_action_complete(struct mmap_state *map,
> > > > -                           struct mmap_action *action,
> > > > -                           struct vm_area_struct *vma)
> > > > +static int call_mapped_hook(struct vm_area_struct *vma)
> > > >  {
> > > > -   int ret;
> > > > +   const struct vm_operations_struct *vm_ops = vma->vm_ops;
> > > > +   void *vm_private_data = vma->vm_private_data;
> > > > +   int err;
> > > >
> > > > -   ret = mmap_action_complete(vma, action);
> > > > +   if (!vm_ops || !vm_ops->mapped)
> > > > +           return 0;
> > > > +   err = vm_ops->mapped(vma->vm_start, vma->vm_end, vma->vm_pgoff,
> > > > +                        vma->vm_file, &vm_private_data);
> > > > +   if (err) {
> > > > +           unmap_vma_locked(vma);
> > > > +           return err;
> > > > +   }
> > > > +   /* Update private data if changed. */
> > > > +   if (vm_private_data != vma->vm_private_data)
> > > > +           vma->vm_private_data = vm_private_data;
> > > > +   return 0;
> > > > +}
> > > >
> > > > -   /* If we held the file rmap we need to release it. */
> > > > -   if (map->hold_file_rmap_lock) {
> > > > -           struct file *file = vma->vm_file;
> > > > +static void maybe_drop_file_rmap_lock(struct mmap_state *map,
> > > > +                                 struct vm_area_struct *vma)
> > > > +{
> > > > +   struct file *file;
> > > >
> > > > -           i_mmap_unlock_write(file->f_mapping);
> > > > -   }
> > > > -   return ret;
> > > > +   if (!map->hold_file_rmap_lock)
> > > > +           return;
> > > > +   file = vma->vm_file;
> > > > +   i_mmap_unlock_write(file->f_mapping);
> > > >  }
> > > >
> > > >  static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > > @@ -2773,8 +2787,11 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
> > > >     __mmap_complete(&map, vma);
> > > >
> > > >     if (have_mmap_prepare && allocated_new) {
> > > > -           error = call_action_complete(&map, &desc.action, vma);
> > > > +           error = mmap_action_complete(vma, &desc.action);
> > > > +           if (!error)
> > > > +                   error = call_mapped_hook(vma);
> > > >
> > > > +           maybe_drop_file_rmap_lock(&map, vma);
> > > >             if (error)
> > > >                     return error;
> > > >     }
> > > > diff --git a/tools/testing/vma/include/dup.h b/tools/testing/vma/include/dup.h
> > > > index 908beb263307..47d8db809f31 100644
> > > > --- a/tools/testing/vma/include/dup.h
> > > > +++ b/tools/testing/vma/include/dup.h
> > > > @@ -606,12 +606,34 @@ struct vm_area_struct {
> > > >  } __randomize_layout;
> > > >
> > > >  struct vm_operations_struct {
> > > > -   void (*open)(struct vm_area_struct * area);
> > > > +   /**
> > > > +    * @open: Called when a VMA is remapped or split. Not called upon first
> > > > +    * mapping a VMA.
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
>
> This comment should have been introduced in the previous patch.

It's the testing code, it's not really important. But if I respin I'll fix... :)

>
> > > > +   void (*open)(struct vm_area_struct *vma);
> > > >     /**
> > > >      * @close: Called when the VMA is being removed from the MM.
> > > >      * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > >      */
> > > > -   void (*close)(struct vm_area_struct * area);
> > > > +   void (*close)(struct vm_area_struct *vma);
> > > > +   /**
> > > > +    * @mapped: Called when the VMA is first mapped in the MM. Not called if
> > > > +    * the new VMA is merged with an adjacent VMA.
> > > > +    *
> > > > +    * The @vm_private_data field is an output field allowing the user to
> > > > +    * modify vma->vm_private_data as necessary.
> > > > +    *
> > > > +    * ONLY valid if set from f_op->mmap_prepare. Will result in an error if
> > > > +    * set from f_op->mmap.
> > > > +    *
> > > > +    * Returns %0 on success, or an error otherwise. On error, the VMA will
> > > > +    * be unmapped.
> > > > +    *
> > > > +    * Context: User context.  May sleep.  Caller holds mmap_lock.
> > > > +    */
> > > > +   int (*mapped)(unsigned long start, unsigned long end, pgoff_t pgoff,
> > > > +                 const struct file *file, void **vm_private_data);
> > > >     /* Called any time before splitting to check if it's allowed */
> > > >     int (*may_split)(struct vm_area_struct *area, unsigned long addr);
> > > >     int (*mremap)(struct vm_area_struct *area);
> > > > @@ -1345,3 +1367,11 @@ static inline void vma_set_file(struct vm_area_struct *vma, struct file *file)
> > > >     swap(vma->vm_file, file);
> > > >     fput(file);
> > > >  }
> > > > +
> > > > +static inline void unmap_vma_locked(struct vm_area_struct *vma)
> > > > +{
> > > > +   const size_t len = vma_pages(vma) << PAGE_SHIFT;
> > > > +
> > > > +   mmap_assert_locked(vma->vm_mm);
> > > > +   do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
> > > > +}
> > > > --
> > > > 2.53.0
> > > >
> > > >
> >
> > Cheers, Lorenzo


* Re: [PATCH 50/61] iommu: Prefer IS_ERR_OR_NULL over manual NULL check
From: Robin Murphy @ 2026-03-16 13:30 UTC (permalink / raw)
  To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
	dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
	linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
	linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
	linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
	linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
	linux-nfs, linux-omap, linux-phy, linux-pm, linux-rockchip,
	linux-s390, linux-scsi, linux-sctp, linux-security-module,
	linux-sh, linux-sound, linux-stm32, linux-trace-kernel, linux-usb,
	linux-wireless, netdev, ntfs3, samba-technical, sched-ext,
	target-devel, tipc-discussion, v9fs
  Cc: Joerg Roedel, Will Deacon
In-Reply-To: <20260310-b4-is_err_or_null-v1-50-bd63b656022d@avm.de>

On 2026-03-10 11:49 am, Philipp Hahn wrote:
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.

AFAICS it doesn't look possible for the argument to be anything other 
than valid at both callsites, so *both* conditions here seem in fact to 
be entirely redundant.

> Change generated with coccinelle.

Please use coccinelle responsibly. Mechanical changes are great for 
scripted API updates, but for cleanup, whilst it's ideal for *finding* 
areas of code that are worth looking at, the code then wants actually 
looking at, in its whole context, because meaningful cleanup often goes 
deeper than trivial replacement.

In particular, anywhere IS_ERR_OR_NULL() is genuinely relevant is 
usually a sign of bad interface design, so if you're looking at this 
then you really should be looking first and foremost to remove any 
checks that are already unnecessary, and for the remainder, to see if 
the thing being checked can be improved to not mix the two different 
styles. That would be constructive and (usually) welcome cleanup. Simply 
churning a bunch of code with this ugly macro that's arguably less 
readable than what it replaces, not so much.
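
As a rough illustration of the interface-design point, here is a userspace
sketch in which a lookup function only ever returns a valid pointer or an
error pointer, so its caller needs IS_ERR() alone. The ERR_PTR()/PTR_ERR()/
IS_ERR() helpers are simplified re-implementations written for this example,
and struct widget, widget_lookup() and widget_value() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct widget {
	int value;
};

static struct widget the_widget = { .value = 42 };

/* Returns a valid pointer or ERR_PTR(); by contract it never returns NULL. */
struct widget *widget_lookup(int id)
{
	if (id != 1)
		return ERR_PTR(-ENOENT);

	return &the_widget;
}

/* The caller needs only IS_ERR() because the contract rules out NULL. */
long widget_value(int id)
{
	struct widget *w = widget_lookup(id);

	if (IS_ERR(w))
		return PTR_ERR(w);

	return w->value;
}
```

Because the failure mode is encoded in one style only, there is no place
where a combined NULL-or-error check would be needed.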

Thanks,
Robin.

> To: Joerg Roedel <joro@8bytes.org>
> To: Will Deacon <will@kernel.org>
> To: Robin Murphy <robin.murphy@arm.com>
> Cc: iommu@lists.linux.dev
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
>   drivers/iommu/omap-iommu.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/omap-iommu.c b/drivers/iommu/omap-iommu.c
> index 8231d7d6bb6a9202025643639a6b28e6faa84659..500a42b57a997696ff37c76f028a717ab71d01f9 100644
> --- a/drivers/iommu/omap-iommu.c
> +++ b/drivers/iommu/omap-iommu.c
> @@ -881,7 +881,7 @@ static int omap_iommu_attach(struct omap_iommu *obj, u32 *iopgd)
>    **/
>   static void omap_iommu_detach(struct omap_iommu *obj)
>   {
> -	if (!obj || IS_ERR(obj))
> +	if (IS_ERR_OR_NULL(obj))
>   		return;
>   
>   	spin_lock(&obj->iommu_lock);
> 



* [PATCH 11/11] Drivers: hv: Kconfig: Add ARM64 support for MSHV_VTL
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Enable ARM64 support in MSHV_VTL Kconfig now that all the necessary
support is present.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
index 7937ac0cbd0f..393cef272590 100644
--- a/drivers/hv/Kconfig
+++ b/drivers/hv/Kconfig
@@ -87,7 +87,7 @@ config MSHV_ROOT
 
 config MSHV_VTL
 	tristate "Microsoft Hyper-V VTL driver"
-	depends on X86_64 && HYPERV_VTL_MODE
+	depends on (X86_64 || ARM64) && HYPERV_VTL_MODE
 	depends on HYPERV_VMBUS
 	# Mapping VTL0 memory to a userspace process in VTL2 is supported in OpenHCL.
 	# VTL2 for OpenHCL makes use of Huge Pages to improve performance on VMs,
-- 
2.43.0



* [PATCH 10/11] Drivers: hv: Add support for arm64 in MSHV_VTL
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Add the necessary support to make MSHV_VTL work on the arm64 architecture:
* Add a stub implementation of mshv_vtl_return_call_init(), which is not
  required on arm64
* Remove the fpu/legacy.h header inclusion, as it is not required
* Handle the HV_REGISTER_VSM_CODE_PAGE_OFFSETS register, which is not
  supported on arm64
* Configure a custom per-CPU VMBus handler using
  hv_setup_percpu_vmbus_handler()
* Guard the hugepage functions with config checks
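
The register handling in the list above follows a common pattern: the
register that always exists is placed first so that its index is the same
on every architecture, and the optional one is appended behind a condition.
A minimal userspace sketch of that pattern, where enum reg_name, struct
reg_assoc and build_vsm_reg_list() are hypothetical names and the runtime
flag stands in for the IS_ENABLED(CONFIG_X86_64) check:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register identifiers; the real ones live in the Hyper-V headers. */
enum reg_name {
	REG_VSM_CAPABILITIES = 1,
	REG_VSM_CODE_PAGE_OFFSETS = 2,
};

struct reg_assoc {
	enum reg_name name;
	uint64_t value;
};

/*
 * Fill @regs with the registers to query and return how many were added.
 * The always-present register goes first so its index stays 0 everywhere.
 */
int build_vsm_reg_list(struct reg_assoc regs[2], bool have_code_page_offsets)
{
	int count = 0;

	regs[count++].name = REG_VSM_CAPABILITIES;
	if (have_code_page_offsets)
		regs[count++].name = REG_VSM_CODE_PAGE_OFFSETS;

	return count;
}
```

The result is then read back with the same indices: slot 0 unconditionally,
slot 1 only under the same condition that added it.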

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 arch/arm64/include/asm/mshyperv.h |  2 ++
 drivers/hv/mshv_vtl_main.c        | 21 ++++++++++++++-------
 2 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index 36803f0386cc..027a7f062d70 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -83,6 +83,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
 	return 1;
 }
 
+static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
+
 void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
 bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
 #endif
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 4c9ae65ad3e8..5702fe258500 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -23,8 +23,6 @@
 #include <trace/events/ipi.h>
 #include <uapi/linux/mshv.h>
 #include <hyperv/hvhdk.h>
-
-#include "../../kernel/fpu/legacy.h"
 #include "mshv.h"
 #include "mshv_vtl.h"
 #include "hyperv_vmbus.h"
@@ -206,18 +204,21 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
 static int mshv_vtl_get_vsm_regs(void)
 {
 	struct hv_register_assoc registers[2];
-	int ret, count = 2;
+	int ret, count = 0;
 
-	registers[0].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
-	registers[1].name = HV_REGISTER_VSM_CAPABILITIES;
+	registers[count++].name = HV_REGISTER_VSM_CAPABILITIES;
+	/* Code page offset register is not supported on ARM */
+	if (IS_ENABLED(CONFIG_X86_64))
+		registers[count++].name = HV_REGISTER_VSM_CODE_PAGE_OFFSETS;
 
 	ret = hv_call_get_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
 				       count, input_vtl_zero, registers);
 	if (ret)
 		return ret;
 
-	mshv_vsm_page_offsets.as_uint64 = registers[0].value.reg64;
-	mshv_vsm_capabilities.as_uint64 = registers[1].value.reg64;
+	mshv_vsm_capabilities.as_uint64 = registers[0].value.reg64;
+	if (IS_ENABLED(CONFIG_X86_64))
+		mshv_vsm_page_offsets.as_uint64 = registers[1].value.reg64;
 
 	return ret;
 }
@@ -280,10 +281,13 @@ static int hv_vtl_setup_synic(void)
 
 	/* Use our isr to first filter out packets destined for userspace */
 	hv_setup_vmbus_handler(mshv_vtl_vmbus_isr);
+	/* hv_setup_vmbus_handler() is stubbed for ARM64, add per-cpu VMBus handlers instead */
+	hv_setup_percpu_vmbus_handler(mshv_vtl_vmbus_isr);
 
 	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "hyperv/vtl:online",
 				mshv_vtl_alloc_context, NULL);
 	if (ret < 0) {
+		hv_setup_percpu_vmbus_handler(vmbus_isr);
 		hv_setup_vmbus_handler(vmbus_isr);
 		return ret;
 	}
@@ -296,6 +300,7 @@ static int hv_vtl_setup_synic(void)
 static void hv_vtl_remove_synic(void)
 {
 	cpuhp_remove_state(mshv_vtl_cpuhp_online);
+	hv_setup_percpu_vmbus_handler(vmbus_isr);
 	hv_setup_vmbus_handler(vmbus_isr);
 }
 
@@ -1080,10 +1085,12 @@ static vm_fault_t mshv_vtl_low_huge_fault(struct vm_fault *vmf, unsigned int ord
 			ret = vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 		return ret;
 
+#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 	case PUD_ORDER:
 		if (can_fault(vmf, PUD_SIZE, &pfn))
 			ret = vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
 		return ret;
+#endif
 
 	default:
 		return VM_FAULT_SIGBUS;
-- 
2.43.0



* [PATCH 09/11] Drivers: hv: mshv_vtl: Let userspace do VSM configuration
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

The kernel currently sets the VSM configuration register, thereby
imposing a particular VSM configuration on userspace (OpenVMM).

Userspace (OpenVMM) is able to configure this register itself, and
already does so using the generic hypercall interface. The configuration
can vary by use case or architecture, so let userspace take care of
configuring it and remove this logic from the kernel driver.

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/mshv_vtl_main.c | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index c79d24317b8e..4c9ae65ad3e8 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -222,30 +222,6 @@ static int mshv_vtl_get_vsm_regs(void)
 	return ret;
 }
 
-static int mshv_vtl_configure_vsm_partition(struct device *dev)
-{
-	union hv_register_vsm_partition_config config;
-	struct hv_register_assoc reg_assoc;
-
-	config.as_uint64 = 0;
-	config.default_vtl_protection_mask = HV_MAP_GPA_PERMISSIONS_MASK;
-	config.enable_vtl_protection = 1;
-	config.zero_memory_on_reset = 1;
-	config.intercept_vp_startup = 1;
-	config.intercept_cpuid_unimplemented = 1;
-
-	if (mshv_vsm_capabilities.intercept_page_available) {
-		dev_dbg(dev, "using intercept page\n");
-		config.intercept_page = 1;
-	}
-
-	reg_assoc.name = HV_REGISTER_VSM_PARTITION_CONFIG;
-	reg_assoc.value.reg64 = config.as_uint64;
-
-	return hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
-				       1, input_vtl_zero, &reg_assoc);
-}
-
 static void mshv_vtl_vmbus_isr(void)
 {
 	struct hv_per_cpu_context *per_cpu;
@@ -1168,11 +1144,6 @@ static int __init mshv_vtl_init(void)
 		ret = -ENODEV;
 		goto free_dev;
 	}
-	if (mshv_vtl_configure_vsm_partition(dev)) {
-		dev_emerg(dev, "VSM configuration failed !!\n");
-		ret = -ENODEV;
-		goto free_dev;
-	}
 
 	mshv_vtl_return_call_init(mshv_vsm_page_offsets.vtl_return_offset);
 	ret = hv_vtl_setup_synic();
-- 
2.43.0



* [PATCH 08/11] Drivers: hv: mshv_vtl: Move register page config to arch-specific files
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Move mshv_vtl_configure_reg_page() implementation from
drivers/hv/mshv_vtl_main.c to arch-specific files:
- arch/x86/hyperv/hv_vtl.c: full implementation with register page setup
- arch/arm64/hyperv/hv_vtl.c: stub implementation (unsupported)

Move common type definitions to include/asm-generic/mshyperv.h:
- struct mshv_vtl_per_cpu
- union hv_synic_overlay_page_msr

Move hv_call_get_vp_registers() and hv_call_set_vp_registers()
declarations to include/asm-generic/mshyperv.h since these functions
are used by multiple modules.

While at it, remove the unnecessary stub implementations in the #else
case for the mshv_vtl_return* functions in arch/x86/include/asm/mshyperv.h.

This is essential for adding support for ARM64 in MSHV_VTL.
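
The shape of the split can be sketched in userspace C as follows: the arch
hook reports success as a bool, and the driver-private flag is set by the
common caller rather than inside the hook. All names here are hypothetical,
and the function pointer is only a device for this sketch; in the kernel the
implementation is selected at build time by architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct per_cpu_ctx {
	void *reg_page;		/* stand-in for struct page * */
};

static char fake_page[16];	/* stand-in for an allocated page */

/* Arch hook, "supported" flavour: records the page and reports success. */
bool configure_reg_page_supported(struct per_cpu_ctx *ctx)
{
	ctx->reg_page = fake_page;
	return true;
}

/* Arch hook, "unsupported" flavour: a stub that only reports failure. */
bool configure_reg_page_unsupported(struct per_cpu_ctx *ctx)
{
	(void)ctx;
	return false;
}

/* Driver-private state stays in the common code and is set by the caller. */
bool has_reg_page;

void alloc_context(struct per_cpu_ctx *ctx, bool intercept_page_available,
		   bool (*arch_hook)(struct per_cpu_ctx *))
{
	if (intercept_page_available && arch_hook(ctx))
		has_reg_page = true;
}
```

Keeping the flag in the caller means the arch files do not need to know
about driver-private state.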

Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 arch/arm64/hyperv/hv_vtl.c        |  8 +++++
 arch/arm64/include/asm/mshyperv.h |  3 ++
 arch/x86/hyperv/hv_vtl.c          | 32 ++++++++++++++++++++
 arch/x86/include/asm/mshyperv.h   |  7 ++---
 drivers/hv/mshv.h                 |  8 -----
 drivers/hv/mshv_vtl_main.c        | 49 +++----------------------------
 include/asm-generic/mshyperv.h    | 42 ++++++++++++++++++++++++++
 7 files changed, 92 insertions(+), 57 deletions(-)

diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
index 66318672c242..d699138427c1 100644
--- a/arch/arm64/hyperv/hv_vtl.c
+++ b/arch/arm64/hyperv/hv_vtl.c
@@ -10,6 +10,7 @@
 #include <asm/boot.h>
 #include <asm/mshyperv.h>
 #include <asm/cpu_ops.h>
+#include <linux/export.h>
 
 void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
 {
@@ -142,3 +143,10 @@ void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
 		"v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
 }
 EXPORT_SYMBOL(mshv_vtl_return_call);
+
+bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
+{
+	pr_debug("Register page not supported on ARM64\n");
+	return false;
+}
+EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index de7f3a41a8ea..36803f0386cc 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -61,6 +61,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 				ARM_SMCCC_OWNER_VENDOR_HYP,	\
 				HV_SMCCC_FUNC_NUMBER)
 
+struct mshv_vtl_per_cpu;
+
 struct mshv_vtl_cpu_context {
 /*
  * NOTE: x18 is managed by the hypervisor. It won't be reloaded from this array.
@@ -82,6 +84,7 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
 }
 
 void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
+bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
 #endif
 
 #include <asm-generic/mshyperv.h>
diff --git a/arch/x86/hyperv/hv_vtl.c b/arch/x86/hyperv/hv_vtl.c
index 72a0bb4ae0c7..ede290985d41 100644
--- a/arch/x86/hyperv/hv_vtl.c
+++ b/arch/x86/hyperv/hv_vtl.c
@@ -20,6 +20,7 @@
 #include <uapi/asm/mtrr.h>
 #include <asm/debugreg.h>
 #include <linux/export.h>
+#include <linux/hyperv.h>
 #include <../kernel/smpboot.h>
 #include "../../kernel/fpu/legacy.h"
 
@@ -259,6 +260,37 @@ int __init hv_vtl_early_init(void)
 	return 0;
 }
 
+static const union hv_input_vtl input_vtl_zero;
+
+bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
+{
+	struct hv_register_assoc reg_assoc = {};
+	union hv_synic_overlay_page_msr overlay = {};
+	struct page *reg_page;
+
+	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
+	if (!reg_page) {
+		WARN(1, "failed to allocate register page\n");
+		return false;
+	}
+
+	overlay.enabled = 1;
+	overlay.pfn = page_to_hvpfn(reg_page);
+	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
+	reg_assoc.value.reg64 = overlay.as_uint64;
+
+	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
+				     1, input_vtl_zero, &reg_assoc)) {
+		WARN(1, "failed to setup register page\n");
+		__free_page(reg_page);
+		return false;
+	}
+
+	per_cpu->reg_page = reg_page;
+	return true;
+}
+EXPORT_SYMBOL_GPL(hv_vtl_configure_reg_page);
+
 DEFINE_STATIC_CALL_NULL(__mshv_vtl_return_hypercall, void (*)(void));
 
 void mshv_vtl_return_call_init(u64 vtl_return_offset)
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index d5355a5b7517..d592fea49cdb 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -271,6 +271,8 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg) { return 0; }
 static inline int hv_apicid_to_vp_index(u32 apic_id) { return -EINVAL; }
 #endif /* CONFIG_HYPERV */
 
+struct mshv_vtl_per_cpu;
+
 struct mshv_vtl_cpu_context {
 	union {
 		struct {
@@ -305,13 +307,10 @@ void mshv_vtl_return_call_init(u64 vtl_return_offset);
 void mshv_vtl_return_hypercall(void);
 void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
 int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u64 shared);
+bool hv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu);
 #else
 static inline void __init hv_vtl_init_platform(void) {}
 static inline int __init hv_vtl_early_init(void) { return 0; }
-static inline void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
-static inline void mshv_vtl_return_call_init(u64 vtl_return_offset) {}
-static inline void mshv_vtl_return_hypercall(void) {}
-static inline void __mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0) {}
 #endif
 
 #include <asm-generic/mshyperv.h>
diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
index d4813df92b9c..0fcb7f9ba6a9 100644
--- a/drivers/hv/mshv.h
+++ b/drivers/hv/mshv.h
@@ -14,14 +14,6 @@
 	memchr_inv(&((STRUCT).MEMBER), \
 		   0, sizeof_field(typeof(STRUCT), MEMBER))
 
-int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
-			     union hv_input_vtl input_vtl,
-			     struct hv_register_assoc *registers);
-
-int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
-			     union hv_input_vtl input_vtl,
-			     struct hv_register_assoc *registers);
-
 int hv_call_get_partition_property(u64 partition_id, u64 property_code,
 				   u64 *property_value);
 
diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 91517b45d526..c79d24317b8e 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -78,21 +78,6 @@ struct mshv_vtl {
 	u64 id;
 };
 
-struct mshv_vtl_per_cpu {
-	struct mshv_vtl_run *run;
-	struct page *reg_page;
-};
-
-/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
-union hv_synic_overlay_page_msr {
-	u64 as_uint64;
-	struct {
-		u64 enabled: 1;
-		u64 reserved: 11;
-		u64 pfn: 52;
-	} __packed;
-};
-
 static struct mutex mshv_vtl_poll_file_lock;
 static union hv_register_vsm_page_offsets mshv_vsm_page_offsets;
 static union hv_register_vsm_capabilities mshv_vsm_capabilities;
@@ -201,34 +186,6 @@ static struct page *mshv_vtl_cpu_reg_page(int cpu)
 	return *per_cpu_ptr(&mshv_vtl_per_cpu.reg_page, cpu);
 }
 
-static void mshv_vtl_configure_reg_page(struct mshv_vtl_per_cpu *per_cpu)
-{
-	struct hv_register_assoc reg_assoc = {};
-	union hv_synic_overlay_page_msr overlay = {};
-	struct page *reg_page;
-
-	reg_page = alloc_page(GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL);
-	if (!reg_page) {
-		WARN(1, "failed to allocate register page\n");
-		return;
-	}
-
-	overlay.enabled = 1;
-	overlay.pfn = page_to_hvpfn(reg_page);
-	reg_assoc.name = HV_X64_REGISTER_REG_PAGE;
-	reg_assoc.value.reg64 = overlay.as_uint64;
-
-	if (hv_call_set_vp_registers(HV_VP_INDEX_SELF, HV_PARTITION_ID_SELF,
-				     1, input_vtl_zero, &reg_assoc)) {
-		WARN(1, "failed to setup register page\n");
-		__free_page(reg_page);
-		return;
-	}
-
-	per_cpu->reg_page = reg_page;
-	mshv_has_reg_page = true;
-}
-
 static void mshv_vtl_synic_enable_regs(unsigned int cpu)
 {
 	union hv_synic_sint sint;
@@ -329,8 +286,10 @@ static int mshv_vtl_alloc_context(unsigned int cpu)
 	if (!per_cpu->run)
 		return -ENOMEM;
 
-	if (mshv_vsm_capabilities.intercept_page_available)
-		mshv_vtl_configure_reg_page(per_cpu);
+	if (mshv_vsm_capabilities.intercept_page_available) {
+		if (hv_vtl_configure_reg_page(per_cpu))
+			mshv_has_reg_page = true;
+	}
 
 	mshv_vtl_synic_enable_regs(cpu);
 
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index b147a12085e4..b53fcc071596 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -383,8 +383,50 @@ static inline int hv_deposit_memory(u64 partition_id, u64 status)
 	return hv_deposit_memory_node(NUMA_NO_NODE, partition_id, status);
 }
 
+#if IS_ENABLED(CONFIG_MSHV_ROOT) || IS_ENABLED(CONFIG_MSHV_VTL)
+int hv_call_get_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers);
+
+int hv_call_set_vp_registers(u32 vp_index, u64 partition_id, u16 count,
+			     union hv_input_vtl input_vtl,
+			     struct hv_register_assoc *registers);
+#else
+static inline int hv_call_get_vp_registers(u32 vp_index, u64 partition_id,
+					   u16 count,
+					   union hv_input_vtl input_vtl,
+					   struct hv_register_assoc *registers)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int hv_call_set_vp_registers(u32 vp_index, u64 partition_id,
+					   u16 count,
+					   union hv_input_vtl input_vtl,
+					   struct hv_register_assoc *registers)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_MSHV_ROOT || CONFIG_MSHV_VTL */
+
 #define HV_VP_ASSIST_PAGE_ADDRESS_SHIFT	12
+
 #if IS_ENABLED(CONFIG_HYPERV_VTL_MODE)
+struct mshv_vtl_per_cpu {
+	struct mshv_vtl_run *run;
+	struct page *reg_page;
+};
+
+/* SYNIC_OVERLAY_PAGE_MSR - internal, identical to hv_synic_simp */
+union hv_synic_overlay_page_msr {
+	u64 as_uint64;
+	struct {
+		u64 enabled: 1;
+		u64 reserved: 11;
+		u64 pfn: 52;
+	} __packed;
+};
+
 u8 __init get_vtl(void);
 #else
 static inline u8 get_vtl(void) { return 0; }
-- 
2.43.0



* [PATCH 07/11] arch: arm64: Add support for mshv_vtl_return_call
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Add an arm64-specific variant of the mshv_vtl_return_call() function so
that arm64 support can be added to the MSHV_VTL driver. This helps
enable the transition between Virtual Trust Levels (VTLs) in MSHV_VTL
when the kernel acts as a paravisor.
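
The assembly in this patch addresses the context with fixed offsets: xN at
N * 8 and qN at 32 * 8 + N * 16. A userspace sketch of a layout that matches
those offsets is shown below. The struct and field names, including the
rsvd padding slot, are assumptions inferred from the offsets and are not the
definition of struct mshv_vtl_cpu_context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 128-bit Q register, 16-byte aligned so ldp/stp q pairs line up. */
struct q_reg {
	_Alignas(16) uint64_t lo;
	uint64_t hi;
};

/* Hypothetical layout inferred from the offsets used by the assembly. */
struct vtl_cpu_context_sketch {
	uint64_t x[31];		/* x0..x30 at index * 8; the x18 slot is unused */
	uint64_t rsvd;		/* assumed padding up to 32 * 8 bytes */
	struct q_reg q[32];	/* q0..q31 at 32 * 8 + index * 16 */
};

/* Compile-time checks that the sketch matches the offsets in the asm. */
static_assert(sizeof(struct q_reg) == 16, "q register size");
static_assert(offsetof(struct vtl_cpu_context_sketch, q) == 32 * 8,
	      "q0 must start at 32 * 8");

/* Byte offset of general-purpose register xN within the context. */
size_t gpr_offset(unsigned int n)
{
	return offsetof(struct vtl_cpu_context_sketch, x) + n * sizeof(uint64_t);
}

/* Byte offset of NEON/FP register qN within the context. */
size_t q_offset(unsigned int n)
{
	return offsetof(struct vtl_cpu_context_sketch, q) + n * sizeof(struct q_reg);
}
```

Static assertions of this kind are one way to keep hand-written offsets in
inline assembly from silently drifting away from the C structure.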

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 arch/arm64/hyperv/Makefile        |   1 +
 arch/arm64/hyperv/hv_vtl.c        | 144 ++++++++++++++++++++++++++++++
 arch/arm64/include/asm/mshyperv.h |  13 +++
 3 files changed, 158 insertions(+)
 create mode 100644 arch/arm64/hyperv/hv_vtl.c

diff --git a/arch/arm64/hyperv/Makefile b/arch/arm64/hyperv/Makefile
index 87c31c001da9..9701a837a6e1 100644
--- a/arch/arm64/hyperv/Makefile
+++ b/arch/arm64/hyperv/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-y		:= hv_core.o mshyperv.o
+obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
diff --git a/arch/arm64/hyperv/hv_vtl.c b/arch/arm64/hyperv/hv_vtl.c
new file mode 100644
index 000000000000..66318672c242
--- /dev/null
+++ b/arch/arm64/hyperv/hv_vtl.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2026, Microsoft, Inc.
+ *
+ * Authors:
+ *     Roman Kisel <romank@linux.microsoft.com>
+ *     Naman Jain <namjain@linux.microsoft.com>
+ */
+
+#include <asm/boot.h>
+#include <asm/mshyperv.h>
+#include <asm/cpu_ops.h>
+
+void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0)
+{
+	u64 base_ptr = (u64)vtl0->x;
+
+	/*
+	 * VTL switch for ARM64 platform - managing VTL0's CPU context.
+	 * We explicitly use the stack to save the base pointer, and use x16
+	 * as our working register for accessing the context structure.
+	 *
+	 * Register Handling:
+	 * - X0-X17: Saved/restored (general-purpose, shared for VTL communication)
+	 * - X18: NOT touched - hypervisor-managed per-VTL (platform register)
+	 * - X19-X30: Saved/restored (part of VTL0's execution context)
+	 * - Q0-Q31: Saved/restored (128-bit NEON/floating-point registers, shared)
+	 * - SP: Not in structure, hypervisor-managed per-VTL
+	 *
+	 * Note: X29 (FP) and X30 (LR) are in the structure and must be saved/restored
+	 * as part of VTL0's complete execution state.
+	 */
+	asm __volatile__ (
+		/* Save base pointer to stack explicitly, then load into x16 */
+		"str %0, [sp, #-16]!\n\t"     /* Push base pointer onto stack */
+		"mov x16, %0\n\t"             /* Load base pointer into x16 */
+		/* Volatile registers (Windows ARM64 ABI: x0-x15) */
+		"ldp x0, x1, [x16]\n\t"
+		"ldp x2, x3, [x16, #(2*8)]\n\t"
+		"ldp x4, x5, [x16, #(4*8)]\n\t"
+		"ldp x6, x7, [x16, #(6*8)]\n\t"
+		"ldp x8, x9, [x16, #(8*8)]\n\t"
+		"ldp x10, x11, [x16, #(10*8)]\n\t"
+		"ldp x12, x13, [x16, #(12*8)]\n\t"
+		"ldp x14, x15, [x16, #(14*8)]\n\t"
+		/* x16 will be loaded last, after saving base pointer */
+		"ldr x17, [x16, #(17*8)]\n\t"
+		/* x18 is hypervisor-managed per-VTL - DO NOT LOAD */
+
+		/* General-purpose registers: x19-x30 */
+		"ldp x19, x20, [x16, #(19*8)]\n\t"
+		"ldp x21, x22, [x16, #(21*8)]\n\t"
+		"ldp x23, x24, [x16, #(23*8)]\n\t"
+		"ldp x25, x26, [x16, #(25*8)]\n\t"
+		"ldp x27, x28, [x16, #(27*8)]\n\t"
+
+		/* Frame pointer and link register */
+		"ldp x29, x30, [x16, #(29*8)]\n\t"
+
+		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
+		"ldp q0, q1, [x16, #(32*8)]\n\t"
+		"ldp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
+		"ldp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
+		"ldp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
+		"ldp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
+		"ldp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
+		"ldp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
+		"ldp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
+		"ldp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
+		"ldp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
+		"ldp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
+		"ldp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
+		"ldp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
+		"ldp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
+		"ldp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
+		"ldp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
+
+		/* Now load x16 itself */
+		"ldr x16, [x16, #(16*8)]\n\t"
+
+		/* Return to the lower VTL */
+		"hvc #3\n\t"
+
+		/* Save context after return - reload base pointer from stack */
+		"stp x16, x17, [sp, #-16]!\n\t" /* Save x16, x17 temporarily */
+		"ldr x16, [sp, #16]\n\t"        /* Reload base pointer (skip saved x16,x17) */
+
+		/* Volatile registers */
+		"stp x0, x1, [x16]\n\t"
+		"stp x2, x3, [x16, #(2*8)]\n\t"
+		"stp x4, x5, [x16, #(4*8)]\n\t"
+		"stp x6, x7, [x16, #(6*8)]\n\t"
+		"stp x8, x9, [x16, #(8*8)]\n\t"
+		"stp x10, x11, [x16, #(10*8)]\n\t"
+		"stp x12, x13, [x16, #(12*8)]\n\t"
+		"stp x14, x15, [x16, #(14*8)]\n\t"
+		"ldp x0, x1, [sp], #16\n\t"      /* Recover saved x16, x17 */
+		"stp x0, x1, [x16, #(16*8)]\n\t"
+		/* x18 is hypervisor-managed - DO NOT SAVE */
+
+		/* General-purpose registers: x19-x30 */
+		"stp x19, x20, [x16, #(19*8)]\n\t"
+		"stp x21, x22, [x16, #(21*8)]\n\t"
+		"stp x23, x24, [x16, #(23*8)]\n\t"
+		"stp x25, x26, [x16, #(25*8)]\n\t"
+		"stp x27, x28, [x16, #(27*8)]\n\t"
+		"stp x29, x30, [x16, #(29*8)]\n\t"  /* Frame pointer and link register */
+
+		/* Shared NEON/FP registers: Q0-Q31 (128-bit) */
+		"stp q0, q1, [x16, #(32*8)]\n\t"
+		"stp q2, q3, [x16, #(32*8 + 2*16)]\n\t"
+		"stp q4, q5, [x16, #(32*8 + 4*16)]\n\t"
+		"stp q6, q7, [x16, #(32*8 + 6*16)]\n\t"
+		"stp q8, q9, [x16, #(32*8 + 8*16)]\n\t"
+		"stp q10, q11, [x16, #(32*8 + 10*16)]\n\t"
+		"stp q12, q13, [x16, #(32*8 + 12*16)]\n\t"
+		"stp q14, q15, [x16, #(32*8 + 14*16)]\n\t"
+		"stp q16, q17, [x16, #(32*8 + 16*16)]\n\t"
+		"stp q18, q19, [x16, #(32*8 + 18*16)]\n\t"
+		"stp q20, q21, [x16, #(32*8 + 20*16)]\n\t"
+		"stp q22, q23, [x16, #(32*8 + 22*16)]\n\t"
+		"stp q24, q25, [x16, #(32*8 + 24*16)]\n\t"
+		"stp q26, q27, [x16, #(32*8 + 26*16)]\n\t"
+		"stp q28, q29, [x16, #(32*8 + 28*16)]\n\t"
+		"stp q30, q31, [x16, #(32*8 + 30*16)]\n\t"
+
+		/* Clean up stack - pop base pointer */
+		"add sp, sp, #16\n\t"
+
+		: /* No outputs */
+		: /* Input */ "r"(base_ptr)
+		: /* Clobber list - x16 used as base, x18 is hypervisor-managed (not touched) */
+		"memory", "cc",
+		"x0", "x1", "x2", "x3", "x4", "x5",
+		"x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13",
+		"x14", "x15", "x16", "x17", "x19", "x20", "x21",
+		"x22", "x23", "x24", "x25", "x26", "x27", "x28",
+		"x29", "x30",
+		"v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7",
+		"v8", "v9", "v10", "v11", "v12", "v13", "v14", "v15",
+		"v16", "v17", "v18", "v19", "v20", "v21", "v22", "v23",
+		"v24", "v25", "v26", "v27", "v28", "v29", "v30", "v31");
+}
+EXPORT_SYMBOL(mshv_vtl_return_call);
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index 804068e0941b..de7f3a41a8ea 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -60,6 +60,17 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 				ARM_SMCCC_SMC_64,		\
 				ARM_SMCCC_OWNER_VENDOR_HYP,	\
 				HV_SMCCC_FUNC_NUMBER)
+
+struct mshv_vtl_cpu_context {
+	/*
+	 * NOTE: x18 is managed by the hypervisor and is neither loaded from nor
+	 * saved to this array. Its slot is kept so that x[n] always holds xn.
+	 */
+	__u64 x[31];
+	__u64 rsvd;
+	__uint128_t q[32];
+};
+
 #ifdef CONFIG_HYPERV_VTL_MODE
 /*
  * Get/Set the register. If the function returns `1`, that must be done via
@@ -69,6 +80,8 @@ static inline int hv_vtl_get_set_reg(struct hv_register_assoc *regs, bool set, u
 {
 	return 1;
 }
+
+void mshv_vtl_return_call(struct mshv_vtl_cpu_context *vtl0);
 #endif
 
 #include <asm-generic/mshyperv.h>
-- 
2.43.0
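
The assembly in mshv_vtl_return_call() hard-codes its load/store offsets
(n*8 for xN, 32*8 + n*16 for qN), so it relies on the exact layout of
struct mshv_vtl_cpu_context. A minimal userspace sketch of that layout check
follows. The names vtl_ctx_sketch, ctx_x_offset() and ctx_q_offset() are
hypothetical and not part of the patch, and the sketch assumes a 64-bit
GCC/Clang target that provides unsigned __int128.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct mshv_vtl_cpu_context. */
struct vtl_ctx_sketch {
	uint64_t x[31];          /* x0..x30; the x18 slot is never loaded or stored */
	uint64_t rsvd;           /* pads the GPR area to 32 * 8 = 256 bytes */
	unsigned __int128 q[32]; /* q0..q31, 16 bytes each */
};

/* Byte offset the assembly uses for general-purpose register xN. */
static inline size_t ctx_x_offset(unsigned int n)
{
	return (size_t)n * 8;
}

/* Byte offset the assembly uses for NEON/FP register qN. */
static inline size_t ctx_q_offset(unsigned int n)
{
	return (size_t)32 * 8 + (size_t)n * 16;
}

/* Compile-time checks that the structure matches the hard-coded offsets. */
static_assert(offsetof(struct vtl_ctx_sketch, x[16]) == 16 * 8,
	      "x16 slot must sit at 16*8");
static_assert(offsetof(struct vtl_ctx_sketch, x[29]) == 29 * 8,
	      "x29/x30 pair must start at 29*8");
static_assert(offsetof(struct vtl_ctx_sketch, q) == 32 * 8,
	      "q0 must start right after x[31] and rsvd");
static_assert(sizeof(struct vtl_ctx_sketch) == 32 * 8 + 32 * 16,
	      "no hidden padding expected");
```

Checks of this kind, placed next to the real structure, would turn a silent
mismatch between the structure and the assembly into a build failure. Whether
the series wants them is up to the author.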



* [PATCH 06/11] Drivers: hv: Make sint vector architecture neutral in MSHV_VTL
From: Naman Jain @ 2026-03-16 12:12 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Catalin Marinas, Will Deacon, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin, Arnd Bergmann,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, Alexandre Ghiti
  Cc: Marc Zyngier, Timothy Hayes, Lorenzo Pieralisi, mrigendrachaubey,
	Naman Jain, ssengar, Michael Kelley, linux-hyperv,
	linux-arm-kernel, linux-kernel, linux-arch, linux-riscv
In-Reply-To: <20260316121241.910764-1-namjain@linux.microsoft.com>

Set the synthetic interrupt source (SINT) vector from the vmbus_interrupt
variable instead of HYPERVISOR_CALLBACK_VECTOR. This also covers
architectures where HYPERVISOR_CALLBACK_VECTOR is not defined (arm64).

Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
 drivers/hv/mshv_vtl_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index b607b6e7e121..91517b45d526 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -234,7 +234,7 @@ static void mshv_vtl_synic_enable_regs(unsigned int cpu)
 	union hv_synic_sint sint;
 
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = vmbus_interrupt;
 	sint.masked = false;
 	sint.auto_eoi = hv_recommend_using_aeoi();
 
@@ -753,7 +753,7 @@ static void mshv_vtl_synic_mask_vmbus_sint(void *info)
 	const u8 *mask = info;
 
 	sint.as_uint64 = 0;
-	sint.vector = HYPERVISOR_CALLBACK_VECTOR;
+	sint.vector = vmbus_interrupt;
 	sint.masked = (*mask != 0);
 	sint.auto_eoi = hv_recommend_using_aeoi();
 
-- 
2.43.0



