[PATCH 0/5] mm/vfio: huge pfnmaps with !MAP

* [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings
@ 2025-06-13 13:41 Peter Xu
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx

[based on latest akpm/mm-new as of June 12th 2025, commit 19d47edf9]

This series enables !MAP_FIXED huge pfnmaps for vfio-pci.

Before this series, a userspace application in most cases needs to be
modified to benefit from huge mappings: it has to provide a
huge-size-aligned VA using MAP_FIXED.  After this series, the application
can benefit from huge pfnmaps automatically once the kernel is upgraded,
with no userspace modifications.
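
For illustration only, the userspace workaround described above (reserving
extra VA and then placing the real mapping at an aligned address with
MAP_FIXED) could look roughly like the sketch below.  The helper name
mmap_aligned_fixed() is hypothetical and not taken from any of the test
programs; it only aligns the VA to `align`, which matches the PFN alignment
only when the mapped offset itself is `align`-aligned.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: map `len` bytes at a VA aligned to `align`
 * (a power of two), using an over-sized PROT_NONE reservation plus
 * MAP_FIXED.  Returns MAP_FAILED on error.
 */
static void *mmap_aligned_fixed(size_t len, size_t align, int prot,
                                int flags, int fd, off_t offset)
{
    /* Reserve len + align bytes so an aligned start must exist inside. */
    size_t span = len + align;
    uint8_t *base = mmap(NULL, span, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    uintptr_t start;
    size_t head, tail;
    void *va;

    if (base == MAP_FAILED)
        return MAP_FAILED;

    /* Round the reservation start up to the requested alignment. */
    start = ((uintptr_t)base + align - 1) & ~((uintptr_t)align - 1);

    /* Replace part of the reservation with the real mapping. */
    va = mmap((void *)start, len, prot, flags | MAP_FIXED, fd, offset);
    if (va == MAP_FAILED) {
        munmap(base, span);
        return MAP_FAILED;
    }

    /* Give back the unused head and tail of the reservation. */
    head = start - (uintptr_t)base;
    tail = span - head - len;
    if (head)
        munmap(base, head);
    if (tail)
        munmap((uint8_t *)start + len, tail);
    return va;
}
```

With this series the intent is that such a helper is no longer required for
vfio-pci BAR mappings; a plain mmap() without MAP_FIXED should be enough.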

It's still best-effort, because the auto-alignment requires a larger VA
range to be allocated via the per-arch allocator; if a huge-mapping-aligned
VA cannot be allocated, it still falls back to small mappings like before.
However, that's really a theoretical concern: in practice I don't yet know
of a case where it would fail on any 64-bit system.

So far, only vfio-pci is supported.  But the logic should be applicable to
all the drivers that support or will support huge pfnmaps.

Kudos go to Jason for the suggestion:

  https://lore.kernel.org/r/20250530131050.GA233377@nvidia.com

Though instead of refactoring shmem, I found we already have a function we
can directly reuse for THP calculations.

The idea is fairly simple too: make sure that whatever virtual address is
returned from an mmap() request of the MMIO BAR regions is
huge-size-aligned with the physical address of the corresponding BAR.
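
As a rough illustration of what "aligned with the physical address" means
here: the VA and the physical address should have the same offset within a
huge page, i.e. va % size == phys % size.  The sketch below is plain
userspace arithmetic, not kernel code; the function name and the idea of
searching upwards from a free range starting at `base` are assumptions made
for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the lowest address >= base whose offset
 * within a `size`-sized block (size must be a power of two) equals
 * that of `phys`.  The result is always less than base + size.
 */
static uint64_t va_congruent_to_phys(uint64_t base, uint64_t phys,
                                     uint64_t size)
{
    uint64_t mask = size - 1;
    /* Keep the block number of base, take the in-block offset of phys. */
    uint64_t va = (base & ~mask) | (phys & mask);

    /* If that landed below base, move up by one block. */
    if (va < base)
        va += size;
    return va;
}
```

This is also why a larger VA range has to be requested: up to `size` extra
bytes may be skipped before a suitable start address is found.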

It contains minimal mm changes, in reality only renaming and exporting the
THP function so that it can be reused.  That is patch 3.

Patches 1 & 2 are trivial small cleanups that I found while looking at
this problem.  They can be posted separately if anyone would like me
to.

Patch 4 is the tunneling needed to wire vfio-pci over to the mmap()
operations of vfio_device.  Then, patch 5 is the real meat.

For testing: besides checkpatch and my daily cross-build harness, the unit
tests all work fine, both mine [1] (based on another test program of
Alex's [2]) and Alex's: the alignments all look sane with
mmap(!MAP_FIXED), and huge mappings are properly installed.

Alex Mastro: please feel free to try this out with your internal tests. The
hope is that after this series is applied, your app should get huge pfnmaps
without any changes (with any pgoff specified).  Logically there should be
minimal dependency on stable branches whenever huge pfnmap is available.

Comments welcome, thanks.

[1] https://github.com/xzpeter/clibs/blob/master/misc/vfio-pci-nofix.c
[2] https://github.com/awilliam/tests/blob/vfio-pci-device-map-alignment/vfio-pci-device-map-alignment.c

Peter Xu (5):
  mm: Deduplicate mm_get_unmapped_area()
  mm/hugetlb: Remove prepare_hugepage_range()
  mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  vfio: Introduce vfio_device_ops.get_unmapped_area hook
  vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings

 arch/loongarch/include/asm/hugetlb.h | 14 ------
 arch/mips/include/asm/hugetlb.h      | 14 ------
 drivers/vfio/pci/vfio_pci.c          |  3 ++
 drivers/vfio/pci/vfio_pci_core.c     | 65 ++++++++++++++++++++++++++++
 drivers/vfio/vfio_main.c             | 18 ++++++++
 fs/hugetlbfs/inode.c                 |  8 +---
 include/asm-generic/hugetlb.h        |  8 ----
 include/linux/huge_mm.h              | 14 +++++-
 include/linux/hugetlb.h              |  6 ---
 include/linux/vfio.h                 |  7 +++
 include/linux/vfio_pci_core.h        |  6 +++
 mm/huge_memory.c                     |  6 ++-
 mm/mmap.c                            |  5 +--
 13 files changed, 120 insertions(+), 54 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
@ 2025-06-13 13:41 ` Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
                     ` (5 more replies)
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 6 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx

Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
the helper instead to dedup the lines.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/mmap.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 09c563c95112..422f5b9d9660 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -871,9 +871,8 @@ mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
 		     unsigned long addr, unsigned long len,
 		     unsigned long pgoff, unsigned long flags)
 {
-	if (test_bit(MMF_TOPDOWN, &mm->flags))
-		return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags, 0);
-	return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0);
+	return mm_get_unmapped_area_vmflags(mm, file, addr, len,
+					    pgoff, flags, 0);
 }
 EXPORT_SYMBOL(mm_get_unmapped_area);
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
@ 2025-06-13 13:41 ` Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
                     ` (3 more replies)
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 4 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx, Huacai Chen,
	Thomas Bogendoerfer, Muchun Song, Oscar Salvador, loongarch,
	linux-mips

Only mips and loongarch implemented this API; however, all it did was
check that neither len nor addr goes beyond STACK_TOP.  That's already
done in the arch's arch_get_unmapped_area*() functions, hence it is not
needed.

That means the whole API is pretty much obsolete by now; remove it
completely.

Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: loongarch@lists.linux.dev
Cc: linux-mips@vger.kernel.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 arch/loongarch/include/asm/hugetlb.h | 14 --------------
 arch/mips/include/asm/hugetlb.h      | 14 --------------
 fs/hugetlbfs/inode.c                 |  8 ++------
 include/asm-generic/hugetlb.h        |  8 --------
 include/linux/hugetlb.h              |  6 ------
 5 files changed, 2 insertions(+), 48 deletions(-)

diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
index 4dc4b3e04225..ab68b594f889 100644
--- a/arch/loongarch/include/asm/hugetlb.h
+++ b/arch/loongarch/include/asm/hugetlb.h
@@ -10,20 +10,6 @@
 
 uint64_t pmd_to_entrylo(unsigned long pmd_val);
 
-#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
-static inline int prepare_hugepage_range(struct file *file,
-					 unsigned long addr,
-					 unsigned long len)
-{
-	unsigned long task_size = STACK_TOP;
-
-	if (len > task_size)
-		return -ENOMEM;
-	if (task_size - len < addr)
-		return -EINVAL;
-	return 0;
-}
-
 #define __HAVE_ARCH_HUGE_PTE_CLEAR
 static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 				  pte_t *ptep, unsigned long sz)
diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
index fbc71ddcf0f6..8c460ce01ffe 100644
--- a/arch/mips/include/asm/hugetlb.h
+++ b/arch/mips/include/asm/hugetlb.h
@@ -11,20 +11,6 @@
 
 #include <asm/page.h>
 
-#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
-static inline int prepare_hugepage_range(struct file *file,
-					 unsigned long addr,
-					 unsigned long len)
-{
-	unsigned long task_size = STACK_TOP;
-
-	if (len > task_size)
-		return -ENOMEM;
-	if (task_size - len < addr)
-		return -EINVAL;
-	return 0;
-}
-
 #define __HAVE_ARCH_HUGE_PTEP_GET_AND_CLEAR
 static inline pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 					    unsigned long addr, pte_t *ptep,
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index fc03dd541b4d..32dff13463d2 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -179,12 +179,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 
 	if (len & ~huge_page_mask(h))
 		return -EINVAL;
-	if (flags & MAP_FIXED) {
-		if (addr & ~huge_page_mask(h))
-			return -EINVAL;
-		if (prepare_hugepage_range(file, addr, len))
-			return -EINVAL;
-	}
+	if ((flags & MAP_FIXED) && (addr & ~huge_page_mask(h)))
+		return -EINVAL;
 	if (addr)
 		addr0 = ALIGN(addr, huge_page_size(h));
 
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index 3e0a8fe9b108..4bce4f07f44f 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -114,14 +114,6 @@ static inline int huge_pte_none_mostly(pte_t pte)
 }
 #endif
 
-#ifndef __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
-static inline int prepare_hugepage_range(struct file *file,
-		unsigned long addr, unsigned long len)
-{
-	return 0;
-}
-#endif
-
 #ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
 static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 		unsigned long addr, pte_t *ptep)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 42f374e828a2..85acdfdbe9f0 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -359,12 +359,6 @@ static inline void hugetlb_show_meminfo_node(int nid)
 {
 }
 
-static inline int prepare_hugepage_range(struct file *file,
-				unsigned long addr, unsigned long len)
-{
-	return -EINVAL;
-}
-
 static inline void hugetlb_vma_lock_read(struct vm_area_struct *vma)
 {
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
@ 2025-06-13 13:41 ` Peter Xu
  2025-06-13 14:17   ` Jason Gunthorpe
                     ` (3 more replies)
  2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
  2025-06-13 13:41 ` [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
  4 siblings, 4 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

This function is pretty handy for any type of VMA that needs a size-aligned
virtual address at mmap() time.  Rename the function and export it.

About the rename:

  - Drop "THP" because the function doesn't really have much to do with
    THP internally.

  - The suffix "_aligned" implies that it is a helper to generate an
    aligned virtual address based on what is specified (which may not be
    PMD_SIZE).

Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 14 +++++++++++++-
 mm/huge_memory.c        |  6 ++++--
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..706488d92bb6 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags,
 		vm_flags_t vm_flags);
-
+unsigned long mm_get_unmapped_area_aligned(struct file *filp,
+		unsigned long addr, unsigned long len,
+		loff_t off, unsigned long flags, unsigned long size,
+		vm_flags_t vm_flags);
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
@@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 	return 0;
 }
 
+static inline unsigned long
+mm_get_unmapped_area_aligned(struct file *filp,
+			     unsigned long addr, unsigned long len,
+			     loff_t off, unsigned long flags, unsigned long size,
+			     vm_flags_t vm_flags)
+{
+	return 0;
+}
+
 static inline bool
 can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4734de1dc0ae..52f13a70562f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
 		folio_test_large_rmappable(folio);
 }
 
-static unsigned long __thp_get_unmapped_area(struct file *filp,
+unsigned long mm_get_unmapped_area_aligned(struct file *filp,
 		unsigned long addr, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size,
 		vm_flags_t vm_flags)
@@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 	ret += off_sub;
 	return ret;
 }
+EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);
 
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags,
@@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
+	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+					   PMD_SIZE, vm_flags);
 	if (ret)
 		return ret;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
                   ` (2 preceding siblings ...)
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
@ 2025-06-13 13:41 ` Peter Xu
  2025-06-13 14:18   ` Jason Gunthorpe
                     ` (2 more replies)
  2025-06-13 13:41 ` [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
  4 siblings, 3 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx

Add a hook to vfio_device_ops to allow sub-modules to provide virtual
addresses for an mmap() request.

Note that the fallback will be mm_get_unmapped_area(), which should
maintain the old behavior of generic VA allocation (__get_unmapped_area).
It's a bit unfortunate that this is needed, as the current
get_unmapped_area() file op cannot return a value that means "fall back to
the default".  So the explicit fallback is needed both here and wherever a
sub-module opts in with its own hook.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/vfio_main.c | 18 ++++++++++++++++++
 include/linux/vfio.h     |  7 +++++++
 2 files changed, 25 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582..19db8e58d223 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1354,6 +1354,23 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	return device->ops->mmap(device, vma);
 }
 
+static unsigned long vfio_device_get_unmapped_area(struct file *file,
+						   unsigned long addr,
+						   unsigned long len,
+						   unsigned long pgoff,
+						   unsigned long flags)
+{
+	struct vfio_device_file *df = file->private_data;
+	struct vfio_device *device = df->device;
+
+	if (!device->ops->get_unmapped_area)
+		return mm_get_unmapped_area(current->mm, file, addr,
+					    len, pgoff, flags);
+
+	return device->ops->get_unmapped_area(device, file, addr, len,
+					      pgoff, flags);
+}
+
 const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
 	.open		= vfio_device_fops_cdev_open,
@@ -1363,6 +1380,7 @@ const struct file_operations vfio_device_fops = {
 	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
 	.mmap		= vfio_device_fops_mmap,
+	.get_unmapped_area = vfio_device_get_unmapped_area,
 };
 
 static struct vfio_device *vfio_device_from_file(struct file *file)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 707b00772ce1..48fe71c61ed2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -108,6 +108,7 @@ struct vfio_device {
  * @dma_unmap: Called when userspace unmaps IOVA from the container
  *             this device is attached to.
  * @device_feature: Optional, fill in the VFIO_DEVICE_FEATURE ioctl
+ * @get_unmapped_area: Optional, provide virtual address hint for mmap()
  */
 struct vfio_device_ops {
 	char	*name;
@@ -135,6 +136,12 @@ struct vfio_device_ops {
 	void	(*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	unsigned long (*get_unmapped_area)(struct vfio_device *device,
+					   struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags);
 };
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
                   ` (3 preceding siblings ...)
  2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
@ 2025-06-13 13:41 ` Peter Xu
  2025-06-13 14:29   ` Jason Gunthorpe
                     ` (2 more replies)
  4 siblings, 3 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 13:41 UTC (permalink / raw)
  To: linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, David Hildenbrand, Nico Pache, peterx

This patch enables best-effort aligned mmap() for vfio-pci BARs even
without MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It
should also avoid userspace changes (switching to MAP_FIXED with
pre-aligned VAs) to start enabling huge pfnmaps on VFIO BARs.

Here the trick is making sure the MMIO PFNs are aligned with the VAs
allocated from mmap() when !MAP_FIXED, so that whatever is returned from
mmap(!MAP_FIXED) of vfio-pci MMIO regions is automatically suitable for
huge pfnmaps as much as possible.

To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
devices is needed.

Note that MMIO physical addresses should normally be guaranteed to always
be BAR-size aligned, hence the BAR offset can logically be used directly
for the calculation.  However, to keep it strict and clear (rather than
relying on spec details), we still fetch the BAR's physical address from
pci_dev.resource[].
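
To make the calculation easier to follow, here is a standalone userspace
sketch of the two steps: decoding the mmap() pgoff into a BAR index plus
an offset inside the BAR, and choosing which alignment to try first.  The
SK_* constants are assumed values for illustration (4K pages, 2M PMD, 1G
PUD, region index starting at bit 40 of the file offset), and the function
names are hypothetical.  Unlike the patch, the sketch does not page-align
the BAR length and only reports the first candidate rather than retrying
with the smaller size.

```c
#include <stdint.h>

/* Assumed values for illustration; the kernel derives these per arch. */
#define SK_PAGE_SHIFT        12
#define SK_VFIO_OFFSET_SHIFT 40 /* assumed VFIO_PCI_OFFSET_SHIFT */
#define SK_PMD_SIZE          (UINT64_C(1) << 21)
#define SK_PUD_SIZE          (UINT64_C(1) << 30)

struct bar_req {
    unsigned int index; /* region (BAR) index */
    uint64_t bar_off;   /* byte offset inside the BAR */
};

/* Split a page offset into region index and offset within the region. */
static struct bar_req decode_pgoff(uint64_t pgoff)
{
    struct bar_req r;

    r.index = (unsigned int)(pgoff >>
                             (SK_VFIO_OFFSET_SHIFT - SK_PAGE_SHIFT));
    r.bar_off = (pgoff << SK_PAGE_SHIFT) &
                ((UINT64_C(1) << SK_VFIO_OFFSET_SHIFT) - 1);
    return r;
}

/*
 * Return the first alignment worth trying, or 0 for "use the default
 * allocator".  The PUD case would additionally depend on arch support.
 */
static uint64_t first_alignment_to_try(uint64_t bar_len, uint64_t bar_off,
                                       uint64_t len)
{
    uint64_t span;

    /* Offset beyond the BAR: no valid physical address to align to. */
    if (bar_off >= bar_len)
        return 0;

    span = bar_len < len ? bar_len : len;
    if (span >= SK_PUD_SIZE)
        return SK_PUD_SIZE;
    if (span >= SK_PMD_SIZE)
        return SK_PMD_SIZE;
    return 0;
}
```

The physical address passed on for alignment would then be the BAR's start
address plus bar_off.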

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c      |  3 ++
 drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  6 +++
 3 files changed, 74 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5ba39f7623bb..d9ae6cdbea28 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
 	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
+#endif
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcd..835bc168f8b7 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+/*
+ * Hint function to provide mmap() virtual address candidate so as to be
+ * able to map huge pfnmaps as much as possible.  It is done by aligning
+ * the VA to the PFN to be mapped in the specific bar.
+ *
+ * Note that this function does the minimum check on mmap() parameters to
+ * make the PFN calculation valid only. The majority of mmap() sanity check
+ * will be done later in mmap().
+ */
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+					      struct file *file,
+					      unsigned long addr,
+					      unsigned long len,
+					      unsigned long pgoff,
+					      unsigned long flags)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned long ret, phys_len, req_start, phys_addr;
+	unsigned int index;
+
+	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+
+	/* Currently, only bars 0-5 supports huge pfnmap */
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		goto fallback;
+
+	/* Bar offset */
+	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
+	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
+
+	/*
+	 * Make sure we at least can get a valid physical address to do the
+	 * math.  If this happens, it will probably fail mmap() later..
+	 */
+	if (req_start >= phys_len)
+		goto fallback;
+
+	phys_len = MIN(phys_len, len);
+	/* Calculate the start of physical address to be mapped */
+	phys_addr = pci_resource_start(pdev, index) + req_start;
+
+	/* Choose the alignment */
+	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
+						   flags, PUD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	if (phys_len >= PMD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
+						   flags, PMD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+fallback:
+	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_unmapped_area);
+#endif
+
 static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..e59699e01901 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,12 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
 		size_t count, loff_t *ppos);
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+					      struct file *file,
+					      unsigned long addr,
+					      unsigned long len,
+					      unsigned long pgoff,
+					      unsigned long flags);
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
 void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
 int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
@ 2025-06-13 14:12   ` Jason Gunthorpe
  2025-06-13 14:55   ` Oscar Salvador
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 14:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 09:41:07AM -0400, Peter Xu wrote:
> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/mmap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
@ 2025-06-13 14:12   ` Jason Gunthorpe
  2025-06-13 14:59   ` Oscar Salvador
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 14:12 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache, Huacai Chen,
	Thomas Bogendoerfer, Muchun Song, Oscar Salvador, loongarch,
	linux-mips

On Fri, Jun 13, 2025 at 09:41:08AM -0400, Peter Xu wrote:
> Only mips and loongarch implemented this API, however what it does was
> checking against stack overflow for either len or addr.  That's already
> done in arch's arch_get_unmapped_area*() functions, hence not needed.
> 
> It means the whole API is pretty much obsolete at least now, remove it
> completely.
> 
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/loongarch/include/asm/hugetlb.h | 14 --------------
>  arch/mips/include/asm/hugetlb.h      | 14 --------------
>  fs/hugetlbfs/inode.c                 |  8 ++------
>  include/asm-generic/hugetlb.h        |  8 --------
>  include/linux/hugetlb.h              |  6 ------
>  5 files changed, 2 insertions(+), 48 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
@ 2025-06-13 14:17   ` Jason Gunthorpe
  2025-06-13 15:13     ` Peter Xu
  2025-06-13 15:19   ` Zi Yan
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 14:17 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
>  		folio_test_large_rmappable(folio);
>  }
>  
> -static unsigned long __thp_get_unmapped_area(struct file *filp,
> +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
>  		unsigned long addr, unsigned long len,
>  		loff_t off, unsigned long flags, unsigned long size,
>  		vm_flags_t vm_flags)

Please add a kdoc for this since it is going to be exported..

I didn't intuitively guess how it works or why there are two
length/size arguments. It seems to have an exciting return code as
well.

I suppose size is the alignment target? Maybe rename the parameter too?

For the purposes of VFIO do we need to be careful about math overflow here:

	loff_t off_end = off + len;
	loff_t off_align = round_up(off, size);

?
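
To illustrate the kind of check being asked about, a userspace sketch of
the same two computations with explicit overflow handling might look like
the following.  This is not the kernel implementation; the function name
is hypothetical, the types are plain uint64_t rather than loff_t, and it
relies on the GCC/Clang __builtin_add_overflow() builtin.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does [off, off + len) contain at least one
 * naturally aligned chunk of `size` bytes?  Returns false on any
 * arithmetic overflow or if size is not a power of two.
 */
static bool range_has_aligned_chunk(uint64_t off, uint64_t len,
                                    uint64_t size)
{
    uint64_t off_end, off_align;

    if (size == 0 || (size & (size - 1)))
        return false;

    /* off_end = off + len, rejecting wrap-around. */
    if (__builtin_add_overflow(off, len, &off_end))
        return false;

    /* off_align = round_up(off, size), rejecting wrap-around. */
    if (__builtin_add_overflow(off, size - 1, &off_align))
        return false;
    off_align &= ~(size - 1);

    return off_end > off_align && off_end - off_align >= size;
}
```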

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
@ 2025-06-13 14:18   ` Jason Gunthorpe
  2025-06-13 18:03   ` David Hildenbrand
  2025-06-14 14:46   ` kernel test robot
  2 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 14:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 09:41:10AM -0400, Peter Xu wrote:
> Add a hook to vfio_device_ops to allow sub-modules provide virtual
> addresses for an mmap() request.
> 
> Note that the fallback will be mm_get_unmapped_area(), which should
> maintain the old behavior of generic VA allocation (__get_unmapped_area).
> It's a bit unfortunate that is needed, as the current get_unmapped_area()
> file ops cannot support a retval which fallbacks to the default.  So that
> is needed both here and whenever sub-module will opt-in with its own.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  drivers/vfio/vfio_main.c | 18 ++++++++++++++++++
>  include/linux/vfio.h     |  7 +++++++
>  2 files changed, 25 insertions(+)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 13:41 ` [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
@ 2025-06-13 14:29   ` Jason Gunthorpe
  2025-06-13 15:26     ` Peter Xu
  2025-06-13 18:09   ` David Hildenbrand
       [not found]   ` <20250613174442.1589882-1-amastro@fb.com>
  2 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 14:29 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:

> +	/* Choose the alignment */
> +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PUD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (phys_len >= PMD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PMD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}

Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
contiguity to get a 32*32M=1GB option.

Forcing to only align to the PMD or PUD seems suboptimal..

> +fallback:
> +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);

Why not put this into mm_get_unmapped_area_vmflags() and get rid of
thp_get_unmapped_area_vmflags() too?

Is there any reason the caller should have to do a retry?

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
@ 2025-06-13 14:55   ` Oscar Salvador
  2025-06-13 14:58   ` Zi Yan
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: Oscar Salvador @ 2025-06-13 14:55 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache

On Fri, Jun 13, 2025 at 09:41:07AM -0400, Peter Xu wrote:
> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
  2025-06-13 14:55   ` Oscar Salvador
@ 2025-06-13 14:58   ` Zi Yan
  2025-06-13 15:57   ` Lorenzo Stoakes
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 77+ messages in thread
From: Zi Yan @ 2025-06-13 14:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, David Hildenbrand, Nico Pache

On 13 Jun 2025, at 9:41, Peter Xu wrote:

> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/mmap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
Reviewed-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
@ 2025-06-13 14:59   ` Oscar Salvador
  2025-06-13 15:13   ` Zi Yan
  2025-06-14  4:11   ` Liam R. Howlett
  3 siblings, 0 replies; 77+ messages in thread
From: Oscar Salvador @ 2025-06-13 14:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Huacai Chen, Thomas Bogendoerfer, Muchun Song,
	loongarch, linux-mips

On Fri, Jun 13, 2025 at 09:41:08AM -0400, Peter Xu wrote:
> Only mips and loongarch implemented this API, however what it does was
> checking against stack overflow for either len or addr.  That's already
> done in arch's arch_get_unmapped_area*() functions, hence not needed.
> 
> It means the whole API is pretty much obsolete at least now, remove it
> completely.
> 
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>

I think I forgot to clean these up when I unified the unmapped_area for
hugetlb.

Reviewed-by: Oscar Salvador <osalvador@suse.de>

Thanks Peter!
 

-- 
Oscar Salvador
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
  2025-06-13 14:12   ` Jason Gunthorpe
  2025-06-13 14:59   ` Oscar Salvador
@ 2025-06-13 15:13   ` Zi Yan
  2025-06-13 16:24     ` Peter Xu
  2025-06-14  4:11   ` Liam R. Howlett
  3 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2025-06-13 15:13 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, David Hildenbrand, Nico Pache,
	Huacai Chen, Thomas Bogendoerfer, Muchun Song, Oscar Salvador,
	loongarch, linux-mips

On 13 Jun 2025, at 9:41, Peter Xu wrote:

> Only mips and loongarch implemented this API, however what it does was
> checking against stack overflow for either len or addr.  That's already
> done in arch's arch_get_unmapped_area*() functions, hence not needed.
>
> It means the whole API is pretty much obsolete at least now, remove it
> completely.
>
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/loongarch/include/asm/hugetlb.h | 14 --------------
>  arch/mips/include/asm/hugetlb.h      | 14 --------------
>  fs/hugetlbfs/inode.c                 |  8 ++------
>  include/asm-generic/hugetlb.h        |  8 --------
>  include/linux/hugetlb.h              |  6 ------
>  5 files changed, 2 insertions(+), 48 deletions(-)
>
> diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
> index 4dc4b3e04225..ab68b594f889 100644
> --- a/arch/loongarch/include/asm/hugetlb.h
> +++ b/arch/loongarch/include/asm/hugetlb.h
> @@ -10,20 +10,6 @@
>
>  uint64_t pmd_to_entrylo(unsigned long pmd_val);
>
> -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> -static inline int prepare_hugepage_range(struct file *file,
> -					 unsigned long addr,
> -					 unsigned long len)
> -{
> -	unsigned long task_size = STACK_TOP;
> -
> -	if (len > task_size)
> -		return -ENOMEM;
> -	if (task_size - len < addr)
> -		return -EINVAL;
> -	return 0;
> -}
> -
>  #define __HAVE_ARCH_HUGE_PTE_CLEAR
>  static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>  				  pte_t *ptep, unsigned long sz)
> diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
> index fbc71ddcf0f6..8c460ce01ffe 100644
> --- a/arch/mips/include/asm/hugetlb.h
> +++ b/arch/mips/include/asm/hugetlb.h
> @@ -11,20 +11,6 @@
>
>  #include <asm/page.h>
>
> -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> -static inline int prepare_hugepage_range(struct file *file,
> -					 unsigned long addr,
> -					 unsigned long len)
> -{
> -	unsigned long task_size = STACK_TOP;
> -
> -	if (len > task_size)
> -		return -ENOMEM;

arch_get_unmapped_area_topdown() has this check.

> -	if (task_size - len < addr)
> -		return -EINVAL;

For this one, arch_get_unmapped_area_topdown() instead will try to
provide a different addr if the check fails.

So this patch changes the original code behavior, right?
If yes, it is worth spelling it out in the commit log.

Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>
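
To make the behavior difference concrete, here is a minimal user-space
sketch, not kernel code.  old_prepare_check() mirrors the removed helper
quoted above.  allocator_check() is a hypothetical stand-in for the
allocator-side treatment described in this reply: an over-long request
still fails, but an out-of-range hint is simply not honored instead of
producing -EINVAL.  The function names and the use_hint out-parameter are
assumptions made for this example.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the removed prepare_hugepage_range(): reject outright. */
static int old_prepare_check(unsigned long addr, unsigned long len,
			     unsigned long task_size)
{
	if (len > task_size)
		return -ENOMEM;
	if (task_size - len < addr)
		return -EINVAL;
	return 0;
}

/*
 * Hypothetical allocator-side view: a request longer than the address
 * space still fails, but an out-of-range hint is merely ignored.
 * Returns 0 and sets *use_hint on success, -ENOMEM otherwise.
 */
static int allocator_check(unsigned long addr, unsigned long len,
			   unsigned long task_size, bool *use_hint)
{
	if (len > task_size)
		return -ENOMEM;
	*use_hint = addr && (task_size - len >= addr);
	return 0;
}
```

With a made-up task size of 0x10000, a hint of 0xf000 and a length of
0x2000 is rejected by the old check, while the allocator-style check
succeeds and just reports that the hint cannot be used.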


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 14:17   ` Jason Gunthorpe
@ 2025-06-13 15:13     ` Peter Xu
  2025-06-13 16:00       ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 15:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 11:17:45AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> > @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
> >  		folio_test_large_rmappable(folio);
> >  }
> >  
> > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> >  		unsigned long addr, unsigned long len,
> >  		loff_t off, unsigned long flags, unsigned long size,
> >  		vm_flags_t vm_flags)
> 
> Please add a kdoc for this since it is going to be exported..

Will do.  And thanks for the super fast feedback. :)

> 
> I didn't intuitively guess how it works or why there are two
> length/size arguments. It seems to have an exciting return code as
> well.
> 
> I suppose size is the alignment target? Maybe rename the parameter too?

Yes, when the kdoc is there it'll be more obvious.  So far "size" is ok to
me, but if you have better suggestion please shoot - whatever I came up
with so far seems to be too long, and maybe not necessary when kdoc will be
available too.

> 
> For the purposes of VFIO do we need to be careful about math overflow here:
> 
> 	loff_t off_end = off + len;
> 	loff_t off_align = round_up(off, size);
> 
> ?

IIUC the 1st one was covered by the latter check here:

        (off + len_pad) < off

Indeed I didn't see what makes sure the 2nd won't overflow.

How about I add it within this patch?  A whole fixup could look like this:

From 4d71d1fc905da23786e1252774e42a1051253176 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 13 Jun 2025 10:55:35 -0400
Subject: [PATCH] fixup! mm: Rename __thp_get_unmapped_area to
 mm_get_unmapped_area_aligned

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 52f13a70562f..5cbe45405623 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1088,6 +1088,24 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
 		folio_test_large_rmappable(folio);
 }
 
+/**
+ * mm_get_unmapped_area_aligned - Allocate an aligned virtual address
+ * @filp: file target of the mmap() request
+ * @addr: hint address from mmap() request
+ * @len: len of the mmap() request
+ * @off: file offset of the mmap() request
+ * @flags: flags of the mmap() request
+ * @size: the size of alignment the caller requests
+ * @vm_flags: the vm_flags passed from get_unmapped_area() caller
+ *
+ * This function should normally be used by a driver's specific
+ * get_unmapped_area() handler to provide a properly aligned virtual
+ * address for a specific mmap() request.  The caller should pass in most
+ * of the parameters from the get_unmapped_area() request, but properly
+ * specify @size as the alignment needed.
+ *
+ * Return: non-zero if a valid virtual address is found, zero if fails
+ */
 unsigned long mm_get_unmapped_area_aligned(struct file *filp,
 		unsigned long addr, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size,
@@ -1104,7 +1122,7 @@ unsigned long mm_get_unmapped_area_aligned(struct file *filp,
 		return 0;
 
 	len_pad = len + size;
-	if (len_pad < len || (off + len_pad) < off)
+	if (len_pad < len || (off + len_pad) < off || off_align < off)
 		return 0;
 
 	ret = mm_get_unmapped_area_vmflags(current->mm, filp, addr, len_pad,
-- 
2.49.0
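
As a side note on the extra "off_align < off" test in the fixup above, the
following user-space sketch shows the wraparound it is meant to catch.  It
uses uint64_t so that wrapping is well defined in C; the kernel's loff_t is
signed, so this only illustrates the arithmetic.  round_up_pow2() and
round_up_wrapped() are local helpers written for this example and assume a
power-of-two alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* True if rounding off up to align wrapped past the top of the range. */
static bool round_up_wrapped(uint64_t off, uint64_t align)
{
	return round_up_pow2(off, align) < off;
}
```

An offset within one alignment unit of the top of the range rounds "up" to
a smaller value, which is exactly what the added comparison detects.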


-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
  2025-06-13 14:17   ` Jason Gunthorpe
@ 2025-06-13 15:19   ` Zi Yan
  2025-06-13 18:33     ` Peter Xu
  2025-06-13 15:36   ` Lorenzo Stoakes
  2025-06-14  5:23   ` Liam R. Howlett
  3 siblings, 1 reply; 77+ messages in thread
From: Zi Yan @ 2025-06-13 15:19 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, David Hildenbrand, Nico Pache,
	Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts,
	Dev Jain, Barry Song

On 13 Jun 2025, at 9:41, Peter Xu wrote:

> This function is pretty handy for any type of VMA to provide a size-aligned
> VMA address when mmap().  Rename the function and export it.
>
> About the rename:
>
>   - Dropping "THP" because it doesn't really have much to do with THP
>     internally.
>
>   - The suffix "_aligned" imply it is a helper to generate aligned virtual
>     address based on what is specified (which can be not PMD_SIZE).
>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/huge_mm.h | 14 +++++++++++++-
>  mm/huge_memory.c        |  6 ++++--
>  2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..706488d92bb6 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags,
>  		vm_flags_t vm_flags);
> -
> +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> +		unsigned long addr, unsigned long len,
> +		loff_t off, unsigned long flags, unsigned long size,
> +		vm_flags_t vm_flags);
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  	return 0;
>  }
>
> +static inline unsigned long
> +mm_get_unmapped_area_aligned(struct file *filp,
> +			     unsigned long addr, unsigned long len,
> +			     loff_t off, unsigned long flags, unsigned long size,
> +			     vm_flags_t vm_flags)
> +{
> +	return 0;
> +}
> +
>  static inline bool
>  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4734de1dc0ae..52f13a70562f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
>  		folio_test_large_rmappable(folio);
>  }
>
> -static unsigned long __thp_get_unmapped_area(struct file *filp,
> +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
>  		unsigned long addr, unsigned long len,
>  		loff_t off, unsigned long flags, unsigned long size,

Since you added aligned suffix, renaming size to alignment might
help improve readability.

Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>


Best Regards,
Yan, Zi

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 14:29   ` Jason Gunthorpe
@ 2025-06-13 15:26     ` Peter Xu
  2025-06-13 16:09       ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 15:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> 
> > +	/* Choose the alignment */
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PUD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (phys_len >= PMD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PMD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> 
> Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> contiguity to get a 32*32M=1GB option.
> 
> Forcing to only align to the PMD or PUD seems suboptimal..

Right, however the cont-pte / cont-pmd are still not supported in huge
pfnmaps in general?  It'll definitely be nice if someone could look at that
from ARM perspective, then provide support of both in one shot.

> 
> > +fallback:
> > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> 
> Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> thp_get_unmapped_area_vmflags() too?
> 
> Is there any reason the caller should have to do a retry?

We would still need thp_get_unmapped_area_vmflags() because that encodes
PMD_SIZE for THPs; we need the flexibility of providing any size alignment
as a generic helper.

But I get your point.  For example, mm_get_unmapped_area_aligned() can
still fallback to mm_get_unmapped_area_vmflags() automatically.

That was ok, however that loses some flexibility when the caller wants to
try with different alignments, exactly like above: currently, it was trying
to do a first attempt of PUD mapping then fallback to PMD if that fails.

Indeed I don't know whether such a fallback would help in our unit tests.
But logically speaking we'd need to look into every arch's VA allocator to
know when it might fail with bigger allocations, and if PUD fails it's
still sensible to retry with PMD if available.  From that POV, we don't
want to immediately fall back to 4K if 1G fails.
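
The retry order described above can be sketched in user space as follows.
try_aligned() is a mock standing in for mm_get_unmapped_area_aligned(): it
returns 0 on failure, matching the return convention of that helper, and
fails whenever the padded request exceeds a made-up amount of free address
space.  The sizes assume 4K base pages with 2M and 1G huge levels, and
pick_alignment() is a name invented for this example.

```c
#include <stdint.h>

#define SZ_4K (1ULL << 12)
#define SZ_2M (1ULL << 21)
#define SZ_1G (1ULL << 30)

/* Mock: succeeds only if len plus align bytes of padding fit. */
static uint64_t try_aligned(uint64_t len, uint64_t align, uint64_t free_space)
{
	if (len + align > free_space)
		return 0;		/* zero means "could not align" */
	return align;			/* any non-zero aligned address */
}

/* Returns the alignment actually obtained: 1G, then 2M, then 4K. */
static uint64_t pick_alignment(uint64_t len, uint64_t free_space)
{
	if (len >= SZ_1G && try_aligned(len, SZ_1G, free_space))
		return SZ_1G;
	if (len >= SZ_2M && try_aligned(len, SZ_2M, free_space))
		return SZ_2M;
	return SZ_4K;			/* unaligned fallback */
}
```

The middle case is the interesting one: when the 1G attempt fails only
because of its larger padding, the 2M attempt can still succeed.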

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
  2025-06-13 14:17   ` Jason Gunthorpe
  2025-06-13 15:19   ` Zi Yan
@ 2025-06-13 15:36   ` Lorenzo Stoakes
  2025-06-13 18:45     ` Peter Xu
  2025-06-14  5:23   ` Liam R. Howlett
  3 siblings, 1 reply; 77+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 15:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> This function is pretty handy for any type of VMA to provide a size-aligned
> VMA address when mmap().  Rename the function and export it.

This isn't a great commit message, 'to provide a size-aligned VMA address when
mmap()' is super unclear - do you mean 'to provide an unmapped address that is
also aligned to the specified size'?

I think you should also specify your motive, renaming and exporting something
because it seems handy isn't sufficient justification.

Also why would we need to export this? What modules might want to use this? I'm
generally not a huge fan of exporting things unless we strictly have to.

>
> About the rename:
>
>   - Dropping "THP" because it doesn't really have much to do with THP
>     internally.

Well the function seems specifically tailored to the THP use. I think you'll
need to further adjust this.

>
>   - The suffix "_aligned" imply it is a helper to generate aligned virtual
>     address based on what is specified (which can be not PMD_SIZE).

Ack this is sensible!

>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/huge_mm.h | 14 +++++++++++++-
>  mm/huge_memory.c        |  6 ++++--
>  2 files changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2f190c90192d..706488d92bb6 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h

Why are we keeping everything in huge_mm.h, huge_memory.c if this is being made
generic?

Surely this should be moved out into mm/mmap.c no?

> @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags,
>  		vm_flags_t vm_flags);
> -
> +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> +		unsigned long addr, unsigned long len,
> +		loff_t off, unsigned long flags, unsigned long size,
> +		vm_flags_t vm_flags);

I echo Jason's comments about a kdoc and explanation of what this function does.

>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  	return 0;
>  }
>
> +static inline unsigned long
> +mm_get_unmapped_area_aligned(struct file *filp,
> +			     unsigned long addr, unsigned long len,
> +			     loff_t off, unsigned long flags, unsigned long size,
> +			     vm_flags_t vm_flags)
> +{
> +	return 0;
> +}
> +
>  static inline bool
>  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 4734de1dc0ae..52f13a70562f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
>  		folio_test_large_rmappable(folio);
>  }
>
> -static unsigned long __thp_get_unmapped_area(struct file *filp,
> +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
>  		unsigned long addr, unsigned long len,
>  		loff_t off, unsigned long flags, unsigned long size,
>  		vm_flags_t vm_flags)
> @@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
>  	ret += off_sub;
>  	return ret;
>  }
> +EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);

I'm not convinced about exporting this... shouldn't we export only if we
explicitly have a user?

I'd rather we didn't unless we needed to.

>
>  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
>  		unsigned long len, unsigned long pgoff, unsigned long flags,
> @@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  	unsigned long ret;
>  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>
> -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> +	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
> +					   PMD_SIZE, vm_flags);
>  	if (ret)
>  		return ret;
>
> --
> 2.49.0
>

So, you don't touch the original function but there's stuff there I think we
need to think about if this is generalised.

E.g.:

	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
		return 0;

This still valid?

	/*
	 * The failure might be due to length padding. The caller will retry
	 * without the padding.
	 */
	if (IS_ERR_VALUE(ret))
		return 0;

This is assuming things the (currently single) caller will do, that is no longer
an assumption you can make, especially if exported.

Actually you maybe want to abstract the whole of thp_get_unmapped_area_vmflags()
no? As this has a fallback mode?

	/*
	 * Do not try to align to THP boundary if allocation at the address
	 * hint succeeds.
	 */
	if (ret == addr)
		return addr;

What was that about this no longer being relevant to THP? :>)

Are all of these 'return 0' cases expected by any sensible caller? It seems like
it's a way for thp_get_unmapped_area_vmflags() to recognise when to fall back to
non-aligned?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
                     ` (2 preceding siblings ...)
  2025-06-13 14:58   ` Zi Yan
@ 2025-06-13 15:57   ` Lorenzo Stoakes
  2025-06-13 17:00     ` Pedro Falcato
  2025-06-13 18:00   ` David Hildenbrand
  2025-06-16  8:01   ` David Laight
  5 siblings, 1 reply; 77+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 15:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Liam R. Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato

You've not cc'd maintainers/reviewers of mm/mmap.c, please make sure to do so.

+cc Liam
+cc Vlastimil
+cc Jann
+cc Pedro

...!

On Fri, Jun 13, 2025 at 09:41:07AM -0400, Peter Xu wrote:
> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This looks fine though, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/mmap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 09c563c95112..422f5b9d9660 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -871,9 +871,8 @@ mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
>  		     unsigned long addr, unsigned long len,
>  		     unsigned long pgoff, unsigned long flags)
>  {
> -	if (test_bit(MMF_TOPDOWN, &mm->flags))
> -		return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags, 0);
> -	return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0);
> +	return mm_get_unmapped_area_vmflags(mm, file, addr, len,
> +					    pgoff, flags, 0);
>  }
>  EXPORT_SYMBOL(mm_get_unmapped_area);
>
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 15:13     ` Peter Xu
@ 2025-06-13 16:00       ` Jason Gunthorpe
  2025-06-13 18:31         ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 16:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 11:13:58AM -0400, Peter Xu wrote:
> > I didn't intuitively guess how it works or why there are two
> > length/size arguments. It seems to have an exciting return code as
> > well.
> > 
> > I suppose size is the alignment target? Maybe rename the parameter too?
> 
> Yes, when the kdoc is there it'll be more obvious.  So far "size" is ok to
> me, but if you have better suggestion please shoot - whatever I came up
> with so far seems to be too long, and maybe not necessary when kdoc will be
> available too.

I would call it align not size

> > For the purposes of VFIO do we need to be careful about math overflow here:
> > 
> > 	loff_t off_end = off + len;
> > 	loff_t off_align = round_up(off, size);
> > 
> > ?
> 
> IIUC the 1st one was covered by the latter check here:
> 
>         (off + len_pad) < off
> 
> Indeed I didn't see what makes sure the 2nd won't overflow.

I'm not sure the < tests are safe in this modern world. I would use
the overflow helpers directly and remove the < overflow checks.
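
For illustration, the helper-based style suggested here could look roughly
like the user-space sketch below.  It calls the GCC/Clang builtin
__builtin_add_overflow() directly; kernel code would normally go through
check_add_overflow() from <linux/overflow.h>.  The function name
offsets_ok() and its exact shape are assumptions made for this example, not
part of the patch.

```c
#include <stdbool.h>

/*
 * Return true if both off + len and off + len + align fit in a signed
 * 64-bit file offset.  Signed overflow is undefined in C, so the sums
 * are never formed with a plain '+'.
 */
static bool offsets_ok(long long off, unsigned long len, unsigned long align)
{
	long long end, padded_end;

	if (__builtin_add_overflow(off, len, &end))
		return false;
	if (__builtin_add_overflow(end, align, &padded_end))
		return false;
	return true;
}
```

The builtin evaluates the sum as if with unlimited precision and reports
whether it fits the destination type, so mixing a signed offset with
unsigned lengths is handled without manual casts.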

> +/**
> + * mm_get_unmapped_area_aligned - Allocate an aligned virtual address
> + * @filp: file target of the mmap() request
> + * @addr: hint address from mmap() request
> + * @len: len of the mmap() request
> + * @off: file offset of the mmap() request
> + * @flags: flags of the mmap() request
> + * @size: the size of alignment the caller requests

Just "the alignment the caller requests"

> + * @vm_flags: the vm_flags passed from get_unmapped_area() caller
> + *
> + * This function should normally be used by a driver's specific
> + * get_unmapped_area() handler to provide a properly aligned virtual
> + * address for a specific mmap() request.  The caller should pass in most
> + * of the parameters from the get_unmapped_area() request, but properly
> + * specify @size as the alignment needed.

 .. "The function will try to return a VMA starting address such that
 ret % size == 0"
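
A small user-space sketch of the arithmetic may help here.  It assumes the
adjustment at the end of the helper is "off_sub = (off - ret) & (size - 1)"
followed by "ret += off_sub"; only the second statement is visible in the
hunks quoted in this thread, so treat the first as an assumption.  Under
that assumption the returned address is congruent to the offset modulo the
alignment, which reduces to ret % size == 0 when the offset itself is
aligned.  place_aligned() is a name invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the start of a padded free range (at least len + align bytes)
 * and the file offset, return the lowest address >= base that is
 * congruent to off modulo align.  align must be a power of two.
 */
static uint64_t place_aligned(uint64_t base, uint64_t off, uint64_t align)
{
	uint64_t off_sub;

	assert(align && (align & (align - 1)) == 0);
	off_sub = (off - base) & (align - 1);
	return base + off_sub;
}
```

Because off_sub is always smaller than align, the shifted range still fits
inside the padded allocation of len + align bytes.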

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 15:26     ` Peter Xu
@ 2025-06-13 16:09       ` Jason Gunthorpe
  2025-06-13 19:15         ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 16:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 11:26:40AM -0400, Peter Xu wrote:
> On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> > 
> > > +	/* Choose the alignment */
> > > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > +						   flags, PUD_SIZE, 0);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > > +
> > > +	if (phys_len >= PMD_SIZE) {
> > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > +						   flags, PMD_SIZE, 0);
> > > +		if (ret)
> > > +			return ret;
> > > +	}
> > 
> > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > contiguity to get a 32*32M=1GB option.
> > 
> > Forcing to only align to the PMD or PUD seems suboptimal..
> 
> Right, however the cont-pte / cont-pmd are still not supported in huge
> pfnmaps in general?  It'll definitely be nice if someone could look at that
> from ARM perspective, then provide support of both in one shot.

Maybe leave behind a comment about this. I've been poking around to see if
someone would do the ARM PFNMAP support but can't report any commitment.

> > > +fallback:
> > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > 
> > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > thp_get_unmapped_area_vmflags() too?
> > 
> > Is there any reason the caller should have to do a retry?
> 
> We would still need thp_get_unmapped_area_vmflags() because that encodes
> PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> as a generic helper.

There is only one caller for thp_get_unmapped_area_vmflags(), just
open code PMD_SIZE there and thin this whole thing out. It reads
better like that anyhow:

	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
		   && !addr /* no hint */
		   && IS_ALIGNED(len, PMD_SIZE)) {
		/* Ensures that larger anonymous mappings are THP aligned. */
		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
						    flags, vm_flags, PMD_SIZE);

> That was ok, however that loses some flexibility when the caller wants to
> try with different alignments, exactly like above: currently, it was trying
> to do a first attempt of PUD mapping then fallback to PMD if that fails.

Oh, that's a good point, I didn't notice that subtle bit.

But then maybe that is showing the API is just wrong and the core code
should be trying to find the best alignment not the caller. Like we
can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?

VFIO would just pass the BAR size, implying the best alignment, and
the core implementation will try to get the largest VMA alignment that
snaps to an arch supported page contiguity, testing each of the arches
page size possibilities in turn.

That sounds like a much better API than pushing this into drivers??
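
A rough user-space sketch of that idea, under the assumption that the core
keeps a descending table of mappable sizes for the architecture: the
caller passes only the length, and a hypothetical best_align() picks the
largest level the length can cover.  The table contents and both names are
illustrative only; nothing like this exists in the series as posted.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative table, largest first; a real one would be per-arch. */
static const uint64_t map_sizes[] = {
	1ULL << 30,	/* 1G */
	32ULL << 20,	/* 32M, e.g. 16 contiguous 2M entries */
	2ULL << 20,	/* 2M */
	4ULL << 10,	/* 4K base page */
};

/* Largest table entry that does not exceed len, or 0 if len is tiny. */
static uint64_t best_align(uint64_t len)
{
	size_t i;

	for (i = 0; i < sizeof(map_sizes) / sizeof(map_sizes[0]); i++)
		if (len >= map_sizes[i])
			return map_sizes[i];
	return 0;
}
```

A fuller version would presumably also retry with the next smaller entry
when the address-space search for the chosen alignment fails, which keeps
the PUD-then-PMD behavior without the driver spelling it out.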

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 15:13   ` Zi Yan
@ 2025-06-13 16:24     ` Peter Xu
  2025-06-13 18:01       ` David Hildenbrand
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 16:24 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, David Hildenbrand, Nico Pache,
	Huacai Chen, Thomas Bogendoerfer, Muchun Song, Oscar Salvador,
	loongarch, linux-mips

On Fri, Jun 13, 2025 at 11:13:50AM -0400, Zi Yan wrote:
> On 13 Jun 2025, at 9:41, Peter Xu wrote:
> 
> > Only mips and loongarch implemented this API, however what it does was
> > checking against stack overflow for either len or addr.  That's already
> > done in arch's arch_get_unmapped_area*() functions, hence not needed.
> >
> > It means the whole API is pretty much obsolete at least now, remove it
> > completely.
> >
> > Cc: Huacai Chen <chenhuacai@kernel.org>
> > Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: loongarch@lists.linux.dev
> > Cc: linux-mips@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/loongarch/include/asm/hugetlb.h | 14 --------------
> >  arch/mips/include/asm/hugetlb.h      | 14 --------------
> >  fs/hugetlbfs/inode.c                 |  8 ++------
> >  include/asm-generic/hugetlb.h        |  8 --------
> >  include/linux/hugetlb.h              |  6 ------
> >  5 files changed, 2 insertions(+), 48 deletions(-)
> >
> > diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
> > index 4dc4b3e04225..ab68b594f889 100644
> > --- a/arch/loongarch/include/asm/hugetlb.h
> > +++ b/arch/loongarch/include/asm/hugetlb.h
> > @@ -10,20 +10,6 @@
> >
> >  uint64_t pmd_to_entrylo(unsigned long pmd_val);
> >
> > -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> > -static inline int prepare_hugepage_range(struct file *file,
> > -					 unsigned long addr,
> > -					 unsigned long len)
> > -{
> > -	unsigned long task_size = STACK_TOP;
> > -
> > -	if (len > task_size)
> > -		return -ENOMEM;
> > -	if (task_size - len < addr)
> > -		return -EINVAL;
> > -	return 0;
> > -}
> > -
> >  #define __HAVE_ARCH_HUGE_PTE_CLEAR
> >  static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
> >  				  pte_t *ptep, unsigned long sz)
> > diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
> > index fbc71ddcf0f6..8c460ce01ffe 100644
> > --- a/arch/mips/include/asm/hugetlb.h
> > +++ b/arch/mips/include/asm/hugetlb.h
> > @@ -11,20 +11,6 @@
> >
> >  #include <asm/page.h>
> >
> > -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> > -static inline int prepare_hugepage_range(struct file *file,
> > -					 unsigned long addr,
> > -					 unsigned long len)
> > -{
> > -	unsigned long task_size = STACK_TOP;
> > -
> > -	if (len > task_size)
> > -		return -ENOMEM;
> 
> arch_get_unmapped_area_topdown() has this check.
> 
> > -	if (task_size - len < addr)
> > -		return -EINVAL;
> 
> For this one, arch_get_unmapped_area_topdown() instead will try to
> provide a different addr if the check fails.
> 
> So this patch changes the original code behavior, right?

The behavior should be almost unchanged.  Note that before this patch,
prepare_hugepage_range() is only used for MAP_FIXED:

hugetlb_get_unmapped_area():
        if (flags & MAP_FIXED) {
                if (addr & ~huge_page_mask(h))
                        return -EINVAL;
                if (prepare_hugepage_range(file, addr, len))
                        return -EINVAL;
        }

Then for MAP_FIXED, on MIPS:

arch_get_unmapped_area_common():
        ...
	if (flags & MAP_FIXED) {
		/* Even MAP_FIXED mappings must reside within TASK_SIZE */
		if (TASK_SIZE - len < addr)
			return -EINVAL;
                ...
        }

But if we want to be super accurate, it's indeed different, in that the old
hugetlb code was checking stack top with STACK_TOP, which is
mips_stack_top() for MIPS: it's a value that might be slightly less than
TASK_SIZE..

So strictly speaking, there's indeed a trivial difference on the oddity of
defining stack top, but my guess is nothing will be affected.  I can add
some explanation into the commit message in that case.
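
To make the difference concrete, the following userspace sketch applies the
same two range checks against two different limits.  The limit values used
in the usage note are made up for illustration (they are not the real MIPS
TASK_SIZE or mips_stack_top() values), and `fixed_range_check()` is a
hypothetical name.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * The MAP_FIXED range check in the form used by both the removed
 * prepare_hugepage_range() and the arch get_unmapped_area code; only
 * the limit passed in differs between the two.
 */
int fixed_range_check(uint64_t limit, uint64_t addr, uint64_t len)
{
	if (len > limit)
		return -ENOMEM;
	if (limit - len < addr)
		return -EINVAL;
	return 0;
}
```

With an assumed TASK_SIZE of 0x80000000 and an assumed stack top of
0x7fff8000, a 2M MAP_FIXED request at 0x7fe00000 passes the TASK_SIZE check
but would have been rejected by the old STACK_TOP check.  Only ranges that
reach into the gap between the two limits are affected.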

Thanks,

-- 
Peter Xu



* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 15:57   ` Lorenzo Stoakes
@ 2025-06-13 17:00     ` Pedro Falcato
  0 siblings, 0 replies; 77+ messages in thread
From: Pedro Falcato @ 2025-06-13 17:00 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Peter Xu, linux-kernel, linux-mm, kvm, Andrew Morton,
	Alex Williamson, Zi Yan, Jason Gunthorpe, Alex Mastro,
	David Hildenbrand, Nico Pache, Liam R. Howlett, Vlastimil Babka,
	Jann Horn

On Fri, Jun 13, 2025 at 04:57:12PM +0100, Lorenzo Stoakes wrote:
> You've not cc'd maintainers/reviewers of mm/mmap.c, please make sure to do so.
> 
> +cc Liam
> +cc Vlastimiil
> +cc Jann
> +cc Pedro
> 
> ...!
> 
> On Fri, Jun 13, 2025 at 09:41:07AM -0400, Peter Xu wrote:
> > Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> > the helper instead to dedup the lines.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This looks fine though, so:
> 
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Reviewed-by: Pedro Falcato <pfalcato@suse.de>

Looks good, thanks!

-- 
Pedro


* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
                     ` (3 preceding siblings ...)
  2025-06-13 15:57   ` Lorenzo Stoakes
@ 2025-06-13 18:00   ` David Hildenbrand
  2025-06-16  8:01   ` David Laight
  5 siblings, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2025-06-13 18:00 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, Nico Pache

On 13.06.25 15:41, Peter Xu wrote:
> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 16:24     ` Peter Xu
@ 2025-06-13 18:01       ` David Hildenbrand
  0 siblings, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2025-06-13 18:01 UTC (permalink / raw)
  To: Peter Xu, Zi Yan
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, Nico Pache, Huacai Chen,
	Thomas Bogendoerfer, Muchun Song, Oscar Salvador, loongarch,
	linux-mips

> 
> But if we want to be super accurate, it's indeed different, in that the old
> hugetlb code was checking stack top with STACK_TOP, which is
> mips_stack_top() for MIPS: it's a value that might be slightly less than
> TASK_SIZE..
> 
> So strictly speaking, there's indeed a trivial difference on the oddity of
> defining stack top, but my guess is nothing will be affected.  I can add
> some explanation into the commit message in that case.

Yeah, that would be good.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
  2025-06-13 14:18   ` Jason Gunthorpe
@ 2025-06-13 18:03   ` David Hildenbrand
  2025-06-14 14:46   ` kernel test robot
  2 siblings, 0 replies; 77+ messages in thread
From: David Hildenbrand @ 2025-06-13 18:03 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, Nico Pache

On 13.06.25 15:41, Peter Xu wrote:
> Add a hook to vfio_device_ops to allow sub-modules provide virtual
> addresses for an mmap() request.
> 
> Note that the fallback will be mm_get_unmapped_area(), which should
> maintain the old behavior of generic VA allocation (__get_unmapped_area).
> It's a bit unfortunate that is needed, as the current get_unmapped_area()
> file ops cannot support a retval which fallbacks to the default.  So that
> is needed both here and whenever sub-module will opt-in with its own.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 13:41 ` [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
  2025-06-13 14:29   ` Jason Gunthorpe
@ 2025-06-13 18:09   ` David Hildenbrand
  2025-06-13 19:21     ` Peter Xu
       [not found]   ` <20250613174442.1589882-1-amastro@fb.com>
  2 siblings, 1 reply; 77+ messages in thread
From: David Hildenbrand @ 2025-06-13 18:09 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm, kvm
  Cc: Andrew Morton, Alex Williamson, Zi Yan, Jason Gunthorpe,
	Alex Mastro, Nico Pache

On 13.06.25 15:41, Peter Xu wrote:
> This patch enables best-effort mmap() for vfio-pci bars even without
> MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It should
> also avoid userspace changes (switching to MAP_FIXED with pre-aligned VA
> addresses) to start enabling huge pfnmaps on VFIO bars.
> 
> Here the trick is making sure the MMIO PFNs will be aligned with the VAs
> allocated from mmap() when !MAP_FIXED, so that whatever returned from
> mmap(!MAP_FIXED) of vfio-pci MMIO regions will be automatically suitable
> for huge pfnmaps as much as possible.
> 
> To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
> devices is needed.
> 
> Note that MMIO physical addresses should normally be guaranteed to be
> always bar-size aligned, hence the bar offset can logically be directly
> used to do the calculation.  However to make it strict and clear (rather
> than relying on spec details), we still try to fetch the bar's physical
> addresses from pci_dev.resource[].
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

There is likely a

Co-developed-by: Alex Williamson <alex.williamson@redhat.com>

missing?

> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   drivers/vfio/pci/vfio_pci.c      |  3 ++
>   drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
>   include/linux/vfio_pci_core.h    |  6 +++
>   3 files changed, 74 insertions(+)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 5ba39f7623bb..d9ae6cdbea28 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
>   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
>   	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
>   	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
> +#endif
>   };
>   
>   static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index 6328c3a05bcd..835bc168f8b7 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
>   	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
>   }
>   
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> +/*
> + * Hint function to provide mmap() virtual address candidate so as to be
> + * able to map huge pfnmaps as much as possible.  It is done by aligning
> + * the VA to the PFN to be mapped in the specific bar.
> + *
> + * Note that this function does the minimum check on mmap() parameters to
> + * make the PFN calculation valid only. The majority of mmap() sanity check
> + * will be done later in mmap().
> + */
> +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> +					      struct file *file,
> +					      unsigned long addr,
> +					      unsigned long len,
> +					      unsigned long pgoff,
> +					      unsigned long flags)

A very suboptimal way to indent this many parameters; just use two tabs 
at the beginning.

> +{
> +	struct vfio_pci_core_device *vdev =
> +		container_of(device, struct vfio_pci_core_device, vdev);
> +	struct pci_dev *pdev = vdev->pdev;
> +	unsigned long ret, phys_len, req_start, phys_addr;
> +	unsigned int index;
> +
> +	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

Could do

unsigned int index =  pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);

at the very top.

> +
> +	/* Currently, only bars 0-5 supports huge pfnmap */
> +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> +		goto fallback;
> +
> +	/* Bar offset */
> +	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> +
> +	/*
> +	 * Make sure we at least can get a valid physical address to do the
> +	 * math.  If this happens, it will probably fail mmap() later..
> +	 */
> +	if (req_start >= phys_len)
> +		goto fallback;
> +
> +	phys_len = MIN(phys_len, len);
> +	/* Calculate the start of physical address to be mapped */
> +	phys_addr = pci_resource_start(pdev, index) + req_start;
> +
> +	/* Choose the alignment */
> +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PUD_SIZE, 0);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (phys_len >= PMD_SIZE) {
> +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> +						   flags, PMD_SIZE, 0);
> +		if (ret)
> +			return ret;

Similar to Jason, I wonder if that logic should reside in the core, and 
we only indicate the maximum page table level we support.

			   unsigned int order)
>   {
> diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
> index fbb472dd99b3..e59699e01901 100644
> --- a/include/linux/vfio_pci_core.h
> +++ b/include/linux/vfio_pci_core.h
> @@ -119,6 +119,12 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
>   		size_t count, loff_t *ppos);
>   ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
>   		size_t count, loff_t *ppos);
> +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> +					      struct file *file,
> +					      unsigned long addr,
> +					      unsigned long len,
> +					      unsigned long pgoff,
> +					      unsigned long flags);

Ditto.

-- 
Cheers,

David / dhildenb



* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 16:00       ` Jason Gunthorpe
@ 2025-06-13 18:31         ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 18:31 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache, Baolin Wang,
	Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 01:00:20PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 11:13:58AM -0400, Peter Xu wrote:
> > > I didn't intuitively guess how it works or why there are two
> > > length/size arguments. It seems to have an exciting return code as
> > > well.
> > > 
> > > I suppose size is the alignment target? Maybe rename the parameter too?
> > 
> > Yes, when the kdoc is there it'll be more obvious.  So far "size" is ok to
> > me, but if you have better suggestion please shoot - whatever I came up
> > with so far seems to be too long, and maybe not necessary when kdoc will be
> > available too.
> 
> I would call it align not size

Sure thing.

> 
> > > For the purposes of VFIO do we need to be careful about math overflow here:
> > > 
> > > 	loff_t off_end = off + len;
> > > 	loff_t off_align = round_up(off, size);
> > > 
> > > ?
> > 
> > IIUC the 1st one was covered by the latter check here:
> > 
> >         (off + len_pad) < off
> > 
> > Indeed I didn't see what makes sure the 2nd won't overflow.
> 
> I'm not sure the < tests are safe in this modern world. I would use
> the overflow helpers directly and remove the < overflow checks.

Good to learn the traps, and I also wasn't aware of the helpers.  I'll
switch to that, thanks!
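
For reference, a userspace sketch of what the checked arithmetic could look
like.  The kernel helpers in question are check_add_overflow() and friends
from <linux/overflow.h>; the sketch below uses the GCC/Clang builtin
__builtin_add_overflow() instead so that it builds outside the kernel.  The
function names are hypothetical, and int64_t stands in for loff_t.  In ISO C
a signed overflow is undefined behaviour, so a post-hoc "<" comparison is not
something the language guarantees to work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compute off + len.  Returns true if the addition overflowed, in which
 * case *end must not be used.
 */
bool range_end_checked(int64_t off, int64_t len, int64_t *end)
{
	return __builtin_add_overflow(off, len, end);
}

/*
 * Round a non-negative @off up to a multiple of @align (a power of two).
 * Returns true on overflow, in which case *out is left untouched.
 */
bool round_up_checked(int64_t off, int64_t align, int64_t *out)
{
	int64_t tmp;

	assert(off >= 0);
	assert(align > 0 && (align & (align - 1)) == 0);

	if (__builtin_add_overflow(off, align - 1, &tmp))
		return true;
	*out = tmp & ~(align - 1);
	return false;
}
```

Both "off + len" and "round_up(off, align)" from the snippet quoted above are
covered this way, without relying on wrap-around comparisons.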

> 
> > +/**
> > + * mm_get_unmapped_area_aligned - Allocate an aligned virtual address
> > + * @filp: file target of the mmap() request
> > + * @addr: hint address from mmap() request
> > + * @len: len of the mmap() request
> > + * @off: file offset of the mmap() request
> > + * @flags: flags of the mmap() request
> > + * @size: the size of alignment the caller requests
> 
> Just "the alignment the caller requests"

Sure.

> 
> > + * @vm_flags: the vm_flags passed from get_unmapped_area() caller
> > + *
> > + * This function should normally be used by a driver's specific
> > + * get_unmapped_area() handler to provide a properly aligned virtual
> > + * address for a specific mmap() request.  The caller should pass in most
> > + * of the parameters from the get_unmapped_area() request, but properly
> > + * specify @size as the alignment needed.
> 
>  .. "The function willl try to return a VMA starting address such that
>  ret % size == 0"

This is not true though when pgoff isn't aligned..

For example, an allocation with (len=32M, size=2M, pgoff=1M) will return an
address that is N*2M+1M, so that starting from pgoff=2M it'll be completely
aligned.  In this case the returned mmap() address must not be aligned to
make it happen, and the range within pgoff=1M-2M will be mapped with 4K.
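
The arithmetic behind that example can be sketched in userspace as below.
It is modelled on the final "off_sub" adjustment of the helper being renamed,
applied to whatever address the padded allocation returned;
`align_va_to_offset()` is a hypothetical name and the raw address used in
the usage note is made up.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Shift @raw_va forward by less than @align so that the result is
 * congruent to @off modulo @align.  Unsigned wrap-around in the
 * subtraction is intentional and well defined.
 */
uint64_t align_va_to_offset(uint64_t raw_va, uint64_t off, uint64_t align)
{
	uint64_t off_sub;

	assert(align && (align & (align - 1)) == 0);

	off_sub = (off - raw_va) & (align - 1);
	return raw_va + off_sub;
}
```

For off=1M and align=2M, a raw address of 0x7f0000000000 becomes
0x7f0000100000, i.e. N*2M+1M.  The address that maps file offset 2M is then
0x7f0000200000, which is 2M aligned, so everything from there on can use
huge mappings.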

-- 
Peter Xu



* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 15:19   ` Zi Yan
@ 2025-06-13 18:33     ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 18:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Jason Gunthorpe, Alex Mastro, David Hildenbrand, Nico Pache,
	Baolin Wang, Lorenzo Stoakes, Liam R. Howlett, Ryan Roberts,
	Dev Jain, Barry Song

On Fri, Jun 13, 2025 at 11:19:30AM -0400, Zi Yan wrote:
> > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> >  		unsigned long addr, unsigned long len,
> >  		loff_t off, unsigned long flags, unsigned long size,
> 
> Since you added aligned suffix, renaming size to alignment might
> help improve readability.

I'll use "align" per Jason's suggestion, assuming it's ok and shorter.

> 
> Otherwise, Reviewed-by: Zi Yan <ziy@nvidia.com>

I'll take this though, thanks Zi.

-- 
Peter Xu



* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 15:36   ` Lorenzo Stoakes
@ 2025-06-13 18:45     ` Peter Xu
  2025-06-13 19:18       ` Lorenzo Stoakes
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 18:45 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 04:36:57PM +0100, Lorenzo Stoakes wrote:
> On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> > This function is pretty handy for any type of VMA to provide a size-aligned
> > VMA address when mmap().  Rename the function and export it.
> 
> This isn't a great commit message, 'to provide a size-aligned VMA address when
> mmap()' is super unclear - do you mean 'to provide an unmapped address that is
> also aligned to the specified size'?

I sincerely don't know the difference, not a native speaker here..
Suggestions welcomed, I can update to whatever both of us agree on.

> 
> I think you should also specify your motive, renaming and exporting something
> because it seems handy isn't sufficient justifiation.
> 
> Also why would we need to export this? What modules might want to use this? I'm
> generally not a huge fan of exporting things unless we strictly have to.

It's one of the major reasons why I sent this together with the VFIO
patches.  It'll be used by the VFIO patches that are in the same series.  I
will mention it in the commit message when I repost.

> 
> >
> > About the rename:
> >
> >   - Dropping "THP" because it doesn't really have much to do with THP
> >     internally.
> 
> Well the function seems specifically tailored to the THP use. I think you'll
> need to further adjust this.

Actually.. it is almost exactly what I need so far.  I can justify it below.

> 
> >
> >   - The suffix "_aligned" imply it is a helper to generate aligned virtual
> >     address based on what is specified (which can be not PMD_SIZE).
> 
> Ack this is sensible!
> 
> >
> > Cc: Zi Yan <ziy@nvidia.com>
> > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Dev Jain <dev.jain@arm.com>
> > Cc: Barry Song <baohua@kernel.org>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/huge_mm.h | 14 +++++++++++++-
> >  mm/huge_memory.c        |  6 ++++--
> >  2 files changed, 17 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 2f190c90192d..706488d92bb6 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> 
> Why are we keeping everything in huge_mm.h, huge_memory.c if this is being made
> generic?
> 
> Surely this should be moved out into mm/mmap.c no?

No objections, but I suggest a separate discussion and patch submission,
since the original function already resides in huge_memory.c.  Hope that's
ok with you.

> 
> > @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> >  		vm_flags_t vm_flags);
> > -
> > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > +		unsigned long addr, unsigned long len,
> > +		loff_t off, unsigned long flags, unsigned long size,
> > +		vm_flags_t vm_flags);
> 
> I echo Jason's comments about a kdoc and explanation of what this function does.
> 
> >  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >  		unsigned int new_order);
> > @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> >  	return 0;
> >  }
> >
> > +static inline unsigned long
> > +mm_get_unmapped_area_aligned(struct file *filp,
> > +			     unsigned long addr, unsigned long len,
> > +			     loff_t off, unsigned long flags, unsigned long size,
> > +			     vm_flags_t vm_flags)
> > +{
> > +	return 0;
> > +}
> > +
> >  static inline bool
> >  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> >  {
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 4734de1dc0ae..52f13a70562f 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
> >  		folio_test_large_rmappable(folio);
> >  }
> >
> > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> >  		unsigned long addr, unsigned long len,
> >  		loff_t off, unsigned long flags, unsigned long size,
> >  		vm_flags_t vm_flags)
> > @@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> >  	ret += off_sub;
> >  	return ret;
> >  }
> > +EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);
> 
> I'm not convinced about exporting this... shouldn't be export only if we
> explicitly have a user?
> 
> I'd rather we didn't unless we needed to.
> 
> >
> >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > @@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> >  	unsigned long ret;
> >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> >
> > -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> > +	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
> > +					   PMD_SIZE, vm_flags);
> >  	if (ret)
> >  		return ret;
> >
> > --
> > 2.49.0
> >
> 
> So, you don't touch the original function but there's stuff there I think we
> need to think about if this is generalised.
> 
> E.g.:
> 
> 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
> 		return 0;
> 
> This still valid?

Yes.  I want this feature (for VFIO) to not be enabled on 32-bit systems,
and not to be enabled for compat syscalls.

> 
> 	/*
> 	 * The failure might be due to length padding. The caller will retry
> 	 * without the padding.
> 	 */
> 	if (IS_ERR_VALUE(ret))
> 		return 0;
> 
> This is assuming things the (currently single) caller will do, that is no longer
> an assumption you can make, especially if exported.

It's part of the core functionality we want from a generic helper.  We want
to know when the VA allocation, once padded, fails because of the padding,
so that the caller can decide what to do next.  It needs to fail properly
here.

> 
> Actually you maybe want to abstract the whole of thp_get_unmapped_area_vmflags()
> no? As this has a fallback mode?
> 
> 	/*
> 	 * Do not try to align to THP boundary if allocation at the address
> 	 * hint succeeds.
> 	 */
> 	if (ret == addr)
> 		return addr;

This is not a fallback.  When the user specifies a hint address (with or
without MAP_FIXED) and that address works, we should reuse that address,
ignoring the alignment requirement from the driver.  This is exactly the
behavior VFIO needs, and it should also be the suggested behavior for any
new drivers that would like to start using this generic helper.

> 
> What was that about this no longer being relevant to THP? :>)
> 
> Are all of these 'return 0' cases expected by any sensible caller? It seems like
> it's a way for thp_get_unmapped_area_vmflags() to recognise when to fall back to
> non-aligned?

Hope the above justifies everything.  It's my intention to reuse everything
here.  If you have any concerns about any of the "return 0" cases in the
function being exported, please shoot, and we can discuss.

Thanks,

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
       [not found]   ` <20250613174442.1589882-1-amastro@fb.com>
@ 2025-06-13 18:53     ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 18:53 UTC (permalink / raw)
  To: Alex Mastro
  Cc: akpm, alex.williamson, david, jgg, kvm, linux-kernel, linux-mm,
	npache, ziy

On Fri, Jun 13, 2025 at 10:44:42AM -0700, Alex Mastro wrote:
> Thank you Peter!
> 
> I packported this series to our 6.13.2 tree and validated that it does indeed
> provide equivalent, optimal faulting to our manual alignment approach when we
> mmap with !MAP_FIXED. This addresses the issue we discovered in [1].
> 
> The test case is performing mmap with offset=0x40006000000, size=0xdf9e00000,
> and we see that the head and tail (975) are faulted at 2M, and middle (54) at
> 1G. The vma returned by mmap looks nice: 0x7f8646000000.
> 
> $ sudo bpftrace -q -e 'fexit:vfio_pci_mmap_huge_fault { printf("order=%d, ret=0x%x\n", args.order, retval); }' 2>&1 > ~/dump
> $ cat ~/dump | sort | uniq -c | sort -nr
>     975 order=9, ret=0x100
>      54 order=18, ret=0x100
>       2 order=18, ret=0x800
> 
> [1] https://lore.kernel.org/linux-pci/20250529214414.1508155-1-amastro@fb.com/
> 
> Tested-by: Alex Mastro <amastro@fb.com>

Great to know it works as expected, thanks for the quick feedback!
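
The fault counts in the report can be reproduced with a little arithmetic.
The sketch below is userspace C with hypothetical helper names.  It assumes
VFIO_PCI_OFFSET_SHIFT is 40, 2M PMDs and 1G PUDs, and that the BAR's physical
start is at least 1G aligned, so that the physical address, the BAR offset
and the returned VA are all congruent modulo 1G.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_SHIFT	40		/* assumed VFIO_PCI_OFFSET_SHIFT */
#define SZ_2M		(1ULL << 21)
#define SZ_1G		(1ULL << 30)

struct fault_split {
	uint64_t pmd_faults;	/* 2M mappings in the unaligned head and tail */
	uint64_t pud_faults;	/* 1G mappings in the aligned middle */
};

/* Region index encoded in the upper bits of the mmap() file offset. */
unsigned int offset_to_index(uint64_t off)
{
	return (unsigned int)(off >> OFFSET_SHIFT);
}

/* Offset within the BAR, i.e. the file offset with the index masked out. */
uint64_t offset_to_bar_offset(uint64_t off)
{
	return off & ((1ULL << OFFSET_SHIFT) - 1);
}

/*
 * Split [start, start + len) into a 1G-aligned middle and the remaining
 * head and tail, counting one fault per 1G or 2M mapping respectively.
 */
struct fault_split split_range(uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t mid_start = (start + SZ_1G - 1) & ~(SZ_1G - 1);
	uint64_t mid_end = end & ~(SZ_1G - 1);
	struct fault_split s = { 0, 0 };

	assert(start % SZ_2M == 0 && len % SZ_2M == 0);

	if (mid_start >= mid_end) {
		/* No 1G-aligned chunk fits: everything is mapped with 2M. */
		s.pmd_faults = len / SZ_2M;
		return s;
	}
	s.pud_faults = (mid_end - mid_start) / SZ_1G;
	s.pmd_faults = (mid_start - start) / SZ_2M + (end - mid_end) / SZ_2M;
	return s;
}
```

For offset=0x40006000000 this decodes to BAR 4 at BAR offset 0x6000000 (96M).
Splitting the returned VA 0x7f8646000000 with size=0xdf9e00000 gives 464 head
plus 511 tail faults at 2M (975 in total) and 54 faults at 1G, matching the
bpftrace output.  The two order=18 results with ret=0x800 would be consistent
with one PUD attempt each in the head and the tail falling back to 2M, though
that reading is an inference from the numbers rather than something stated in
the report.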

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 16:09       ` Jason Gunthorpe
@ 2025-06-13 19:15         ` Peter Xu
  2025-06-13 23:16           ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 19:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 01:09:56PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 11:26:40AM -0400, Peter Xu wrote:
> > On Fri, Jun 13, 2025 at 11:29:03AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 13, 2025 at 09:41:11AM -0400, Peter Xu wrote:
> > > 
> > > > +	/* Choose the alignment */
> > > > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +						   flags, PUD_SIZE, 0);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > > +
> > > > +	if (phys_len >= PMD_SIZE) {
> > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > +						   flags, PMD_SIZE, 0);
> > > > +		if (ret)
> > > > +			return ret;
> > > > +	}
> > > 
> > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > 4k ARM with we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > contiguity to get a 32*16k=1GB option.
> > > 
> > > Forcing to only align to the PMD or PUD seems suboptimal..
> > 
> > Right, however the cont-pte / cont-pmd are still not supported in huge
> > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > from ARM perspective, then provide support of both in one shot.
> 
> Maybe leave behind a comment about this. I've been poking around if
> somone would do the ARM PFNMAP support but can't report any commitment.

I wasn't sure where best to leave a note for the whole pfnmap effort, so I
added one to the commit message of this patch:

        Note 2: Currently contiguous pgtable entries (for example, cont-pte) are not
        yet supported for huge pfnmaps in general.  They are also not considered in
        this patch so far.  Separate work will be needed to enable contiguous
        pgtable entries on archs that support them.

> 
> > > > +fallback:
> > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > 
> > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > thp_get_unmapped_area_vmflags() too?
> > > 
> > > Is there any reason the caller should have to do a retry?
> > 
> > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > as a generic helper.
> 
> There is only one caller for thp_get_unmapped_area_vmflags(), just
> open code PMD_SIZE there and thin this whole thing out. It reads
> better like that anyhow:
> 
> 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> 		   && !addr /* no hint */
> 		   && IS_ALIGNED(len, PMD_SIZE)) {
> 		/* Ensures that larger anonymous mappings are THP aligned. */
> 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> 						    flags, vm_flags, PMD_SIZE);
> 
> > That was ok, however that loses some flexibility when the caller wants to
> > try with different alignments, exactly like above: currently, it was trying
> > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> 
> Oh, that's a good point, I didn't notice that subtle bit.
> 
> But then maybe that is showing the API is just wrong and the core code
> should be trying to find the best alignment not the caller. Like we
> can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> 
> VFIO would just pass the BAR size, implying the best alignment, and
> the core implementation will try to get the largest VMA alignment that
> snaps to an arch supported page contiguity, testing each of the arches
> page size possibilities in turn.
> 
> That sounds like a much better API than pushing this into drivers??

Yes, it would be nice if the core mm could evolve to make this easier to
support.  The question, though, is how to pass the information over to core mm.

For example, currently a vfio device file represents the whole device, and
it's VFIO that defines what the MMIO region offsets mean.  So core mm has no
simple way to tell which BAR VFIO is mapping if it only receives an mmap()
request.  Even if we assume the core mm provides some vma flag for this, it
wouldn't be a simple per-vma property: it would need to be decided case by
case for each mmap() request, based at least on the pgoff and len being
mapped.

And one device can certainly have BARs of different sizes, so it will ask
for different alignments across mmap() calls even on the same device fd.
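
To make the per-request nature concrete, here is a minimal userspace sketch
of the alignment choice, mirroring the PUD-then-PMD order used in patch 5.
The sizes assume x86-64 with 4 KiB base pages, and pick_alignment() and the
SKETCH_* constants are hypothetical names for illustration, not kernel API:

```c
#include <stdbool.h>

/* Assumed sizes for illustration only (x86-64, 4 KiB base pages). */
#define SKETCH_PAGE_SIZE (1UL << 12)
#define SKETCH_PMD_SIZE  (1UL << 21)
#define SKETCH_PUD_SIZE  (1UL << 30)

/*
 * Hypothetical helper: pick the largest alignment worth trying for a
 * mapping of map_len bytes.  pud_ok models whether the arch supports
 * PUD-level pfnmaps.
 */
unsigned long pick_alignment(unsigned long map_len, bool pud_ok)
{
	if (pud_ok && map_len >= SKETCH_PUD_SIZE)
		return SKETCH_PUD_SIZE;
	if (map_len >= SKETCH_PMD_SIZE)
		return SKETCH_PMD_SIZE;
	return SKETCH_PAGE_SIZE;
}
```

Two mmap() calls on the same fd can therefore end up with different
alignments, simply because they target BARs of different sizes.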

This is similar to many other get_unmapped_area() users.  For example, see
v4l2_m2m_get_unmapped_area(), which likewise needs to know at least which
part of the file is being mapped:

	if (offset < DST_QUEUE_OFF_BASE) {
		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
	} else {
		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
	}

Such flexibility might still be needed for now, until we know how to provide
the abstraction.
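
For comparison, vfio-pci decodes its own offsets in a similar
driver-private way.  The following userspace sketch mirrors the arithmetic
quoted from patch 5; it assumes 4 KiB pages and a region shift of 40 bits,
and decode_pgoff() and struct region_req are hypothetical names:

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT   12	/* assumption: 4 KiB pages */
#define SKETCH_OFFSET_SHIFT 40	/* assumption: value of VFIO_PCI_OFFSET_SHIFT */

struct region_req {
	unsigned int index;	/* which region (BAR) the request targets */
	uint64_t start;		/* byte offset inside that region */
};

/* Split an mmap() page offset into region index and in-region offset. */
struct region_req decode_pgoff(uint64_t pgoff)
{
	struct region_req r;

	r.index = (unsigned int)(pgoff >>
				 (SKETCH_OFFSET_SHIFT - SKETCH_PAGE_SHIFT));
	r.start = (pgoff << SKETCH_PAGE_SHIFT) &
		  ((UINT64_C(1) << SKETCH_OFFSET_SHIFT) - 1);
	return r;
}
```

Only the driver knows this layout, which is why core mm cannot tell from
pgoff alone which BAR is being mapped.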

Meanwhile, existing get_unmapped_area() users can have other constraints,
where the decision may depend on any parameter passed in, not just the
pgoff.  So even if we provide the whole pgoff info, it might not be enough.

-- 
Peter Xu



* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 18:45     ` Peter Xu
@ 2025-06-13 19:18       ` Lorenzo Stoakes
  2025-06-13 20:34         ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Lorenzo Stoakes @ 2025-06-13 19:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 02:45:31PM -0400, Peter Xu wrote:
> On Fri, Jun 13, 2025 at 04:36:57PM +0100, Lorenzo Stoakes wrote:
> > On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> > > This function is pretty handy for any type of VMA to provide a size-aligned
> > > VMA address when mmap().  Rename the function and export it.
> >
> > This isn't a great commit message, 'to provide a size-aligned VMA address when
> > mmap()' is super unclear - do you mean 'to provide an unmapped address that is
> > also aligned to the specified size'?
>
> I sincerely don't know the difference, not a native speaker here..
> Suggestions welcomed, I can update to whatever both of us agree on.

Sure. Sorry, I don't mean to be pedantic, I just think it would be clearer to
expand on this a little, as the commit message is rather short.

I think saying something like 'this function allows you to locate an
unmapped region which is aligned to the specified size' should suffice.

>
> >
> > I think you should also specify your motive, renaming and exporting something
> > because it seems handy isn't sufficient justification.
> >
> > Also why would we need to export this? What modules might want to use this? I'm
> > generally not a huge fan of exporting things unless we strictly have to.
>
> It's one of the major reasons why I sent this together with the VFIO
> patches.  It'll be used in VFIO patches that is in the same series.  I will
> mention it in the commit message when repost.

OK cool, I've not dug through those as not my area, really it's about
having the appropriate justification.

I'm always inclined to not want us to export things by default, based on
experience of finding 'unusual' uses of various mm interfaces in drivers in
the past which have caused problems :)

But of course there are situations that warrant it, they just need to be
spelled out.

>
> >
> > >
> > > About the rename:
> > >
> > >   - Dropping "THP" because it doesn't really have much to do with THP
> > >     internally.
> >
> > Well the function seems specifically tailored to the THP use. I think you'll
> > need to further adjust this.
>
> Actually.. it is almost exactly what I need so far.  I can justify it below.

Yeah, but it's not a general function that gives you an unmapped area that
is aligned.

It's a 'function that gets you an aligned unmapped area but only for 64-bit
kernels and when you are not invoking it from a compat syscall and returns
0 instead of errors'.

This doesn't sound general to me?

>
> >
> > >
> > >   - The suffix "_aligned" imply it is a helper to generate aligned virtual
> > >     address based on what is specified (which can be not PMD_SIZE).
> >
> > Ack this is sensible!
> >
> > >
> > > Cc: Zi Yan <ziy@nvidia.com>
> > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Dev Jain <dev.jain@arm.com>
> > > Cc: Barry Song <baohua@kernel.org>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > ---
> > >  include/linux/huge_mm.h | 14 +++++++++++++-
> > >  mm/huge_memory.c        |  6 ++++--
> > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index 2f190c90192d..706488d92bb6 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> >
> > Why are we keeping everything in huge_mm.h, huge_memory.c if this is being made
> > generic?
> >
> > Surely this should be moved out into mm/mmap.c no?
>
> No objections, but I suggest a separate discussion and patch submission
> when the original function resides in huge_memory.c.  Hope it's ok for you.

I like to be as flexible as I can be in review, but I'm afraid I'm going to
have to be annoying about this one :)

It simply makes no sense to have non-THP stuff in 'the THP file'. Also this
makes this a general memory mapping function that should live with the
other related code.

I don't really think much discussion is required here? You could do this as
2 separate commits if that'd make life easier?

Sorry to be a pain here, but I'm really allergic to our having random
unrelated things in the wrong files, it's something mm has done rather too
much...

>
> >
> > > @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > >  		vm_flags_t vm_flags);
> > > -
> > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > > +		unsigned long addr, unsigned long len,
> > > +		loff_t off, unsigned long flags, unsigned long size,
> > > +		vm_flags_t vm_flags);
> >
> > I echo Jason's comments about a kdoc and explanation of what this function does.
> >
> > >  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> > >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > >  		unsigned int new_order);
> > > @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > >  	return 0;
> > >  }
> > >
> > > +static inline unsigned long
> > > +mm_get_unmapped_area_aligned(struct file *filp,
> > > +			     unsigned long addr, unsigned long len,
> > > +			     loff_t off, unsigned long flags, unsigned long size,
> > > +			     vm_flags_t vm_flags)
> > > +{
> > > +	return 0;
> > > +}
> > > +
> > >  static inline bool
> > >  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> > >  {
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 4734de1dc0ae..52f13a70562f 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
> > >  		folio_test_large_rmappable(folio);
> > >  }
> > >
> > > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > >  		unsigned long addr, unsigned long len,
> > >  		loff_t off, unsigned long flags, unsigned long size,
> > >  		vm_flags_t vm_flags)
> > > @@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> > >  	ret += off_sub;
> > >  	return ret;
> > >  }
> > > +EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);
> >
> > > I'm not convinced about exporting this... shouldn't we export only if we
> > explicitly have a user?
> >
> > I'd rather we didn't unless we needed to.
> >
> > >
> > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > > @@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> > >  	unsigned long ret;
> > >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> > >
> > > -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> > > +	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
> > > +					   PMD_SIZE, vm_flags);
> > >  	if (ret)
> > >  		return ret;
> > >
> > > --
> > > 2.49.0
> > >
> >
> > So, you don't touch the original function but there's stuff there I think we
> > need to think about if this is generalised.
> >
> > E.g.:
> >
> > 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
> > 		return 0;
> >
> > This still valid?
>
> Yes.  I want this feature (for VFIO) to not be enabled on 32bits, and not
> enabled with compat syscalls.

OK, but then is this a 'general' function any more?

These checks were introduced by commit 4ef9ad19e176 ("mm: huge_memory:
don't force huge page alignment on 32 bit") and so are _absolutely
specifically_ intended for a THP use-case.

And now they _just happen_ to be useful to you but nothing about the
function name suggests that this is the case?

I mean it seems like you should be doing this check separately in both VFIO
and THP code, and having the 'general' function not do this, no?

>
> >
> > 	/*
> > 	 * The failure might be due to length padding. The caller will retry
> > 	 * without the padding.
> > 	 */
> > 	if (IS_ERR_VALUE(ret))
> > 		return 0;
> >
> > This is assuming things the (currently single) caller will do, that is no longer
> > an assumption you can make, especially if exported.
>
> It's part of core function we want from a generic helper.  We want to know
> when the va allocation, after padded, would fail due to the padding. Then
> the caller can decide what to do next.  It needs to fail here properly.

I'm not sure I understand what you mean?

It's not just this case, it's basically any error condition results in 0.

It's actually quite dangerous, as the get_unmapped_area() functions are
meant to return either an error value or the located address _and zero is a
valid response_.

So if somebody used this function naively, they'd potentially have a very
nasty bug occur when an error arose.

If you want to export this, I just don't think we can have this be a thing
here.
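
To illustrate the hazard, here is a small userspace model of the usual
error convention.  sketch_is_err_value() is a hypothetical stand-in for the
kernel's IS_ERR_VALUE(), assuming the usual 4095-value errno window at the
top of the address range:

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: the top 4095 values of unsigned long encode -errno. */
#define SKETCH_MAX_ERRNO 4095UL

/* Userspace stand-in for the kernel's IS_ERR_VALUE(). */
bool sketch_is_err_value(unsigned long v)
{
	return v >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/*
 * A naive caller that applies the usual get_unmapped_area() check.
 * It accepts anything that is not an encoded errno, including 0.
 */
bool naive_caller_accepts(unsigned long ret)
{
	return !sketch_is_err_value(ret);
}
```

With this check, an encoded -ENOMEM is rejected as expected, but a 0 that
was meant to signal failure is accepted as if it were a located address.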

>
> >
> > Actually you maybe want to abstract the whole of thp_get_unmapped_area_vmflags()
> > no? As this has a fallback mode?
> >
> > 	/*
> > 	 * Do not try to align to THP boundary if allocation at the address
> > 	 * hint succeeds.
> > 	 */
> > 	if (ret == addr)
> > 		return addr;
>
> This is not a fallback. This is when user specified a hint address (no
> matter with / without MAP_FIXED), if that address works then we should
> reuse that address, ignoring the alignment requirement from the driver.
> This is exactly the behavior VFIO needs, and this should also be the
> suggested behavior for whatever new drivers that would like to start using
> this generic helper.

I didn't say this was the fallback :) this just happened to be the code
underneath my comment. Sorry if that wasn't clear.

This is another kinda non-general thing, but one that makes more sense. The
comment obviously needs updating, however. You could just delete 'THP' from
the comment; that would probably do it.

The fallback is in:

unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
		unsigned long len, unsigned long pgoff, unsigned long flags,
		vm_flags_t vm_flags)
{
	unsigned long ret;
	loff_t off = (loff_t)pgoff << PAGE_SHIFT;

	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
	if (ret)
		return ret;

So here, if ret returns an address, then it's fine we return that.

Otherwise, we invoke the below (the fallback):

	return mm_get_unmapped_area_vmflags(current->mm, filp, addr, len, pgoff, flags,
					    vm_flags);
}

>
> >
> > What was that about this no longer being relevant to THP? :>)
> >
> > Are all of these 'return 0' cases expected by any sensible caller? It seems like
> > it's a way for thp_get_unmapped_area_vmflags() to recognise when to fall back to
> > non-aligned?
>
> Hope above justifies everything.  It's my intention to reuse everything
> here.  If you have any concern on any of the "return 0" cases in the
> function being exported, please shoot, we can discuss.

Of course, I have some doubts here :)

>
> Thanks,
>
> --
> Peter Xu
>

To be clearer perhaps, what I think would work here is:

1. Remove the CONFIG_64BIT, in_compat_syscall() check and place it in THP
   and VFIO code separately, as this isn't a general thing.

2. Rather than return 0 in this function, return error codes so it matches
   the other mm_get_unmapped_area_*() functions.

3. Adjust thp_get_unmapped_area_vmflags() to detect the error value from
   this function and do the fallback logic in this case. There's no need
   for this 0 stuff (and it's possibly broken actually, since _in theory_
   you can get unmapped zero).

4. (sorry :) move the code to mm/mmap.c

5. Obviously address comments from others, most importantly (in my view)
   ensuring that there is a good kernel doc comment around the function.

6. Put the justification for exporting the function + stuff about VFIO in
   the commit message + expand it a little bit as discussed.

7. Other small stuff raised above (e.g. remove 'THP' comment etc.)
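
As a rough userspace sketch of points 2 and 3 (not a proposed patch), the
aligned helper reports failure with an error value and the caller falls
back on that, never on 0.  All names here are hypothetical stand-ins, and
the addresses are made up for illustration:

```c
#include <errno.h>
#include <stdbool.h>

/* Assumption: the top 4095 values of unsigned long encode -errno. */
#define SKETCH_MAX_ERRNO 4095UL

bool sketch_is_err(unsigned long v)
{
	return v >= (unsigned long)-SKETCH_MAX_ERRNO;
}

/* Stand-in for the aligned search: returns -ENOMEM when it cannot align. */
unsigned long sketch_get_aligned(bool can_align)
{
	return can_align ? 0x40000000UL : (unsigned long)-ENOMEM;
}

/* Stand-in for the plain, unaligned search. */
unsigned long sketch_get_plain(void)
{
	return 0x12345000UL;
}

/* Caller-side fallback keyed on the error value rather than on 0. */
unsigned long sketch_get_unmapped(bool can_align)
{
	unsigned long ret = sketch_get_aligned(can_align);

	if (!sketch_is_err(ret))
		return ret;
	return sketch_get_plain();
}
```

The fallback policy then lives entirely in the caller, and the helper keeps
the same return contract as the rest of the get_unmapped_area() family.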

Again, sorry to be a pain, but I think we need to be careful to get this
right so we don't leave any footguns for ourselves in the future with
'implicit' stuff.

Thanks!


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 18:09   ` David Hildenbrand
@ 2025-06-13 19:21     ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-13 19:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, Nico Pache

On Fri, Jun 13, 2025 at 08:09:41PM +0200, David Hildenbrand wrote:
> On 13.06.25 15:41, Peter Xu wrote:
> > This patch enables best-effort mmap() for vfio-pci bars even without
> > MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It should
> > also avoid userspace changes (switching to MAP_FIXED with pre-aligned VA
> > addresses) to start enabling huge pfnmaps on VFIO bars.
> > 
> > Here the trick is making sure the MMIO PFNs will be aligned with the VAs
> > allocated from mmap() when !MAP_FIXED, so that whatever returned from
> > mmap(!MAP_FIXED) of vfio-pci MMIO regions will be automatically suitable
> > for huge pfnmaps as much as possible.
> > 
> > To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
> > devices is needed.
> > 
> > Note that MMIO physical addresses should normally be guaranteed to be
> > always bar-size aligned, hence the bar offset can logically be directly
> > used to do the calculation.  However to make it strict and clear (rather
> > than relying on spec details), we still try to fetch the bar's physical
> > addresses from pci_dev.resource[].
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> 
> There is likely a
> 
> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> 
> missing?

Would it mean the same if we use the two SoBs, as this patch does?  I
sincerely don't know the difference.  I hope it's a fine way to show that
this patch was developed together.  Please let me know otherwise.

> 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >   drivers/vfio/pci/vfio_pci.c      |  3 ++
> >   drivers/vfio/pci/vfio_pci_core.c | 65 ++++++++++++++++++++++++++++++++
> >   include/linux/vfio_pci_core.h    |  6 +++
> >   3 files changed, 74 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 5ba39f7623bb..d9ae6cdbea28 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -144,6 +144,9 @@ static const struct vfio_device_ops vfio_pci_ops = {
> >   	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
> >   	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
> >   	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > +	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
> > +#endif
> >   };
> >   static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index 6328c3a05bcd..835bc168f8b7 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1641,6 +1641,71 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
> >   	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
> >   }
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > +/*
> > + * Hint function to provide mmap() virtual address candidate so as to be
> > + * able to map huge pfnmaps as much as possible.  It is done by aligning
> > + * the VA to the PFN to be mapped in the specific bar.
> > + *
> > + * Note that this function does the minimum check on mmap() parameters to
> > + * make the PFN calculation valid only. The majority of mmap() sanity check
> > + * will be done later in mmap().
> > + */
> > +unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
> > +					      struct file *file,
> > +					      unsigned long addr,
> > +					      unsigned long len,
> > +					      unsigned long pgoff,
> > +					      unsigned long flags)
> 
> A very suboptimal way to indent this many parameters; just use two tabs at
> the beginning.

This is the default indentation from Emacs c-mode.

Since this is a VFIO file, I checked the file and looks like there's not
yet a strict rule of indentation across the whole file.  I can switch to
two-tabs for sure if nobody else disagrees.

> 
> > +{
> > +	struct vfio_pci_core_device *vdev =
> > +		container_of(device, struct vfio_pci_core_device, vdev);
> > +	struct pci_dev *pdev = vdev->pdev;
> > +	unsigned long ret, phys_len, req_start, phys_addr;
> > +	unsigned int index;
> > +
> > +	index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
> Could do
> 
> unsigned int index =  pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
> 
> at the very top.

Sure.

> 
> > +
> > +	/* Currently, only bars 0-5 supports huge pfnmap */
> > +	if (index >= VFIO_PCI_ROM_REGION_INDEX)
> > +		goto fallback;
> > +
> > +	/* Bar offset */
> > +	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
> > +	phys_len = PAGE_ALIGN(pci_resource_len(pdev, index));
> > +
> > +	/*
> > +	 * Make sure we at least can get a valid physical address to do the
> > +	 * math.  If this happens, it will probably fail mmap() later..
> > +	 */
> > +	if (req_start >= phys_len)
> > +		goto fallback;
> > +
> > +	phys_len = MIN(phys_len, len);
> > +	/* Calculate the start of physical address to be mapped */
> > +	phys_addr = pci_resource_start(pdev, index) + req_start;
> > +
> > +	/* Choose the alignment */
> > +	if (IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP) && phys_len >= PUD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PUD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	if (phys_len >= PMD_SIZE) {
> > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > +						   flags, PMD_SIZE, 0);
> > +		if (ret)
> > +			return ret;
> 
> Similar to Jason, I wonder if that logic should reside in the core, and we
> only indicate the maximum page table level we support.

I replied.  We can continue the discussion there.

Thanks,

-- 
Peter Xu



* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 19:18       ` Lorenzo Stoakes
@ 2025-06-13 20:34         ` Peter Xu
  2025-06-14  5:58           ` Lorenzo Stoakes
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-13 20:34 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

On Fri, Jun 13, 2025 at 08:18:42PM +0100, Lorenzo Stoakes wrote:
> On Fri, Jun 13, 2025 at 02:45:31PM -0400, Peter Xu wrote:
> > On Fri, Jun 13, 2025 at 04:36:57PM +0100, Lorenzo Stoakes wrote:
> > > On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> > > > This function is pretty handy for any type of VMA to provide a size-aligned
> > > > VMA address when mmap().  Rename the function and export it.
> > >
> > > This isn't a great commit message, 'to provide a size-aligned VMA address when
> > > mmap()' is super unclear - do you mean 'to provide an unmapped address that is
> > > also aligned to the specified size'?
> >
> > I sincerely don't know the difference, not a native speaker here..
> > Suggestions welcomed, I can update to whatever both of us agree on.
> 
> Sure, sorry I don't mean to be pedantic I just think it would be clearer to
> sort of expand upon this, as the commit message is rather short.
> 
> I think saying something like this function allows you to locate an
> unmapped region which is aligned to the specified size should suffice.

I changed the commit message to this:

    This function is pretty handy for locating an unmapped region which is
    aligned to the specified alignment, while also taking pgoff into
    consideration.
    
    Rename the function and export it.  VFIO will be the first candidate to
    reuse this function in follow up patches to calculate mmap() virtual
    addresses for MMIO mappings.
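
To spell out what taking the offset into consideration means: the returned
address should be congruent to the offset modulo the alignment, so that one
huge entry can cover both.  A userspace sketch of that arithmetic follows;
it is my reading of the idea rather than the kernel code, and
align_to_offset() is a hypothetical name:

```c
#include <stdint.h>

/*
 * Given the start of a free range that was searched with `size` extra
 * bytes of padding, return the first address va >= raw such that
 * va % size == off % size.  `size` must be a power of two.
 */
uint64_t align_to_offset(uint64_t raw, uint64_t off, uint64_t size)
{
	/* Distance from raw to the next address congruent to off. */
	uint64_t off_sub = (off - raw) & (size - 1);

	return raw + off_sub;
}
```

Because the adjustment is always smaller than the alignment, the result
stays inside the padded range that was searched.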

> 
> >
> > >
> > > I think you should also specify your motive, renaming and exporting something
> > > because it seems handy isn't sufficient justification.
> > >
> > > Also why would we need to export this? What modules might want to use this? I'm
> > > generally not a huge fan of exporting things unless we strictly have to.
> >
> > It's one of the major reasons why I sent this together with the VFIO
> > patches.  It'll be used in VFIO patches that is in the same series.  I will
> > mention it in the commit message when repost.
> 
> OK cool, I've not dug through those as not my area, really it's about
> having the appropriate justification.
> 
> I'm always inclined to not want us to export things by default, based on
> experience of finding 'unusual' uses of various mm interfaces in drivers in
> the past which have caused problems :)
> 
> But of course there are situations that warrant it, they just need to be
> spelled out.
> 
> >
> > >
> > > >
> > > > About the rename:
> > > >
> > > >   - Dropping "THP" because it doesn't really have much to do with THP
> > > >     internally.
> > >
> > > Well the function seems specifically tailored to the THP use. I think you'll
> > > need to further adjust this.
> >
> > Actually.. it is almost exactly what I need so far.  I can justify it below.
> 
> Yeah, but it's not a general function that gives you an unmapped area that
> is aligned.
> 
> It's a 'function that gets you an aligned unmapped area but only for 64-bit
> kernels and when you are not invoking it from a compat syscall and returns
> 0 instead of errors'.
> 
> This doesn't sound general to me?

I still think it's general.  I think it's a general requirement for any huge
mapping.  For example, I do not want to enable aggressive VA allocations on
32-bit systems, because I know it's easier to exhaust the VA address space
there.  By default, that should apply to every potential user of this
function.

It doesn't always have to be that way: if there's a user that, for example,
wants to keep the calculation but still work on 32 bits, we can provide yet
another helper.  But that's not the case as of now, and I can't think of
such a user.  So I think it's OK to keep this in the helper for all
existing users, including VFIO.

> 
> >
> > >
> > > >
> > > >   - The suffix "_aligned" imply it is a helper to generate aligned virtual
> > > >     address based on what is specified (which can be not PMD_SIZE).
> > >
> > > Ack this is sensible!
> > >
> > > >
> > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > Cc: Dev Jain <dev.jain@arm.com>
> > > > Cc: Barry Song <baohua@kernel.org>
> > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > ---
> > > >  include/linux/huge_mm.h | 14 +++++++++++++-
> > > >  mm/huge_memory.c        |  6 ++++--
> > > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > index 2f190c90192d..706488d92bb6 100644
> > > > --- a/include/linux/huge_mm.h
> > > > +++ b/include/linux/huge_mm.h
> > >
> > > Why are we keeping everything in huge_mm.h, huge_memory.c if this is being made
> > > generic?
> > >
> > > Surely this should be moved out into mm/mmap.c no?
> >
> > No objections, but I suggest a separate discussion and patch submission
> > when the original function resides in huge_memory.c.  Hope it's ok for you.
> 
> I like to be as flexible as I can be in review, but I'm afraid I'm going to
> have to be annoying about this one :)
> 
> It simply makes no sense to have non-THP stuff in 'the THP file'. Also this
> makes this a general memory mapping function that should live with the
> other related code.
> 
> I don't really think much discussion is required here? You could do this as
> 2 separate commits if that'd make life easier?
> 
> Sorry to be a pain here, but I'm really allergic to our having random
> unrelated things in the wrong files, it's something mm has done rather too
> much...

I don't understand why the helper is non-THP.  The alignment so far is
really about huge mappings.  Core mm's HUGE_PFNMAP config option also
depends on THP at least as of now.

# TODO: Allow to be enabled without THP
config ARCH_SUPPORTS_HUGE_PFNMAP
	def_bool n
	depends on TRANSPARENT_HUGEPAGE

> 
> >
> > >
> > > > @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> > > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > > >  		vm_flags_t vm_flags);
> > > > -
> > > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > > > +		unsigned long addr, unsigned long len,
> > > > +		loff_t off, unsigned long flags, unsigned long size,
> > > > +		vm_flags_t vm_flags);
> > >
> > > I echo Jason's comments about a kdoc and explanation of what this function does.
> > >
> > > >  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> > > >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > > >  		unsigned int new_order);
> > > > @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > >  	return 0;
> > > >  }
> > > >
> > > > +static inline unsigned long
> > > > +mm_get_unmapped_area_aligned(struct file *filp,
> > > > +			     unsigned long addr, unsigned long len,
> > > > +			     loff_t off, unsigned long flags, unsigned long size,
> > > > +			     vm_flags_t vm_flags)
> > > > +{
> > > > +	return 0;
> > > > +}
> > > > +
> > > >  static inline bool
> > > >  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> > > >  {
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index 4734de1dc0ae..52f13a70562f 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
> > > >  		folio_test_large_rmappable(folio);
> > > >  }
> > > >
> > > > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > > >  		unsigned long addr, unsigned long len,
> > > >  		loff_t off, unsigned long flags, unsigned long size,
> > > >  		vm_flags_t vm_flags)
> > > > @@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> > > >  	ret += off_sub;
> > > >  	return ret;
> > > >  }
> > > > +EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);
> > >
> > > > I'm not convinced about exporting this... shouldn't we export only if we
> > > explicitly have a user?
> > >
> > > I'd rather we didn't unless we needed to.
> > >
> > > >
> > > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > > > @@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> > > >  	unsigned long ret;
> > > >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> > > >
> > > > -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> > > > +	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
> > > > +					   PMD_SIZE, vm_flags);
> > > >  	if (ret)
> > > >  		return ret;
> > > >
> > > > --
> > > > 2.49.0
> > > >
> > >
> > > So, you don't touch the original function but there's stuff there I think we
> > > need to think about if this is generalised.
> > >
> > > E.g.:
> > >
> > > 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
> > > 		return 0;
> > >
> > > This still valid?
> >
> > Yes.  I want this feature (for VFIO) to not be enabled on 32bits, and not
> > enabled with compat syscalls.
> 
> OK, but then is this a 'general' function any more?
> 
> These checks were introduced by commit 4ef9ad19e176 ("mm: huge_memory:
> don't force huge page alignment on 32 bit") and so are _absolutely
> specifically_ intended for a THP use-case.
> 
> And now they _just happen_ to be useful to you but nothing about the
> function name suggests that this is the case?
> 
> I mean it seems like you should be doing this check separately in both VFIO
> and THP code, and having the 'general' function not do this, no?

I don't understand, sorry.

If this helper only has two users, and both want the same check, shouldn't
we keep the check in the helper rather than duplicating it in the two
callers?

> 
> >
> > >
> > > 	/*
> > > 	 * The failure might be due to length padding. The caller will retry
> > > 	 * without the padding.
> > > 	 */
> > > 	if (IS_ERR_VALUE(ret))
> > > 		return 0;
> > >
> > > This is assuming things the (currently single) caller will do, that is no longer
> > > an assumption you can make, especially if exported.
> >
> > It's part of core function we want from a generic helper.  We want to know
> > when the va allocation, after padded, would fail due to the padding. Then
> > the caller can decide what to do next.  It needs to fail here properly.
> 
> I'm not sure I understand what you mean?
> 
> It's not just this case, it's basically any error condition results in 0.
> 
> It's actually quite dangerous, as the get_unmapped_area() functions are
> meant to return either an error value or the located address _and zero is a
> valid response_.

Not by default, when you didn't change vm.mmap_min_addr. I don't think it's
a good idea to be able to return NULL as a virtual address, unless
extremely necessary.  I don't even know whether Linux can do that now.

OTOH, it's common too so far to use this retval in get_unmapped_area().

Currently, the mm API is defined as:

	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);

Its retval is unsigned long, and errors are detected with IS_ERR_VALUE().
That's the current API across the whole mm, and this function follows it
because it makes retval processing easier when used in THP.  Same for
VFIO, as long as the API doesn't change.

I'm OK if any of us wants to refactor this as a whole, but it'll be great
if you could agree we can do it separately, and also discussed separately.
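
To make the two return conventions under discussion concrete, here is a
minimal userspace sketch, not kernel code.  is_err_value() mirrors the
kernel's IS_ERR_VALUE() test; lookup_aligned(), lookup_plain() and
lookup_with_fallback() are hypothetical stand-ins with made-up addresses,
used only to show how "0 means no aligned area" interacts with the
-errno encoding:

```c
#include <errno.h>
#include <stdbool.h>

/* Userspace copy of the kernel convention: the top 4095 values are errors. */
#define MAX_ERRNO 4095UL

static bool is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical aligned helper: returns 0 when it cannot align. */
static unsigned long lookup_aligned(bool fail)
{
	return fail ? 0 : 0x40200000UL;
}

/* Hypothetical plain allocator: errors are encoded as -errno. */
static unsigned long lookup_plain(bool fail)
{
	return fail ? (unsigned long)-ENOMEM : 0x40123000UL;
}

/* Caller pattern from the thread: retry unaligned when the helper returns 0. */
static unsigned long lookup_with_fallback(bool aligned_fails, bool plain_fails)
{
	unsigned long ret = lookup_aligned(aligned_fails);

	if (ret)
		return ret;
	return lookup_plain(plain_fails);
}
```

Note that is_err_value(0) is false, so a caller that only checks
IS_ERR_VALUE() on the helper's result would treat 0 as an address.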

> 
> So if somebody used this function naively, they'd potentially have a very
> nasty bug occur when an error arose.
> 
> If you want to export this, I just don't think we can have this be a thing
> here.
> 
> >
> > >
> > > Actually you maybe want to abstract the whole of thp_get_unmapped_area_vmflags()
> > > no? As this has a fallback mode?
> > >
> > > 	/*
> > > 	 * Do not try to align to THP boundary if allocation at the address
> > > 	 * hint succeeds.
> > > 	 */
> > > 	if (ret == addr)
> > > 		return addr;
> >
> > This is not a fallback. This is when user specified a hint address (no
> > matter with / without MAP_FIXED), if that address works then we should
> > reuse that address, ignoring the alignment requirement from the driver.
> > This is exactly the behavior VFIO needs, and this should also be the
> > suggested behavior for whatever new drivers that would like to start using
> > this generic helper.
> 
> I didn't say this was the fallback :) this just happened to be the code
> underneath my comment. Sorry if that wasn't clear.
> 
> This is another kinda non-general thing but one that makes more sense. This
> comment needs updating, however, obviously. You could just delete 'THP' in
> the comment that'd probably do it.

Yes, the THP word does not apply anymore.   I'll change it, thanks for
pointing this out.

> 
> The fallback is in:
> 
> unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> 		unsigned long len, unsigned long pgoff, unsigned long flags,
> 		vm_flags_t vm_flags)
> {
> 	unsigned long ret;
> 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> 
> 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> 	if (ret)
> 		return ret;
> 
> So here, if ret returns an address, then it's fine we return that.
> 
> Otherwise, we invoke the below (the fallback):
> 
> 	return mm_get_unmapped_area_vmflags(current->mm, filp, addr, len, pgoff, flags,
> 					    vm_flags);
> }
> 
> >
> > >
> > > What was that about this no longer being relevant to THP? :>)
> > >
> > > Are all of these 'return 0' cases expected by any sensible caller? It seems like
> > > it's a way for thp_get_unmapped_area_vmflags() to recognise when to fall back to
> > > non-aligned?
> >
> > Hope above justifies everything.  It's my intention to reuse everything
> > here.  If you have any concern on any of the "return 0" cases in the
> > function being exported, please shoot, we can discuss.
> 
> Of course, I have some doubts here :)
> 
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >
> 
> To be clearer perhaps, what I think would work here is:
> 
> 1. Remove the CONFIG_64BIT, in_compat_syscall() check and place it in THP
>    and VFIO code separately, as this isn't a general thing.

Commented above.  I still think it should be kept until we have a valid use
case to not enable it.

> 
> 2. Rather than return 0 in this function, return error codes so it matches
>    the other mm_get_unmapped_area_*() functions.

Commented above.

> 
> 3. Adjust thp_get_unmapped_area_vmflags() to detect the error value from
>    this function and do the fallback logic in this case. There's no need
>    for this 0 stuff (and it's possibly broken actually, since _in theory_
>    you can get unmapped zero).

Please see the discussion in the other thread, where I replied to Jason to
explain why the fallback might not be what the user always wants.

For example, the last patch does try 1G first and if it fails somehow it'll
try 2M.  It doesn't want to fallback to 4K when 1G alloc fails.
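
As a rough illustration of that ordering (a userspace sketch, not the
VFIO patch itself): pick_aligned() and the fake_alloc_*() callbacks below
are hypothetical names, and the callbacks return made-up addresses.  The
point is only that the caller walks the alignments from largest to
smallest and gets 0 back when none worked, so it can decide on the
unaligned fallback itself:

```c
#include <stddef.h>

#define SZ_2M (2UL << 20)
#define SZ_1G (1UL << 30)

/* Hypothetical allocator: returns an address, or 0 if no such aligned area. */
typedef unsigned long (*aligned_alloc_fn)(unsigned long len, unsigned long align);

/* Try 1G first, then 2M; never silently degrade to small pages in here. */
static unsigned long pick_aligned(aligned_alloc_fn alloc, unsigned long len)
{
	static const unsigned long aligns[] = { SZ_1G, SZ_2M };
	size_t i;

	for (i = 0; i < sizeof(aligns) / sizeof(aligns[0]); i++) {
		unsigned long ret;

		if (len < aligns[i])
			continue;	/* region too small for this size */
		ret = alloc(len, aligns[i]);
		if (ret)
			return ret;
	}
	return 0;	/* caller decides whether to retry unaligned */
}

/* Fake allocators for demonstration only. */
static unsigned long fake_alloc_ok(unsigned long len, unsigned long align)
{
	(void)len; (void)align;
	return 0x40000000UL;	/* aligned to both 1G and 2M */
}

static unsigned long fake_alloc_no_1g(unsigned long len, unsigned long align)
{
	(void)len;
	return align == SZ_1G ? 0 : 0x40200000UL;
}

static unsigned long fake_alloc_none(unsigned long len, unsigned long align)
{
	(void)len; (void)align;
	return 0;
}
```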

> 
> 4. (sorry :) move the code to mm/mmap.c

Commented above.  Note: I'm not saying it _can't_ be moved out, but it
still makes sense to me to be in huge_memory.c.

> 
> 5. Obviously address comments from others, most importantly (in my view)
>    ensuring that there is a good kernel doc comment around the function.
> 
> 6. Put the justification for exporting the function + stuff about VFIO in
>    the commit message + expand it a little bit as discussed.

Please check if above version works for you.

> 
> 7. Other small stuff raised above (e.g. remove 'THP' comment etc.)

I'll do this.

> 
> Again, sorry to be a pain, but I think we need to be careful to get this
> right so we don't leave any footguns for ourselves in the future with
> 'implicit' stuff.
> 
> Thanks!
> 

Thanks,

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 19:15         ` Peter Xu
@ 2025-06-13 23:16           ` Jason Gunthorpe
  2025-06-16 22:06             ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-13 23:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 03:15:19PM -0400, Peter Xu wrote:
> > > > > +	if (phys_len >= PMD_SIZE) {
> > > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > > +						   flags, PMD_SIZE, 0);
> > > > > +		if (ret)
> > > > > +			return ret;
> > > > > +	}
> > > > 
> > > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > > contiguity to get a 32*32M=1GB option.
> > > > 
> > > > Forcing to only align to the PMD or PUD seems suboptimal..
> > > 
> > > Right, however the cont-pte / cont-pmd are still not supported in huge
> > > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > > from ARM perspective, then provide support of both in one shot.
> > 
> > Maybe leave behind a comment about this. I've been poking around if
> > someone would do the ARM PFNMAP support but can't report any commitment.
> 
> I didn't know what's the best part to take a note for the whole pfnmap
> effort, but I added a note into the commit message on this patch:
> 
>         Note 2: Currently contiguous pgtable entries (for example, cont-pte) are not
>         yet supported for huge pfnmaps in general.  They are also not considered in
>         this patch so far.  Separate work will be needed to enable contiguous
>         pgtable entries on archs that support them.
> 
> > 
> > > > > +fallback:
> > > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > > 
> > > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > > thp_get_unmapped_area_vmflags() too?
> > > > 
> > > > Is there any reason the caller should have to do a retry?
> > > 
> > > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > > as a generic helper.
> > 
> > There is only one caller for thp_get_unmapped_area_vmflags(), just
> > open code PMD_SIZE there and thin this whole thing out. It reads
> > better like that anyhow:
> > 
> > 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> > 		   && !addr /* no hint */
> > 		   && IS_ALIGNED(len, PMD_SIZE)) {
> > 		/* Ensures that larger anonymous mappings are THP aligned. */
> > 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> > 						    flags, vm_flags, PMD_SIZE);
> > 
> > > That was ok, however that loses some flexibility when the caller wants to
> > > try with different alignments, exactly like above: currently, it was trying
> > > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> > 
> > Oh, that's a good point, I didn't notice that subtle bit.
> > 
> > But then maybe that is showing the API is just wrong and the core code
> > should be trying to find the best alignment not the caller. Like we
> > can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> > 
> > VFIO would just pass the BAR size, implying the best alignment, and
> > the core implementation will try to get the largest VMA alignment that
> > snaps to an arch supported page contiguity, testing each of the arches
> > page size possibilities in turn.
> > 
> > That sounds like a much better API than pushing this into drivers??
> 
> Yes it would be nice if the core mm can evolve to make supporting such
> easier.  Though the question is how to pass information over to core mm.

I was just thinking something simple, change how your new 
mm_get_unmapped_area_aligned() works so that the caller is expected to
pass in the size of the biggest folio/pfn page in as
align.

The mm_get_unmapped_area_aligned() returns a vm address that
will result in large mappings.

pgoff works the same way, the assumption is the biggest folio is at
pgoff 0 and followed by another biggest folio so the pgoff logic tries
to make the second folio map fully.

ie what a hugetlb fd or thp memfd would like.

Then you still hook the file operations and still figure out what BAR
and so on to call mm_get_unmapped_area_aligned() with the correct
aligned parameter.

mm_get_unmapped_area_aligned() goes through the supported page sizes
of the arch and selects the best one for the indicated biggest folio

If we were happy writing this in vfio then it can work just as well in
the core mm side.
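
A rough userspace sketch of that selection step, under stated
assumptions: demo_arch_sizes[] is an illustrative table (the sizes a 4k
arm64 system could map with contiguity, per the numbers earlier in this
thread), and best_mapping_size() is a hypothetical name, not an existing
mm function.  It only shows "walk the arch sizes, largest first, and take
the first one that fits both the biggest folio and the mapping length":

```c
#include <stddef.h>

#define SZ_64K (64UL << 10)
#define SZ_2M  (2UL << 20)
#define SZ_32M (32UL << 20)
#define SZ_1G  (1UL << 30)

/* Illustrative only: mapping sizes an arch might support, largest first. */
static const unsigned long demo_arch_sizes[] = { SZ_1G, SZ_32M, SZ_2M, SZ_64K };

/*
 * Return the largest supported size that fits both the biggest folio the
 * caller can provide and the length being mapped, or 0 if none fits.
 */
static unsigned long best_mapping_size(unsigned long biggest_folio,
				       unsigned long len)
{
	size_t i;

	for (i = 0; i < sizeof(demo_arch_sizes) / sizeof(demo_arch_sizes[0]); i++) {
		unsigned long sz = demo_arch_sizes[i];

		if (sz <= biggest_folio && sz <= len)
			return sz;
	}
	return 0;
}
```

With a table like this the driver would only pass the BAR size; the
per-arch knowledge stays in one place.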

> It's similar to many other use cases of get_unmapped_area() users.  For
> example, see v4l2_m2m_get_unmapped_area() which has similar treatment on at
> least knowing which part of the file was being mapped:
> 
> 	if (offset < DST_QUEUE_OFF_BASE) {
> 		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
> 	} else {
> 		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
> 		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
> 	}

Careful, that's only used for nommu :)

Jason


* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
                     ` (2 preceding siblings ...)
  2025-06-13 15:13   ` Zi Yan
@ 2025-06-14  4:11   ` Liam R. Howlett
  2025-06-17 21:07     ` Peter Xu
  3 siblings, 1 reply; 77+ messages in thread
From: Liam R. Howlett @ 2025-06-14  4:11 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Huacai Chen, Thomas Bogendoerfer, Muchun Song,
	Oscar Salvador, loongarch, linux-mips

* Peter Xu <peterx@redhat.com> [691231 23:00]:
> Only mips and loongarch implemented this API, however what it does was
> checking against stack overflow for either len or addr.  That's already
> done in arch's arch_get_unmapped_area*() functions, hence not needed.

I'm not as confident..

> 
> It means the whole API is pretty much obsolete at least now, remove it
> completely.
> 
> Cc: Huacai Chen <chenhuacai@kernel.org>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: Muchun Song <muchun.song@linux.dev>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: loongarch@lists.linux.dev
> Cc: linux-mips@vger.kernel.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  arch/loongarch/include/asm/hugetlb.h | 14 --------------
>  arch/mips/include/asm/hugetlb.h      | 14 --------------
>  fs/hugetlbfs/inode.c                 |  8 ++------
>  include/asm-generic/hugetlb.h        |  8 --------
>  include/linux/hugetlb.h              |  6 ------
>  5 files changed, 2 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
> index 4dc4b3e04225..ab68b594f889 100644
> --- a/arch/loongarch/include/asm/hugetlb.h
> +++ b/arch/loongarch/include/asm/hugetlb.h
> @@ -10,20 +10,6 @@
>  
>  uint64_t pmd_to_entrylo(unsigned long pmd_val);
>  
> -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> -static inline int prepare_hugepage_range(struct file *file,
> -					 unsigned long addr,
> -					 unsigned long len)
> -{
> -	unsigned long task_size = STACK_TOP;
> -
> -	if (len > task_size)
> -		return -ENOMEM;
> -	if (task_size - len < addr)
> -		return -EINVAL;
> -	return 0;
> -}
> -
>  #define __HAVE_ARCH_HUGE_PTE_CLEAR
>  static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
>  				  pte_t *ptep, unsigned long sz)
> diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
> index fbc71ddcf0f6..8c460ce01ffe 100644
> --- a/arch/mips/include/asm/hugetlb.h
> +++ b/arch/mips/include/asm/hugetlb.h
> @@ -11,20 +11,6 @@
>  
>  #include <asm/page.h>
>  
> -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> -static inline int prepare_hugepage_range(struct file *file,
> -					 unsigned long addr,
> -					 unsigned long len)
> -{
> -	unsigned long task_size = STACK_TOP;

arch/mips/include/asm/processor.h:#define STACK_TOP             mips_stack_top()


unsigned long mips_stack_top(void)
{
        unsigned long top = TASK_SIZE & PAGE_MASK;

        if (IS_ENABLED(CONFIG_MIPS_FP_SUPPORT)) {
                /* One page for branch delay slot "emulation" */
                top -= PAGE_SIZE;
        }

        /* Space for the VDSO, data page & GIC user page */
        top -= PAGE_ALIGN(current->thread.abi->vdso->size);
        top -= PAGE_SIZE;
        top -= mips_gic_present() ? PAGE_SIZE : 0;

        /* Space for cache colour alignment */
        if (cpu_has_dc_aliases)
                top -= shm_align_mask + 1;

        /* Space to randomize the VDSO base */
        if (current->flags & PF_RANDOMIZE)
                top -= VDSO_RANDOMIZE_SIZE;

        return top;
}

This seems different than TASK_SIZE.

Code is from:
commit ea7e0480a4b695d0aa6b3fa99bd658a003122113
Author: Paul Burton <paulburton@kernel.org>
Date:   Tue Sep 25 15:51:26 2018 -0700


> -	if (len > task_size)
> -		return -ENOMEM;
> -	if (task_size - len < addr)
> -		return -EINVAL;
> -	return 0;
> -}
> -

Unfortunately, the commit message for the addition of this code is not
helpful.

commit 50a41ff292fafe1e937102be23464b54fed8b78c
Author: David Daney <ddaney@caviumnetworks.com>
Date:   Wed May 27 17:47:42 2009 -0700

... But the dates are helpful.  This code used to use:
#define STACK_TOP      ((TASK_SIZE & PAGE_MASK) - PAGE_SIZE)

It's not exactly task size either.

I don't think this is an issue to remove this check because the overflow
should be caught later (or trigger the opposite search).  But it's not
clear why STACK_TOP was done in the first place.. Maybe just because we
know the overflow here would be an issue later, but then we'd avoid the
opposite search - and maybe that's the point?

Either way, your comment about the same check existing doesn't seem
correct.
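
To illustrate the difference, a small userspace sketch.  range_check()
is the body of the removed helper with the limit turned into a
parameter; DEMO_TASK_SIZE and DEMO_PAGE_SIZE are made-up values, and
DEMO_STACK_TOP follows the old "(TASK_SIZE & PAGE_MASK) - PAGE_SIZE"
definition quoted above.  None of this is the real mips configuration:

```c
#include <errno.h>

/* Made-up values for illustration only. */
#define DEMO_PAGE_SIZE	0x1000UL
#define DEMO_PAGE_MASK	(~(DEMO_PAGE_SIZE - 1))
#define DEMO_TASK_SIZE	0x80000000UL
#define DEMO_STACK_TOP	((DEMO_TASK_SIZE & DEMO_PAGE_MASK) - DEMO_PAGE_SIZE)

/* The removed prepare_hugepage_range() logic, with the limit as a parameter. */
static int range_check(unsigned long limit, unsigned long addr,
		       unsigned long len)
{
	if (len > limit)
		return -ENOMEM;
	if (limit - len < addr)
		return -EINVAL;
	return 0;
}
```

A 2M range ending exactly at DEMO_TASK_SIZE passes the check against
DEMO_TASK_SIZE but is rejected against DEMO_STACK_TOP, so the two limits
do not reject the same set of ranges.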

I haven't checked loongarch, but I'd be willing to wager this was just
cloned mips code... because this happens so much.

...

Thanks,
Liam


* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
                     ` (2 preceding siblings ...)
  2025-06-13 15:36   ` Lorenzo Stoakes
@ 2025-06-14  5:23   ` Liam R. Howlett
  2025-06-16 12:14     ` Jason Gunthorpe
  3 siblings, 1 reply; 77+ messages in thread
From: Liam R. Howlett @ 2025-06-14  5:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Lorenzo Stoakes, Ryan Roberts, Dev Jain,
	Barry Song

* Peter Xu <peterx@redhat.com> [250613 09:41]:
> This function is pretty handy for any type of VMA to provide a size-aligned
> VMA address when mmap().  Rename the function and export it.
> 
> About the rename:
> 
>   - Dropping "THP" because it doesn't really have much to do with THP
>     internally.
> 
>   - The suffix "_aligned" implies it is a helper to generate aligned virtual
>     address based on what is specified (which can be not PMD_SIZE).

I am not okay with this.

You are renaming a function to drop thp and not moving it into generic
code.  Either it is a generic function that lives with the generic code,
or drop this change altogether.

If this function is going to be generic, please make the return of 0
valid and not an error.  You are masking all errors to 0 currently.

vm_unmapped_area_info has an align_mask, and that's only used for
hugepages. It is wrong to have a generic function that does not use the
generic struct element that exists for this reason.  Is there a reason
that align_mask doesn't work, or why it's not used?

The return of mm_get_unmapped_area_vmflags() is not aligned with the
return of this function.  That is, the address returned from
mm_get_unmapped_area_vmflags() differs from __thp_get_unmapped_area()
based on MMF_TOPDOWN, and/or something related to off_sub?

Anyways, since it's different from mm_get_unmapped_area() in this
regard, we cannot rename it mm_get_unmapped_area_aligned() - it sounds
like a helper and is different, by a lot.

I am also not okay with exporting it for no reason.

Also, is it okay to export something as gpl or does the copyright holder
need to do that (I have no idea about this stuff, or maybe you work for
the copyright holder)?

The hint (addr) is also never checked for alignment in this function and
we are appending _aligned() to the name.  With this change we can now
get an unaligned _aligned() address.  This (probably) can happen with
MAP_FIXED today, but I don't think we imply it's going to be aligned
elsewhere.

Hate for the pre-existing code in this function:

Dear lord:
off_sub = (off - ret) & (size - 1);
Using ret here is just (to be polite) unclear to save one assignment.  I
expect the one assignment would be optimised out by the compilers.

My hate for the unsub off_sub grows:
ret += off_sub;
return ret;

It is extremely frustrating that the self-documenting parts of this
function have documentation while poorly named variables are used in
puzzling calculations without any.
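
For what it's worth, here is how I read that arithmetic, as a userspace
sketch with named intermediates.  align_to_offset() and its variable
names are mine (hypothetical), `align` is assumed to be a power of two,
and everything else the real function does (the 64-bit and compat
checks, the hint check, the top-down case) is left out:

```c
/*
 * Given the start of a padded free range and the file offset being
 * mapped, return the first address >= padded_start whose offset inside
 * an align-sized block matches that of file_off.
 * Equivalent to: padded_start + ((file_off - padded_start) & (align - 1)).
 */
static unsigned long align_to_offset(unsigned long padded_start,
				     unsigned long long file_off,
				     unsigned long align)
{
	unsigned long mask = align - 1;
	/* Where inside an aligned block the mapping should start. */
	unsigned long want = (unsigned long)file_off & mask;
	/* Where inside an aligned block the padded range starts. */
	unsigned long have = padded_start & mask;
	/* Distance to move forward so the two agree (wraps modulo align). */
	unsigned long shift = (want - have) & mask;

	return padded_start + shift;
}
```

The shift is always smaller than align, which is why the padded length
is len + align.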

...

Thanks,
Liam


* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-13 20:34         ` Peter Xu
@ 2025-06-14  5:58           ` Lorenzo Stoakes
  0 siblings, 0 replies; 77+ messages in thread
From: Lorenzo Stoakes @ 2025-06-14  5:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache, Baolin Wang, Liam R. Howlett, Ryan Roberts, Dev Jain,
	Barry Song

Peter - I think the problem we're having here is that you're making this a
_general_ and _exported_ function that must live up to the standards of such a
function.

But at the same time, for convenience, want it to happen to do what's convenient
for VFIO and THP.

To stop us going around in circles -

I won't accept this patch if it:

a. Returns 0 on errors,
b. Does stuff that is specific to THP, etc.
c. Is located in mm/huge_memory.c when you're saying it's general.

This isn't a case of nits to be sorted out later, these are fundamental issues
that prevent your series from being mergeable.

On Fri, Jun 13, 2025 at 04:34:37PM -0400, Peter Xu wrote:
> On Fri, Jun 13, 2025 at 08:18:42PM +0100, Lorenzo Stoakes wrote:
> > On Fri, Jun 13, 2025 at 02:45:31PM -0400, Peter Xu wrote:
> > > On Fri, Jun 13, 2025 at 04:36:57PM +0100, Lorenzo Stoakes wrote:
> > > > On Fri, Jun 13, 2025 at 09:41:09AM -0400, Peter Xu wrote:
> > > > > This function is pretty handy for any type of VMA to provide a size-aligned
> > > > > VMA address when mmap().  Rename the function and export it.
> > > >
> > > > This isn't a great commit message, 'to provide a size-aligned VMA address when
> > > > mmap()' is super unclear - do you mean 'to provide an unmapped address that is
> > > > also aligned to the specified size'?
> > >
> > > I sincerely don't know the difference, not a native speaker here..
> > > Suggestions welcomed, I can update to whatever both of us agree on.
> >
> > Sure, sorry I don't mean to be pedantic I just think it would be clearer to
> > sort of expand upon this, as the commit message is rather short.
> >
> > I think saying something like this function allows you to locate an
> > unmapped region which is aligned to the specified size should suffice.
>
> I changed the commit message to this:
>
>     This function is pretty handy to locate an unmapped region which is aligned
>     to the specified alignment, meanwhile taking pgoff into considerations.
>
>     Rename the function and export it.  VFIO will be the first candidate to
>     reuse this function in follow up patches to calculate mmap() virtual
>     addresses for MMIO mappings.

This is better but doesn't describe what this function does as you're doing
unusual things.

>
> >
> > >
> > > >
> > > > I think you should also specify your motive, renaming and exporting something
> > > > because it seems handy isn't sufficient justification.
> > > >
> > > > Also why would we need to export this? What modules might want to use this? I'm
> > > > generally not a huge fan of exporting things unless we strictly have to.
> > >
> > > It's one of the major reasons why I sent this together with the VFIO
> > > patches.  It'll be used in VFIO patches that is in the same series.  I will
> > > mention it in the commit message when repost.
> >
> > OK cool, I've not dug through those as not my area, really it's about
> > having the appropriate justification.
> >
> > I'm always inclined to not want us to export things by default, based on
> > experience of finding 'unusual' uses of various mm interfaces in drivers in
> > the past which have caused problems :)
> >
> > But of course there are situations that warrant it, they just need to be
> > spelled out.
> >
> > >
> > > >
> > > > >
> > > > > About the rename:
> > > > >
> > > > >   - Dropping "THP" because it doesn't really have much to do with THP
> > > > >     internally.
> > > >
> > > > Well the function seems specifically tailored to the THP use. I think you'll
> > > > need to further adjust this.
> > >
> > > Actually.. it is almost exactly what I need so far.  I can justify it below.
> >
> > Yeah, but it's not a general function that gives you an unmapped area that
> > is aligned.
> >
> > It's a 'function that gets you an aligned unmapped area but only for 64-bit
> > kernels and when you are not invoking it from a compat syscall and returns
> > 0 instead of errors'.
> >
> > This doesn't sound general to me?
>
> I still think it's general.  I think it's a general request for any huge
> mappings.  For example, I do not want to enable aggressive VA allocations
> on 32 bits systems because I know it's easier to get overloaded VA address
> space with 32 bits.  It should also apply to all potential users whoever
> wants to use this function by default.

This is a stretch, you're now assuming alignment must be large enough to be a
problem on 32-bit systems, and you've not mentioned compat syscalls _at all_
here.

Commit 4ef9ad19e176 ("mm: huge_memory: don't force huge page alignment on 32
bit") is what introduced this. It literally references issues encountered in
THP.

>
> I don't think it always needs to do so, if there's a user that, for
> example, want to keep the calculation but still work on 32 bits, we can
> provide yet another helper.  But it's not the case as of now, and I can't
> think of such user.  In this case, I think it's OK we keep this in the
> helper for all existing users, including VFIO.

It's not OK, sorry.

>
> >
> > >
> > > >
> > > > >
> > > > >   - The suffix "_aligned" implies it is a helper to generate aligned virtual
> > > > >     address based on what is specified (which can be not PMD_SIZE).
> > > >
> > > > Ack this is sensible!
> > > >
> > > > >
> > > > > Cc: Zi Yan <ziy@nvidia.com>
> > > > > Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > > Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> > > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > > Cc: Dev Jain <dev.jain@arm.com>
> > > > > Cc: Barry Song <baohua@kernel.org>
> > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > ---
> > > > >  include/linux/huge_mm.h | 14 +++++++++++++-
> > > > >  mm/huge_memory.c        |  6 ++++--
> > > > >  2 files changed, 17 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > > > index 2f190c90192d..706488d92bb6 100644
> > > > > --- a/include/linux/huge_mm.h
> > > > > +++ b/include/linux/huge_mm.h
> > > >
> > > > Why are we keeping everything in huge_mm.h, huge_memory.c if this is being made
> > > > generic?
> > > >
> > > > Surely this should be moved out into mm/mmap.c no?
> > >
> > > No objections, but I suggest a separate discussion and patch submission
> > > when the original function resides in huge_memory.c.  Hope it's ok for you.
> >
> > I like to be as flexible as I can be in review, but I'm afraid I'm going to
> > have to be annoying about this one :)
> >
> > It simply makes no sense to have non-THP stuff in 'the THP file'. Also this
> > makes this a general memory mapping function that should live with the
> > other related code.
> >
> > I don't really think much discussion is required here? You could do this as
> > 2 separate commits if that'd make life easier?
> >
> > Sorry to be a pain here, but I'm really allergic to our having random
> > unrelated things in the wrong files, it's something mm has done rather too
> > much...
>
> I don't understand why the helper is non-THP.  The alignment so far is
> really about huge mappings.  Core mm's HUGE_PFNMAP config option also
> depends on THP at least as of now.
>
> # TODO: Allow to be enabled without THP
> config ARCH_SUPPORTS_HUGE_PFNMAP
> 	def_bool n
> 	depends on TRANSPARENT_HUGEPAGE

I really don't understand what your point is? You're naming this
mm_get_unmapped_area_aligned()? No reference to THP or VFIO?

>
> >
> > >
> > > >
> > > > > @@ -339,7 +339,10 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> > > > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > > > >  		vm_flags_t vm_flags);
> > > > > -
> > > > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > > > > +		unsigned long addr, unsigned long len,
> > > > > +		loff_t off, unsigned long flags, unsigned long size,
> > > > > +		vm_flags_t vm_flags);
> > > >
> > > > I echo Jason's comments about a kdoc and explanation of what this function does.
> > > >
> > > > >  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> > > > >  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > > > >  		unsigned int new_order);
> > > > > @@ -543,6 +546,15 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > > >  	return 0;
> > > > >  }
> > > > >
> > > > > +static inline unsigned long
> > > > > +mm_get_unmapped_area_aligned(struct file *filp,
> > > > > +			     unsigned long addr, unsigned long len,
> > > > > +			     loff_t off, unsigned long flags, unsigned long size,
> > > > > +			     vm_flags_t vm_flags)
> > > > > +{
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > >  static inline bool
> > > > >  can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> > > > >  {
> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > > index 4734de1dc0ae..52f13a70562f 100644
> > > > > --- a/mm/huge_memory.c
> > > > > +++ b/mm/huge_memory.c
> > > > > @@ -1088,7 +1088,7 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
> > > > >  		folio_test_large_rmappable(folio);
> > > > >  }
> > > > >
> > > > > -static unsigned long __thp_get_unmapped_area(struct file *filp,
> > > > > +unsigned long mm_get_unmapped_area_aligned(struct file *filp,
> > > > >  		unsigned long addr, unsigned long len,
> > > > >  		loff_t off, unsigned long flags, unsigned long size,
> > > > >  		vm_flags_t vm_flags)
> > > > > @@ -1132,6 +1132,7 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
> > > > >  	ret += off_sub;
> > > > >  	return ret;
> > > > >  }
> > > > > +EXPORT_SYMBOL_GPL(mm_get_unmapped_area_aligned);
> > > >
> > > > > I'm not convinced about exporting this... shouldn't we export only if we
> > > > > explicitly have a user?
> > > >
> > > > I'd rather we didn't unless we needed to.
> > > >
> > > > >
> > > > >  unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > > > >  		unsigned long len, unsigned long pgoff, unsigned long flags,
> > > > > @@ -1140,7 +1141,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
> > > > >  	unsigned long ret;
> > > > >  	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> > > > >
> > > > > -	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> > > > > +	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
> > > > > +					   PMD_SIZE, vm_flags);
> > > > >  	if (ret)
> > > > >  		return ret;
> > > > >
> > > > > --
> > > > > 2.49.0
> > > > >
> > > >
> > > > So, you don't touch the original function but there's stuff there I think we
> > > > need to think about if this is generalised.
> > > >
> > > > E.g.:
> > > >
> > > > 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
> > > > 		return 0;
> > > >
> > > > This still valid?
> > >
> > > Yes.  I want this feature (for VFIO) to not be enabled on 32bits, and not
> > > > enabled with compat syscalls.
> >
> > OK, but then is this a 'general' function any more?
> >
> > These checks were introduced by commit 4ef9ad19e176 ("mm: huge_memory:
> > don't force huge page alignment on 32 bit") and so are _absolutely
> > specifically_ intended for a THP use-case.
> >
> > And now they _just happen_ to be useful to you but nothing about the
> > function name suggests that this is the case?
> >
> > I mean it seems like you should be doing this check separately in both VFIO
> > > and THP code and having the 'general' function not do this, no?
>
> I don't understand, sorry.
>
> If this helper only has two users, the two users want the same check,
> shouldn't we keep the check in the helper, rather than duplicating in the
> two callers?

Because you're making this an exported 'general' function with '_aligned' in the
suffix and your whole patch is about how it's general.

The problem is somebody will use this function thinking it is general, then find
out it's not general it's a 'de-duplicate VFIO and THP' function.

>
> >
> > >
> > > >
> > > > 	/*
> > > > 	 * The failure might be due to length padding. The caller will retry
> > > > 	 * without the padding.
> > > > 	 */
> > > > 	if (IS_ERR_VALUE(ret))
> > > > 		return 0;
> > > >
> > > > This is assuming things the (currently single) caller will do, that is no longer
> > > > an assumption you can make, especially if exported.
> > >
> > > It's part of core function we want from a generic helper.  We want to know
> > > when the va allocation, after padded, would fail due to the padding. Then
> > > the caller can decide what to do next.  It needs to fail here properly.
> >
> > > I'm not sure I understand what you mean?
> >
> > It's not just this case, it's basically any error condition results in 0.
> >
> > It's actually quite dangerous, as the get_unmapped_area() functions are
> > meant to return either an error value or the located address _and zero is a
> > valid response_.
>
> Not by default, when you didn't change vm.mmap_min_addr. I don't think it's
> a good idea to be able to return NULL as a virtual address, unless
> extremely necessary.  I don't even know whether Linux can do that now.

It can afaik.

>
> OTOH, it's common too so far to use this retval in get_unmapped_area().
>
> Currently, the mm API is defined as:
>
> 	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
>
> Its retval is unsigned long, and its error is returned by IS_ERR_VALUE().
> That's the current API across the whole mm, and that's why this function
> does it because when used in THP it's easier for retval processing.  Same
> to VFIO, as long as the API didn't change.

This sounds like you agree with me that you're fundamentally breaking this
contract for convenience?

This isn't acceptable for a general function.

>
> I'm OK if any of us wants to refactor this as a whole, but it'll be great
> if you could agree we can do it separately, and also discussed separately.

Sorry these are _fundamental_ issues, not nits or niceties that can be followed
up on.

>
> >
> > So if somebody used this function naively, they'd potentially have a very
> > nasty bug occur when an error arose.
> >
> > If you want to export this, I just don't think we can have this be a thing
> > here.
> >
> > >
> > > >
> > > > Actually you maybe want to abstract the whole of thp_get_unmapped_area_vmflags()
> > > > no? As this has a fallback mode?
> > > >
> > > > 	/*
> > > > 	 * Do not try to align to THP boundary if allocation at the address
> > > > 	 * hint succeeds.
> > > > 	 */
> > > > 	if (ret == addr)
> > > > 		return addr;
> > >
> > > This is not a fallback. This is when user specified a hint address (no
> > > matter with / without MAP_FIXED), if that address works then we should
> > > reuse that address, ignoring the alignment requirement from the driver.
> > > This is exactly the behavior VFIO needs, and this should also be the
> > > suggested behavior for whatever new drivers that would like to start using
> > > this generic helper.
> >
> > I didn't say this was the fallback :) this just happened to be the code
> > underneath my comment. Sorry if that wasn't clear.
> >
> > This is another kinda non-general thing but one that makes more sense. This
> > comment needs updating, however, obviously. You could just delete 'THP' in
> > > the comment, that'd probably do it.
>
> Yes, the THP word does not apply anymore.   I'll change it, thanks for
> pointing this out.

Thanks.

>
> >
> > The fallback is in:
> >
> > unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
> > 		unsigned long len, unsigned long pgoff, unsigned long flags,
> > 		vm_flags_t vm_flags)
> > {
> > 	unsigned long ret;
> > 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
> >
> > 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
> > 	if (ret)
> > 		return ret;
> >
> > So here, if ret returns an address, then it's fine we return that.
> >
> > Otherwise, we invoke the below (the fallback):
> >
> > 	return mm_get_unmapped_area_vmflags(current->mm, filp, addr, len, pgoff, flags,
> > 					    vm_flags);
> > }
> >
> > >
> > > >
> > > > What was that about this no longer being relevant to THP? :>)
> > > >
> > > > Are all of these 'return 0' cases expected by any sensible caller? It seems like
> > > > it's a way for thp_get_unmapped_area_vmflags() to recognise when to fall back to
> > > > non-aligned?
> > >
> > > > Hope above justifies everything.  It's my intention to reuse everything
> > > here.  If you have any concern on any of the "return 0" cases in the
> > > function being exported, please shoot, we can discuss.
> >
> > Of course, I have some doubts here :)
> >
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> > >
> >
> > To be clearer perhaps, what I think would work here is:
> >
> > 1. Remove the CONFIG_64BIT, in_compat_syscall() check and place it in THP
> >    and VFIO code separately, as this isn't a general thing.
>
> Commented above.  I still think it should be kept until we have a valid use
> case to not enable it.

No, this isn't acceptable sorry. I won't accept the patch as-is with this in
place.

>
> >
> > 2. Rather than return 0 in this function, return error codes so it matches
> >    the other mm_get_unmapped_area_*() functions.
>
> Commented above.

No, this isn't acceptable sorry. I won't accept the patch as-is with this in
place.

>
> >
> > 3. Adjust thp_get_unmapped_area_vmflags() to detect the error value from
> >    this function and do the fallback logic in this case. There's no need
> >    for this 0 stuff (and it's possibly broken actually, since _in theory_
> >    you can get unmapped zero).
>
> Please see the discussion in the other thread, where I replied to Jason to
> explain why the fallback might not be what the user always wants.
>
> For example, the last patch does try 1G first and if it fails somehow it'll
> try 2M.  It doesn't want to fallback to 4K when 1G alloc fails.

You're misunderstanding me.

I said adjust the THP code to do the fallback. To be super clear I meant:

Change:

 	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
 	if (ret)
 		return ret;

 	return mm_get_unmapped_area_vmflags(current->mm, filp, addr, len, pgoff, flags,
 					    vm_flags);

To:

	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
	if (!IS_ERR_VALUE(ret))
		return ret;

 	return mm_get_unmapped_area_vmflags(current->mm, filp, addr, len, pgoff, flags,
 					    vm_flags);

In thp_get_unmapped_area_vmflags().
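
To illustrate why this matters, here is a rough userspace sketch (not kernel
code) of the convention I mean. is_err_value() is a stand-in that mirrors what
IS_ERR_VALUE() checks; the alloc_*() stubs and get_area_with_fallback() are
made-up names purely for illustration. The point is that a helper returning 0
is treated as a valid address, and only a real error value triggers the
fallback:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel's IS_ERR_VALUE(): the top 4095 values of
 * unsigned long encode -errno. */
#define MAX_ERRNO 4095UL

static bool is_err_value(unsigned long x)
{
	return x >= (unsigned long)-MAX_ERRNO;
}

/* Hypothetical allocator type: returns an address or an encoded -errno. */
typedef unsigned long (*alloc_fn)(unsigned long len);

/* Illustrative stubs, not real kernel functions. */
static unsigned long alloc_fail(unsigned long len)
{
	(void)len;
	return (unsigned long)-ENOMEM;	/* aligned attempt failed */
}

static unsigned long alloc_zero(unsigned long len)
{
	(void)len;
	return 0;			/* address 0: valid in theory */
}

static unsigned long alloc_plain(unsigned long len)
{
	(void)len;
	return 0x10000UL;		/* unaligned fallback result */
}

/* Try the aligned allocator first; fall back only on a real error. */
static unsigned long get_area_with_fallback(alloc_fn aligned, alloc_fn plain,
					    unsigned long len)
{
	unsigned long ret = aligned(len);

	if (!is_err_value(ret))
		return ret;
	return plain(len);
}
```

With an 'if (ret)' test instead, the alloc_zero() case would be silently
treated as a failure, which is exactly the ambiguity I'd like to avoid in
anything exported.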

>
> >
> > 4. (sorry :) move the code to mm/mmap.c
>
> Commented above.  Note: I'm not saying it _can't_ be moved out, but it
> still makes sense to me to be in huge_memory.c.

No, this isn't acceptable sorry. I won't accept the patch as-is with this in
place.

>
> >
> > 5. Obviously address comments from others, most importantly (in my view)
> >    ensuring that there is a good kernel doc comment around the function.
> >
> > > 6. Put the justification for exporting the function + stuff about VFIO in
> >    the commit message + expand it a little bit as discussed.
>
> Please check if above version works for you.

It doesn't; you're not explaining at all what this function does. But even if you
did, the function is doing something that isn't at all general.

>
> >
> > 7. Other small stuff raised above (e.g. remove 'THP' comment etc.)
>
> I'll do this.

Well there's this at least :)

>
> >
> > Again, sorry to be a pain, but I think we need to be careful to get this
> > right so we don't leave any footguns for ourselves in the future with
> > 'implicit' stuff.
> >
> > Thanks!
> >
>
> Thanks,
>
> --
> Peter Xu
>

Yeah sorry but you really need to rethink this.

I appreciate you trying to de-duplicate here, but again we truly must have a
high bar for this kind of generalised function, because it's absolutely the kind
of foot-gun that'll come back to bite when somebody sees
mm_get_unmapped_area_aligned() and doesn't realise it's not in any way generic.

And any patch that does that will not show any reference to the zero returns,
etc., it'll not cc any of us, and people will just quietly break their code in
subtle ways.

Thanks!

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
  2025-06-13 14:18   ` Jason Gunthorpe
  2025-06-13 18:03   ` David Hildenbrand
@ 2025-06-14 14:46   ` kernel test robot
  2025-06-17 15:39     ` Peter Xu
  2 siblings, 1 reply; 77+ messages in thread
From: kernel test robot @ 2025-06-14 14:46 UTC (permalink / raw)
  To: Peter Xu, linux-kernel, linux-mm, kvm
  Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
	Alex Williamson, Zi Yan, Jason Gunthorpe, Alex Mastro,
	David Hildenbrand, Nico Pache, peterx

Hi Peter,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Deduplicate-mm_get_unmapped_area/20250613-214307
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250613134111.469884-5-peterx%40redhat.com
patch subject: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
config: sh-randconfig-002-20250614 (https://download.01.org/0day-ci/archive/20250614/202506142215.koMEU2rT-lkp@intel.com/config)
compiler: sh4-linux-gcc (GCC) 12.4.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250614/202506142215.koMEU2rT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506142215.koMEU2rT-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/vfio/vfio_main.c: In function 'vfio_device_get_unmapped_area':
>> drivers/vfio/vfio_main.c:1367:24: error: implicit declaration of function 'mm_get_unmapped_area'; did you mean 'get_unmapped_area'? [-Werror=implicit-function-declaration]
    1367 |                 return mm_get_unmapped_area(current->mm, file, addr,
         |                        ^~~~~~~~~~~~~~~~~~~~
         |                        get_unmapped_area
   cc1: some warnings being treated as errors


vim +1367 drivers/vfio/vfio_main.c

  1356	
  1357	static unsigned long vfio_device_get_unmapped_area(struct file *file,
  1358							   unsigned long addr,
  1359							   unsigned long len,
  1360							   unsigned long pgoff,
  1361							   unsigned long flags)
  1362	{
  1363		struct vfio_device_file *df = file->private_data;
  1364		struct vfio_device *device = df->device;
  1365	
  1366		if (!device->ops->get_unmapped_area)
> 1367			return mm_get_unmapped_area(current->mm, file, addr,
  1368						    len, pgoff, flags);
  1369	
  1370		return device->ops->get_unmapped_area(device, file, addr, len,
  1371						      pgoff, flags);
  1372	}
  1373	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
                     ` (4 preceding siblings ...)
  2025-06-13 18:00   ` David Hildenbrand
@ 2025-06-16  8:01   ` David Laight
  2025-06-17 21:13     ` Peter Xu
  5 siblings, 1 reply; 77+ messages in thread
From: David Laight @ 2025-06-16  8:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache

On Fri, 13 Jun 2025 09:41:07 -0400
Peter Xu <peterx@redhat.com> wrote:

> Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> the helper instead to dedup the lines.

Would it make more sense to make it an inline wrapper?
Moving the EXPORT_SYMBOL to mm_get_unmapped_area_vmflags.

	David

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/mmap.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 09c563c95112..422f5b9d9660 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -871,9 +871,8 @@ mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
>  		     unsigned long addr, unsigned long len,
>  		     unsigned long pgoff, unsigned long flags)
>  {
> -	if (test_bit(MMF_TOPDOWN, &mm->flags))
> -		return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags, 0);
> -	return arch_get_unmapped_area(file, addr, len, pgoff, flags, 0);
> +	return mm_get_unmapped_area_vmflags(mm, file, addr, len,
> +					    pgoff, flags, 0);
>  }
>  EXPORT_SYMBOL(mm_get_unmapped_area);
>  


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-14  5:23   ` Liam R. Howlett
@ 2025-06-16 12:14     ` Jason Gunthorpe
  2025-06-16 12:20       ` Lorenzo Stoakes
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-16 12:14 UTC (permalink / raw)
  To: Liam R. Howlett, Peter Xu, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache, Baolin Wang, Lorenzo Stoakes,
	Ryan Roberts, Dev Jain, Barry Song

On Sat, Jun 14, 2025 at 01:23:30AM -0400, Liam R. Howlett wrote:

> vm_unmapped_area_info has an align_mask, and that's only used for
> hugepages. It is wrong to have a generic function that does not use the
> generic struct element that exists for this reason.  Is there a reason
> that align_mask doesn't work, or why it's not used?

I had the same question and looked into it for a bit. It does seem
desirable, but also not entirely straightforward. I think the arch
code for arch_get_unmapped_area() needs some redesign to produce
the vm_unmapped_area_info() that the core code can update.

Unfortunately there are numerous weird things in the arches :\

Like x86 shouldn't be setting alignment for huge tlbfs files, that
should be done in the core code by huge tlbfs calling the new
mm_get_unmapped_area_aligned() on its own..

So I think we should leave this hacky implementation for now and start
building out the generic side to call it in the right places, then we
can consider how to implement a better integration with the arch code.

Also, probably 'aligned' is not the right name. This new function
should be called by VMA owners that know they have pgoff aligned high
order folios/pfns inside their mapping. The 'align' argument is the
max order of their pgoff aligned folio/pfns.

The purpose of the function is to adjust the resulting area to
optimize for the high order folios that are present while following
the uAPI rules for mmap.

Maybe call it something like _order and document it like the above?
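
For documentation purposes, the arithmetic involved can be sketched in
userspace roughly as below. This is only an illustration under my own
assumptions: the helper names are made up, 'size' is assumed to be a power of
two, and it leaves out the overflow checks and the topdown special case that
the real function has. The idea is to over-allocate by 'size' and then shift
the start so the VA and the file offset agree modulo 'size':

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of size; size must be a power of two. */
static uint64_t round_up_pow2(uint64_t x, uint64_t size)
{
	return (x + size - 1) & ~(size - 1);
}

/*
 * Does [off, off + len) contain at least one naturally aligned block of
 * 'size' bytes?  If not, aligning the VA gains nothing.
 */
static bool worth_aligning(uint64_t off, uint64_t len, uint64_t size)
{
	uint64_t off_end = off + len;
	uint64_t off_align = round_up_pow2(off, size);

	assert(size && (size & (size - 1)) == 0);
	return off_end > off_align && off_end - off_align >= size;
}

/*
 * Given the base of a VA range that was over-allocated by 'size' bytes,
 * return the lowest address >= base with addr % size == off % size.
 */
static uint64_t shift_to_match_offset(uint64_t base, uint64_t off,
				      uint64_t size)
{
	return base + ((off - base) & (size - 1));
}
```

So a 2M-aligned offset mapped into a range starting at 0x7f0000001000 would
be shifted up to 0x7f0000200000, while an offset that already shares the low
bits with the base is left where it is.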

> I also am not okay to export it for no reason.

The next patches are the reason.

> Also, is it okay to export something as gpl or does the copyright holder
> need to do that (I have no idea about this stuff, or maybe you work for
> the copyright holder)?

Yes, you are always safe to use the GPL export.

> The hint (addr) is also never checked for alignment in this function and
> we are appending _aligned() to the name.  With this change we can now
> get an unaligned _aligned() address.  This (probably) can happen with
> MAP_FIXED today, but I don't think we imply it's going to be aligned
> elsewhere.

Should be documented at least..

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-16 12:14     ` Jason Gunthorpe
@ 2025-06-16 12:20       ` Lorenzo Stoakes
  2025-06-16 12:26         ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Lorenzo Stoakes @ 2025-06-16 12:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Peter Xu, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache, Baolin Wang, Ryan Roberts,
	Dev Jain, Barry Song

On Mon, Jun 16, 2025 at 09:14:28AM -0300, Jason Gunthorpe wrote:
> Also, probably 'aligned' is not the right name. This new function
> should be called by VMA owners that know they have pgoff aligned high
> order folios/pfns inside their mapping. The 'align' argument is the
> max order of their pgoff aligned folio/pfns.
>
> The purpose of the function is to adjust the resulting area to
> optimize for the high order folios that are present while following
> the uAPI rules for mmap.
>
> Maybe call it something like _order and document it like the above?

Right, if it were made clear this is explicitly related to higher order
folios that would go a long way to making the generalisation more
acceptable.

But we definitely need to have it not filter errors if it's generic.

>
>
> > I also am not okay to export it for no reason.
>
> The next patches are the reason.

Regardless exporting it like this raises the bar for quality here.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned
  2025-06-16 12:20       ` Lorenzo Stoakes
@ 2025-06-16 12:26         ` Jason Gunthorpe
  0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-16 12:26 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Liam R. Howlett, Peter Xu, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache, Baolin Wang, Ryan Roberts,
	Dev Jain, Barry Song

On Mon, Jun 16, 2025 at 01:20:55PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jun 16, 2025 at 09:14:28AM -0300, Jason Gunthorpe wrote:
> > Also, probably 'aligned' is not the right name. This new function
> > should be called by VMA owners that know they have pgoff aligned high
> > order folios/pfns inside their mapping. The 'align' argument is the
> > max order of their pgoff aligned folio/pfns.
> >
> > The purpose of the function is to adjust the resulting area to
> > optimize for the high order folios that are present while following
> > the uAPI rules for mmap.
> >
> > Maybe call it something like _order and document it like the above?
> 
> Right, if it were made clear this is explicitly related to higher order
> folios that would go a long way to making the generalisation more
> acceptable.
> 
> But we definitely need to have it not filter errors if it's generic.
> 
> >
> >
> > > I also am not okay to export it for no reason.
> >
> > The next patches are the reason.
> 
> Regardless exporting it like this raises the bar for quality here.

Yes, it is also possible we have the wrong op, I know
get_unmapped_area() pre-exists, but if we are really cleaning this
stuff then something like get_max_pte_order() is probably a saner op.

It would return the size of the biggest pgoff aligned folio/pfn within
the file. Then the core code would do the special logic without
exporting this function.

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-13 23:16           ` Jason Gunthorpe
@ 2025-06-16 22:06             ` Peter Xu
  2025-06-16 23:00               ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-16 22:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Fri, Jun 13, 2025 at 08:16:57PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 13, 2025 at 03:15:19PM -0400, Peter Xu wrote:
> > > > > > +	if (phys_len >= PMD_SIZE) {
> > > > > > +		ret = mm_get_unmapped_area_aligned(file, addr, len, phys_addr,
> > > > > > +						   flags, PMD_SIZE, 0);
> > > > > > +		if (ret)
> > > > > > +			return ret;
> > > > > > +	}
> > > > > 
> > > > > Hurm, we have contiguous pages now, so PMD_SIZE is not so great, eg on
> > > > > > 4k ARM we can have a 16*2M=32MB contiguity, and 16k ARM uses
> > > > > > contiguity to get a 32*32M=1GB option.
> > > > > 
> > > > > Forcing to only align to the PMD or PUD seems suboptimal..
> > > > 
> > > > Right, however the cont-pte / cont-pmd are still not supported in huge
> > > > pfnmaps in general?  It'll definitely be nice if someone could look at that
> > > > from ARM perspective, then provide support of both in one shot.
> > > 
> > > Maybe leave behind a comment about this. I've been poking around if
> > > > someone would do the ARM PFNMAP support but can't report any commitment.
> > 
> > I didn't know what's the best part to take a note for the whole pfnmap
> > effort, but I added a note into the commit message on this patch:
> > 
> >         Note 2: Currently continuous pgtable entries (for example, cont-pte) is not
> >         yet supported for huge pfnmaps in general.  It also is not considered in
> >         this patch so far.  Separate work will be needed to enable continuous
> >         pgtable entries on archs that support it.
> > 
> > > 
> > > > > > +fallback:
> > > > > > +	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
> > > > > 
> > > > > Why not put this into mm_get_unmapped_area_vmflags() and get rid of
> > > > > thp_get_unmapped_area_vmflags() too?
> > > > > 
> > > > > Is there any reason the caller should have to do a retry?
> > > > 
> > > > We would still need thp_get_unmapped_area_vmflags() because that encodes
> > > > PMD_SIZE for THPs; we need the flexibility of providing any size alignment
> > > > as a generic helper.
> > > 
> > > There is only one caller for thp_get_unmapped_area_vmflags(), just
> > > open code PMD_SIZE there and thin this whole thing out. It reads
> > > better like that anyhow:
> > > 
> > > 	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && !file
> > > 		   && !addr /* no hint */
> > > 		   && IS_ALIGNED(len, PMD_SIZE)) {
> > > 		/* Ensures that larger anonymous mappings are THP aligned. */
> > > 		addr = mm_get_unmapped_area_aligned(file, 0, len, pgoff,
> > > 						    flags, vm_flags, PMD_SIZE);
> > > 
> > > > That was ok, however that loses some flexibility when the caller wants to
> > > > try with different alignments, exactly like above: currently, it was trying
> > > > to do a first attempt of PUD mapping then fallback to PMD if that fails.
> > > 
> > > Oh, that's a good point, I didn't notice that subtle bit.
> > > 
> > > But then maybe that is showing the API is just wrong and the core code
> > > should be trying to find the best alignment not the caller. Like we
> > > can have those PUD/PMD size ifdefs inside the mm instead of in VFIO?
> > > 
> > > VFIO would just pass the BAR size, implying the best alignment, and
> > > the core implementation will try to get the largest VMA alignment that
> > > snaps to an arch supported page contiguity, testing each of the arches
> > > page size possibilities in turn.
> > > 
> > > That sounds like a much better API than pushing this into drivers??
> > 
> > Yes it would be nice if the core mm can evolve to make supporting such
> > easier.  Though the question is how to pass information over to core mm.
> 
> I was just thinking something simple, change how your new 
> mm_get_unmapped_area_aligned() works so that the caller is expected to
> pass in the size of the biggest folio/pfn page in as
> align.
> 
> The mm_get_unmapped_area_aligned() returns a vm address that
> will result in large mappings.
> 
> pgoff works the same way, the assumption is the biggest folio is at
> pgoff 0 and followed by another biggest folio so the pgoff logic tries
> to make the second folio map fully.
> 
> ie what a hugetlb fd or thp memfd would like.
> 
> Then you still hook the file operations and still figure out what BAR
> and so on to call mm_get_unmapped_area_aligned() with the correct
> aligned parameter.
> 
> mm_get_unmapped_area_aligned() goes through the supported page sizes
> of the arch and selects the best one for the indicated biggest folio
> 
> If we were happy writing this in vfio then it can work just as well in
> the core mm side.

So far, the new vfio_pci_core_get_unmapped_area() almost does VFIO's own
stuff, except that it does retry with different sizes.

Can I understand it as a suggestion to pass in a bitmask into the core mm
API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
constant "align", so that core mm would try to allocate from the largest
size to smaller until it finds some working VA to use?

> 
> > It's similar to many other use cases of get_unmapped_area() users.  For
> > example, see v4l2_m2m_get_unmapped_area() which has similar treatment on at
> > least knowing which part of the file was being mapped:
> > 
> > 	if (offset < DST_QUEUE_OFF_BASE) {
> > 		vq = v4l2_m2m_get_src_vq(fh->m2m_ctx);
> > 	} else {
> > 		vq = v4l2_m2m_get_dst_vq(fh->m2m_ctx);
> > 		pgoff -= (DST_QUEUE_OFF_BASE >> PAGE_SHIFT);
> > 	}
> 
> > Careful, that's only used for nommu :)

My fault, please ignore it.. :)

I'm also surprised it is even available for !MMU.. but I decided to not dig
anymore today on that.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-16 22:06             ` Peter Xu
@ 2025-06-16 23:00               ` Jason Gunthorpe
  2025-06-17 20:56                 ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-16 23:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:

> Can I understand it as a suggestion to pass in a bitmask into the core mm
> API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> constant "align", so that core mm would try to allocate from the largest
> size to smaller until it finds some working VA to use?

I don't think you need a bitmask.

Split the concerns, the caller knows what is inside its FD. It only
needs to provide the highest pgoff aligned folio/pfn within the FD.

The mm knows what leaf page tables options exist. It should try to
align to the closest leaf page table size that is <= the FD's max
aligned folio.

Higher alignment would be wasteful of address space.

Lower alignment misses an opportunity to create large leaf PTEs.
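
As a rough sketch of what I mean by the mm side of that split (userspace C,
purely illustrative): the leaf size table below assumes an x86-64-like layout
of 4K/2M/1G, and pick_leaf_align() is a made-up name, not an existing kernel
function. A real version would take the sizes from the arch, including any
contiguous-hint sizes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed leaf sizes, smallest first (x86-64-like: PTE, PMD, PUD). */
static const uint64_t leaf_sizes[] = {
	1ULL << 12,	/* 4K */
	1ULL << 21,	/* 2M */
	1ULL << 30,	/* 1G */
};

/*
 * Return the largest leaf size that is <= max_folio, or 0 if max_folio is
 * smaller than the base page size.
 */
static uint64_t pick_leaf_align(uint64_t max_folio)
{
	uint64_t best = 0;
	size_t i;

	for (i = 0; i < sizeof(leaf_sizes) / sizeof(leaf_sizes[0]); i++) {
		if (leaf_sizes[i] <= max_folio)
			best = leaf_sizes[i];
	}
	return best;
}
```

With that table a 32MB BAR would get 2M alignment and a 4GB BAR would get 1G
alignment, and the driver never has to know which leaf sizes exist.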

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-14 14:46   ` kernel test robot
@ 2025-06-17 15:39     ` Peter Xu
  2025-06-17 15:41       ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 15:39 UTC (permalink / raw)
  To: kernel test robot
  Cc: linux-kernel, linux-mm, kvm, oe-kbuild-all, Andrew Morton,
	Alex Williamson, Zi Yan, Jason Gunthorpe, Alex Mastro,
	David Hildenbrand, Nico Pache

On Sat, Jun 14, 2025 at 10:46:45PM +0800, kernel test robot wrote:
> Hi Peter,
> 
> kernel test robot noticed the following build errors:
> 
> [auto build test ERROR on akpm-mm/mm-everything]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Peter-Xu/mm-Deduplicate-mm_get_unmapped_area/20250613-214307
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
> patch link:    https://lore.kernel.org/r/20250613134111.469884-5-peterx%40redhat.com
> patch subject: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
> config: sh-randconfig-002-20250614 (https://download.01.org/0day-ci/archive/20250614/202506142215.koMEU2rT-lkp@intel.com/config)
> compiler: sh4-linux-gcc (GCC) 12.4.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250614/202506142215.koMEU2rT-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202506142215.koMEU2rT-lkp@intel.com/
> 
> All errors (new ones prefixed by >>):
> 
>    drivers/vfio/vfio_main.c: In function 'vfio_device_get_unmapped_area':
> >> drivers/vfio/vfio_main.c:1367:24: error: implicit declaration of function 'mm_get_unmapped_area'; did you mean 'get_unmapped_area'? [-Werror=implicit-function-declaration]
>     1367 |                 return mm_get_unmapped_area(current->mm, file, addr,
>          |                        ^~~~~~~~~~~~~~~~~~~~
>          |                        get_unmapped_area
>    cc1: some warnings being treated as errors
> 
> 
> vim +1367 drivers/vfio/vfio_main.c
> 
>   1356	
>   1357	static unsigned long vfio_device_get_unmapped_area(struct file *file,
>   1358							   unsigned long addr,
>   1359							   unsigned long len,
>   1360							   unsigned long pgoff,
>   1361							   unsigned long flags)
>   1362	{
>   1363		struct vfio_device_file *df = file->private_data;
>   1364		struct vfio_device *device = df->device;
>   1365	
>   1366		if (!device->ops->get_unmapped_area)
> > 1367			return mm_get_unmapped_area(current->mm, file, addr,
>   1368						    len, pgoff, flags);
>   1369	
>   1370		return device->ops->get_unmapped_area(device, file, addr, len,
>   1371						      pgoff, flags);
>   1372	}
>   1373	

This is "ARCH_SH + VFIO + !MMU".  I'll make sure to cover this config when
I repost.  I'll squash the change below into the patch:

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 19db8e58d223..cc14884d282f 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1354,6 +1354,7 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
        return device->ops->mmap(device, vma);
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
 static unsigned long vfio_device_get_unmapped_area(struct file *file,
                                                   unsigned long addr,
                                                   unsigned long len,
@@ -1370,6 +1371,7 @@ static unsigned long vfio_device_get_unmapped_area(struct file *file,
        return device->ops->get_unmapped_area(device, file, addr, len,
                                              pgoff, flags);
 }
+#endif
 
 const struct file_operations vfio_device_fops = {
        .owner          = THIS_MODULE,
@@ -1380,7 +1382,9 @@ const struct file_operations vfio_device_fops = {
        .unlocked_ioctl = vfio_device_fops_unl_ioctl,
        .compat_ioctl   = compat_ptr_ioctl,
        .mmap           = vfio_device_fops_mmap,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
        .get_unmapped_area = vfio_device_get_unmapped_area,
+#endif
 };
 
 static struct vfio_device *vfio_device_from_file(struct file *file)

-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 15:39     ` Peter Xu
@ 2025-06-17 15:41       ` Jason Gunthorpe
  2025-06-17 16:47         ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 15:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 11:39:07AM -0400, Peter Xu wrote:
>  
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
>  static unsigned long vfio_device_get_unmapped_area(struct file *file,
>                                                    unsigned long addr,
>                                                    unsigned long len,
> @@ -1370,6 +1371,7 @@ static unsigned long vfio_device_get_unmapped_area(struct file *file,
>         return device->ops->get_unmapped_area(device, file, addr, len,
>                                               pgoff, flags);
>  }
> +#endif
>  
>  const struct file_operations vfio_device_fops = {
>         .owner          = THIS_MODULE,
> @@ -1380,7 +1382,9 @@ const struct file_operations vfio_device_fops = {
>         .unlocked_ioctl = vfio_device_fops_unl_ioctl,
>         .compat_ioctl   = compat_ptr_ioctl,
>         .mmap           = vfio_device_fops_mmap,
> +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
>         .get_unmapped_area = vfio_device_get_unmapped_area,
> +#endif
>  };

IMHO this also seems like something the core code should be dealing
with and not putting weird ifdefs in drivers.

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 15:41       ` Jason Gunthorpe
@ 2025-06-17 16:47         ` Peter Xu
  2025-06-17 19:39           ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 16:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 12:41:57PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 11:39:07AM -0400, Peter Xu wrote:
> >  
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> >  static unsigned long vfio_device_get_unmapped_area(struct file *file,
> >                                                    unsigned long addr,
> >                                                    unsigned long len,
> > @@ -1370,6 +1371,7 @@ static unsigned long vfio_device_get_unmapped_area(struct file *file,
> >         return device->ops->get_unmapped_area(device, file, addr, len,
> >                                               pgoff, flags);
> >  }
> > +#endif
> >  
> >  const struct file_operations vfio_device_fops = {
> >         .owner          = THIS_MODULE,
> > @@ -1380,7 +1382,9 @@ const struct file_operations vfio_device_fops = {
> >         .unlocked_ioctl = vfio_device_fops_unl_ioctl,
> >         .compat_ioctl   = compat_ptr_ioctl,
> >         .mmap           = vfio_device_fops_mmap,
> > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> >         .get_unmapped_area = vfio_device_get_unmapped_area,
> > +#endif
> >  };
> 
> IMHO this also seems like something the core code should be dealing
> with and not putting weird ifdefs in drivers.

It may depend on whether we still want to fall back to
mm_get_unmapped_area().  I get your point in the other email but haven't
had a chance to reply yet.  I'll try that out to see how it looks and reply
there.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 16:47         ` Peter Xu
@ 2025-06-17 19:39           ` Peter Xu
  2025-06-17 19:46             ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 19:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 12:47:35PM -0400, Peter Xu wrote:
> On Tue, Jun 17, 2025 at 12:41:57PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 17, 2025 at 11:39:07AM -0400, Peter Xu wrote:
> > >  
> > > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > >  static unsigned long vfio_device_get_unmapped_area(struct file *file,
> > >                                                    unsigned long addr,
> > >                                                    unsigned long len,
> > > @@ -1370,6 +1371,7 @@ static unsigned long vfio_device_get_unmapped_area(struct file *file,
> > >         return device->ops->get_unmapped_area(device, file, addr, len,
> > >                                               pgoff, flags);
> > >  }
> > > +#endif
> > >  
> > >  const struct file_operations vfio_device_fops = {
> > >         .owner          = THIS_MODULE,
> > > @@ -1380,7 +1382,9 @@ const struct file_operations vfio_device_fops = {
> > >         .unlocked_ioctl = vfio_device_fops_unl_ioctl,
> > >         .compat_ioctl   = compat_ptr_ioctl,
> > >         .mmap           = vfio_device_fops_mmap,
> > > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > >         .get_unmapped_area = vfio_device_get_unmapped_area,
> > > +#endif
> > >  };
> > 
> > IMHO this also seems like something the core code should be dealing
> > with and not putting weird ifdefs in drivers.
> 
> It may depend on whether we want to still do the fallbacks to
> mm_get_unmapped_area().  I get your point in the other email but not yet
> get a chance to reply.  I'll try that out to see how it looks and reply
> there.

I just noticed this case is unfortunate and special; I don't yet see a way
to avoid the fallback here.

Note that this is the vfio_device's fallback: even if the new helper
(whatever we name it..) could do the fallback internally, vfio_device would
still need access to mm_get_unmapped_area() to make this config build pass.

So I think I'll need my fixup here..

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 19:39           ` Peter Xu
@ 2025-06-17 19:46             ` Jason Gunthorpe
  2025-06-17 20:01               ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 19:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 03:39:19PM -0400, Peter Xu wrote:
> On Tue, Jun 17, 2025 at 12:47:35PM -0400, Peter Xu wrote:
> > On Tue, Jun 17, 2025 at 12:41:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 17, 2025 at 11:39:07AM -0400, Peter Xu wrote:
> > > >  
> > > > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > > >  static unsigned long vfio_device_get_unmapped_area(struct file *file,
> > > >                                                    unsigned long addr,
> > > >                                                    unsigned long len,
> > > > @@ -1370,6 +1371,7 @@ static unsigned long vfio_device_get_unmapped_area(struct file *file,
> > > >         return device->ops->get_unmapped_area(device, file, addr, len,
> > > >                                               pgoff, flags);
> > > >  }
> > > > +#endif
> > > >  
> > > >  const struct file_operations vfio_device_fops = {
> > > >         .owner          = THIS_MODULE,
> > > > @@ -1380,7 +1382,9 @@ const struct file_operations vfio_device_fops = {
> > > >         .unlocked_ioctl = vfio_device_fops_unl_ioctl,
> > > >         .compat_ioctl   = compat_ptr_ioctl,
> > > >         .mmap           = vfio_device_fops_mmap,
> > > > +#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
> > > >         .get_unmapped_area = vfio_device_get_unmapped_area,
> > > > +#endif
> > > >  };
> > > 
> > > IMHO this also seems like something the core code should be dealing
> > > with and not putting weird ifdefs in drivers.
> > 
> > It may depend on whether we want to still do the fallbacks to
> > mm_get_unmapped_area().  I get your point in the other email but not yet
> > get a chance to reply.  I'll try that out to see how it looks and reply
> > there.
> 
> I just noticed this is unfortunate and special; I yet don't see a way to
> avoid the fallback here.
> 
> Note that this is the vfio_device's fallback, even if the new helper
> (whatever we name it..) could do fallback internally, vfio_device still
> would need to be accessible to mm_get_unmapped_area() to make this config
> build pass.

I don't understand this remark?

get_unmapped_area is not conditional on CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?

Some new mm_get_unmapped_area_aligned() should not be conditional on
CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP? (This is Lorenzo's and Liam's remark)

So what is VFIO doing that requires CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 19:46             ` Jason Gunthorpe
@ 2025-06-17 20:01               ` Peter Xu
  2025-06-17 23:00                 ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 20:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 04:46:21PM -0300, Jason Gunthorpe wrote:
> > I just noticed this is unfortunate and special; I yet don't see a way to
> > avoid the fallback here.
> > 
> > Note that this is the vfio_device's fallback, even if the new helper
> > (whatever we name it..) could do fallback internally, vfio_device still
> > would need to be accessible to mm_get_unmapped_area() to make this config
> > build pass.
> 
> I don't understand this remark?
> 
> get_unmapped_area is not conditional on CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?
> 
> Some new mm_get_unmapped_area_aligned() should not be conditional on
> CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP? (This is Lorenzo's and Liam's remark)

Yes, this will be addressed.

> 
> So what is VFIO doing that requires CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?

It's the fallback part for the vfio device, not the vfio_pci device.  The
vfio_pci device doesn't need this special treatment after moving to the new
helper, because that hides everything.  vfio_device still needs it.

So, we have two ops that need to be touched to support this:

        vfio_device_fops
        vfio_pci_ops

For the first one, vfio_device_fops.get_unmapped_area() needs its own
fallback, which must be mm_get_unmapped_area() to keep the old behavior, and
that is defined only if CONFIG_MMU.

IOW, if one day file_operations.get_unmapped_area() allows some other
retval to fall back to the default (mm_get_unmapped_area()), then we won't
need this special ifdef.  But it's not ready for that yet..
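
For illustration only, here is a minimal userspace sketch of the "zero
means fall back to the default" convention described above.  All names
(demo_ops, demo_default_va, the hook functions, DEMO_DEFAULT_VA) are
hypothetical stand-ins and not kernel API; demo_default_va() merely plays
the role of mm_get_unmapped_area().

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DEFAULT_VA 0x10000000UL	/* arbitrary value for the demo */

/* Hypothetical stand-in for the default allocator (mm_get_unmapped_area()). */
static unsigned long demo_default_va(unsigned long addr, unsigned long len)
{
	(void)len;
	return addr ? addr : DEMO_DEFAULT_VA;
}

/* Hypothetical per-driver ops; a NULL hook or a zero return means "no opinion". */
struct demo_ops {
	unsigned long (*get_unmapped_area)(unsigned long addr, unsigned long len);
};

/* Wrapper: ask the driver hook first, then fall back to the default. */
static unsigned long demo_get_unmapped_area(const struct demo_ops *ops,
					    unsigned long addr,
					    unsigned long len)
{
	unsigned long ret;

	assert(ops != NULL);

	if (ops->get_unmapped_area) {
		ret = ops->get_unmapped_area(addr, len);
		if (ret)
			return ret;
	}
	return demo_default_va(addr, len);
}

/* A hook that declines to pick an address. */
static unsigned long demo_hook_declines(unsigned long addr, unsigned long len)
{
	(void)addr;
	(void)len;
	return 0;
}

/* A hook that always picks a fixed (demo) address. */
static unsigned long demo_hook_picks(unsigned long addr, unsigned long len)
{
	(void)addr;
	(void)len;
	return 0x40000000UL;
}
```

The point of the sketch is that the wrapper itself has to call the default
allocator, so the default allocator must be visible wherever the wrapper is
built.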

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-16 23:00               ` Jason Gunthorpe
@ 2025-06-17 20:56                 ` Peter Xu
  2025-06-17 23:18                   ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 20:56 UTC (permalink / raw)
  To: Jason Gunthorpe, Liam R. Howlett, Lorenzo Stoakes
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Alex Mastro, David Hildenbrand, Nico Pache

On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> 
> > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > constant "align", so that core mm would try to allocate from the largest
> > size to smaller until it finds some working VA to use?
> 
> I don't think you need a bitmask.
> 
> Split the concerns, the caller knows what is inside it's FD. It only
> needs to provide the highest pgoff aligned folio/pfn within the FD.

Ultimately I even dropped this hint.  I found that it's not really
get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
So I decided to avoid this parameter for now.
> 
> The mm knows what leaf page tables options exist. It should try to
> align to the closest leaf page table size that is <= the FD's max
> aligned folio.

So again IMHO this is also not per-FD information, but needs to be passed
over from the driver for each call.

The "order" parameter that appeared in other discussions is likely meant to
imply a maximum supported size from the driver side (or for a folio, but
that is definitely another user, for after this series lands).

So far I haven't added the "order", because currently VFIO definitely
supports all max orders the system supports.  Maybe we can add the order
when there's a real need, but maybe it won't happen in the near future?
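
For illustration only, the "try the largest leaf size first, then smaller
ones, then give up on alignment" idea can be sketched in userspace as
follows.  The sizes assume x86-64 with 4 KiB base pages, and every name
here (demo_best_effort_va, demo_alloc_fn, the toy allocators) is
hypothetical rather than kernel API.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PMD_SIZE (1UL << 21)	/* 2 MiB, assumed */
#define DEMO_PUD_SIZE (1UL << 30)	/* 1 GiB, assumed */

/* Hypothetical allocator hook: returns an aligned VA, or 0 if none was found. */
typedef unsigned long (*demo_alloc_fn)(unsigned long len, unsigned long align);

/*
 * Try each candidate alignment from largest to smallest.  Skip candidates
 * that are larger than the mapping or unsupported, and return fallback_va
 * when no aligned allocation succeeds.
 */
static unsigned long demo_best_effort_va(unsigned long len, int have_pud,
					 demo_alloc_fn alloc_aligned,
					 unsigned long fallback_va)
{
	const unsigned long candidates[] = { DEMO_PUD_SIZE, DEMO_PMD_SIZE };
	size_t i;

	assert(alloc_aligned != NULL);

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		unsigned long align = candidates[i];
		unsigned long va;

		if (align == DEMO_PUD_SIZE && !have_pud)
			continue;
		if (len < align)
			continue;
		va = alloc_aligned(len, align);
		if (va)
			return va;
	}
	return fallback_va;
}

/* Toy allocator that always succeeds with a trivially aligned address. */
static unsigned long demo_alloc_ok(unsigned long len, unsigned long align)
{
	(void)len;
	return align;
}

/* Toy allocator that only succeeds for PMD-sized alignment. */
static unsigned long demo_alloc_pmd_only(unsigned long len, unsigned long align)
{
	(void)len;
	return align == DEMO_PMD_SIZE ? 3 * align : 0;
}
```

With demo_alloc_pmd_only(), a 1 GiB request first fails the PUD attempt and
then succeeds with a PMD-aligned address, which is the best-effort behavior
being discussed.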

In summary, I came up with the changes below; do they look reasonable?

I also added Liam and Lorenzo to this reply.

================8<================

From 7f1b7aada21ab036849edc49635fb0656e0457c4 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 May 2025 12:45:55 -0400
Subject: [PATCH 1/4] mm: Rename __thp_get_unmapped_area to
 mm_get_unmapped_area_aligned

This function is handy for locating an unmapped region that is best aligned
to the specified alignment, taking whatever form of pgoff address space
into consideration.

Rename the function and make it more general so that follow-up patches can
use it even for non-THP cases.  Drop "THP" from the name because internally
it doesn't have much to do with THP.  The suffix "_aligned" implies it is a
helper to generate an aligned virtual address based on what is specified
(which need not be PMD_SIZE).

While at it, use the check_add_overflow() helpers to verify the inputs and
make sure no overflow will happen.

Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/huge_memory.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4734de1dc0ae..885b5845dbba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1088,23 +1088,28 @@ static inline bool is_transparent_hugepage(const struct folio *folio)
 		folio_test_large_rmappable(folio);
 }
 
-static unsigned long __thp_get_unmapped_area(struct file *filp,
+static unsigned long mm_get_unmapped_area_aligned(struct file *filp,
 		unsigned long addr, unsigned long len,
-		loff_t off, unsigned long flags, unsigned long size,
+		loff_t off, unsigned long flags, unsigned long align,
 		vm_flags_t vm_flags)
 {
-	loff_t off_end = off + len;
-	loff_t off_align = round_up(off, size);
+	loff_t off_end;
+	loff_t off_align = round_up(off, align);
 	unsigned long len_pad, ret, off_sub;
 
 	if (!IS_ENABLED(CONFIG_64BIT) || in_compat_syscall())
 		return 0;
 
-	if (off_end <= off_align || (off_end - off_align) < size)
+	/* Can't use the overflow API, do manual check for now */
+	if (off_align < off)
 		return 0;
-
-	len_pad = len + size;
-	if (len_pad < len || (off + len_pad) < off)
+	if (check_add_overflow(off, len, &off_end))
+		return 0;
+	if (off_end <= off_align || (off_end - off_align) < align)
+		return 0;
+	if (check_add_overflow(len, align, &len_pad))
+		return 0;
+	if ((off + len_pad) < off)
 		return 0;
 
 	ret = mm_get_unmapped_area_vmflags(current->mm, filp, addr, len_pad,
@@ -1118,16 +1123,16 @@ static unsigned long __thp_get_unmapped_area(struct file *filp,
 		return 0;
 
 	/*
-	 * Do not try to align to THP boundary if allocation at the address
+	 * Do not try to provide alignment if allocation at the address
 	 * hint succeeds.
 	 */
 	if (ret == addr)
 		return addr;
 
-	off_sub = (off - ret) & (size - 1);
+	off_sub = (off - ret) & (align - 1);
 
 	if (test_bit(MMF_TOPDOWN, &current->mm->flags) && !off_sub)
-		return ret + size;
+		return ret + align;
 
 	ret += off_sub;
 	return ret;
@@ -1140,7 +1145,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE, vm_flags);
+	ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+					   PMD_SIZE, vm_flags);
 	if (ret)
 		return ret;
 
-- 
2.49.0
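
For illustration only (not part of the patch): the core arithmetic of
mm_get_unmapped_area_aligned() above, i.e. shifting the base of a padded
range so that the VA and the file offset are congruent modulo the
alignment, can be reproduced in userspace.  The function name is
hypothetical, and the MMF_TOPDOWN special case is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the base of a range that was over-allocated by `align` bytes,
 * return the first address >= base whose low bits match those of `off`,
 * so that (result - off) is a multiple of `align`.
 * `align` must be a power of two.
 */
static uint64_t demo_align_va_to_offset(uint64_t base, uint64_t off,
					uint64_t align)
{
	uint64_t off_sub;

	assert(align != 0 && (align & (align - 1)) == 0);

	/* Distance from base to the next address congruent to off. */
	off_sub = (off - base) & (align - 1);
	return base + off_sub;
}
```

Because off_sub is always smaller than align, the shifted range still fits
inside the padded allocation of len + align bytes.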


From 709379a39f4a59a6d3bda7a39ca55f08fdaf9e1a Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Tue, 17 Jun 2025 15:27:07 -0400
Subject: [PATCH 2/4] mm: huge_mapping_get_va_aligned() helper

Add this helper to allocate a VA that would be best to map huge mappings
that the system would support. It can be used in file's get_unmapped_area()
functions as long as proper max_pgoff will be provided so that core mm will
know the available range of pgoff to map in the future.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/linux/huge_mm.h | 10 ++++++++-
 mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 1 deletion(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2f190c90192d..59fdafb1034b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -339,7 +339,8 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags,
 		vm_flags_t vm_flags);
-
+unsigned long huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags);
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
@@ -543,6 +544,13 @@ thp_get_unmapped_area_vmflags(struct file *filp, unsigned long addr,
 	return 0;
 }
 
+static inline unsigned long
+huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+}
+
 static inline bool
 can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 885b5845dbba..bc016b656dc7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1161,6 +1161,52 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);
 
+/**
+ * huge_mapping_get_va_aligned: best-effort VA allocation for huge mappings
+ *
+ * @filp: file target of the mmap() request
+ * @addr: hint address from mmap() request
+ * @len: len of the mmap() request
+ * @pgoff: file offset of the mmap() request
+ * @flags: flags of the mmap() request
+ *
+ * This function should normally be used by a driver's specific
+ * get_unmapped_area() handler to provide a huge-mapping friendly virtual
+ * address for a specific mmap() request.  The caller should pass in most
+ * of the parameters from the get_unmapped_area() request.
+ *
+ * Normally it means the caller's mmap() needs to also support any possible
+ * huge mappings the system supports.
+ *
+ * Return: a best-effort virtual address that will satisfy the most huge
+ * mappings for the result VMA to be mapped.
+ */
+unsigned long huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
+		unsigned long len, unsigned long pgoff, unsigned long flags)
+{
+	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+	unsigned long ret;
+
+	/* TODO: support continuous ptes/pmds */
+	if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
+	    len >= PUD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+						   PUD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	if (len >= PMD_SIZE) {
+		ret = mm_get_unmapped_area_aligned(filp, addr, len, off, flags,
+						   PMD_SIZE, 0);
+		if (ret)
+			return ret;
+	}
+
+	return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+}
+EXPORT_SYMBOL_GPL(huge_mapping_get_va_aligned);
+
 static struct folio *vma_alloc_anon_folio_pmd(struct vm_area_struct *vma,
 		unsigned long addr)
 {
-- 
2.49.0


From ff90dbba05ea54e5c6690fbedf330c837f8f0ea1 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 4 Jun 2025 17:54:40 -0400
Subject: [PATCH 3/4] vfio: Introduce vfio_device_ops.get_unmapped_area hook

Add a hook to vfio_device_ops to allow sub-modules to provide virtual
addresses for an mmap() request.

Note that the fallback will be mm_get_unmapped_area(), which should
maintain the old behavior of generic VA allocation (__get_unmapped_area).
It's a bit unfortunate that this is needed, as the current
get_unmapped_area() file op cannot return a value that falls back to the
default.  So the fallback is needed both here and wherever a sub-module
opts in with its own.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/vfio_main.c | 25 +++++++++++++++++++++++++
 include/linux/vfio.h     |  8 ++++++++
 2 files changed, 33 insertions(+)

diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
index 1fd261efc582..480cc2398810 100644
--- a/drivers/vfio/vfio_main.c
+++ b/drivers/vfio/vfio_main.c
@@ -1354,6 +1354,28 @@ static int vfio_device_fops_mmap(struct file *filep, struct vm_area_struct *vma)
 	return device->ops->mmap(device, vma);
 }
 
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+static unsigned long vfio_device_get_unmapped_area(struct file *file,
+						   unsigned long addr,
+						   unsigned long len,
+						   unsigned long pgoff,
+						   unsigned long flags)
+{
+	struct vfio_device_file *df = file->private_data;
+	struct vfio_device *device = df->device;
+	unsigned long ret;
+
+	if (device->ops->get_unmapped_area) {
+		ret = device->ops->get_unmapped_area(device, file, addr,
+						      len, pgoff, flags);
+		if (ret)
+			return ret;
+	}
+
+	return mm_get_unmapped_area(current->mm, file, addr, len, pgoff, flags);
+}
+#endif
+
 const struct file_operations vfio_device_fops = {
 	.owner		= THIS_MODULE,
 	.open		= vfio_device_fops_cdev_open,
@@ -1363,6 +1385,9 @@ const struct file_operations vfio_device_fops = {
 	.unlocked_ioctl	= vfio_device_fops_unl_ioctl,
 	.compat_ioctl	= compat_ptr_ioctl,
 	.mmap		= vfio_device_fops_mmap,
+#ifdef CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP
+	.get_unmapped_area = vfio_device_get_unmapped_area,
+#endif
 };
 
 static struct vfio_device *vfio_device_from_file(struct file *file)
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 707b00772ce1..d900541e2716 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -108,6 +108,8 @@ struct vfio_device {
  * @dma_unmap: Called when userspace unmaps IOVA from the container
  *             this device is attached to.
  * @device_feature: Optional, fill in the VFIO_DEVICE_FEATURE ioctl
+ * @get_unmapped_area: Optional, provide virtual address hint for mmap().
+ *                     If zero is returned, fallback to the default allocator.
  */
 struct vfio_device_ops {
 	char	*name;
@@ -135,6 +137,12 @@ struct vfio_device_ops {
 	void	(*dma_unmap)(struct vfio_device *vdev, u64 iova, u64 length);
 	int	(*device_feature)(struct vfio_device *device, u32 flags,
 				  void __user *arg, size_t argsz);
+	unsigned long (*get_unmapped_area)(struct vfio_device *device,
+					   struct file *file,
+					   unsigned long addr,
+					   unsigned long len,
+					   unsigned long pgoff,
+					   unsigned long flags);
 };
 
 #if IS_ENABLED(CONFIG_IOMMUFD)
-- 
2.49.0

From 38539aafac83ae204d3e03f441f7e33841db6b07 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 May 2025 13:21:20 -0400
Subject: [PATCH 4/4] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED
 mappings

This patch enables best-effort mmap() for vfio-pci bars even without
MAP_FIXED, so as to utilize huge pfnmaps as much as possible.  It should
also avoid userspace changes (switching to MAP_FIXED with pre-aligned VA
addresses) to start enabling huge pfnmaps on VFIO bars.

Here the trick is to make sure the MMIO PFNs are aligned with the VAs
allocated from mmap() when !MAP_FIXED, so that whatever is returned from
mmap(!MAP_FIXED) of vfio-pci MMIO regions is automatically suitable for
huge pfnmaps as much as possible.

To achieve that, a custom vfio_device's get_unmapped_area() for vfio-pci
devices is needed.

Note, MMIO physical addresses should normally be guaranteed to be always
bar-size aligned, hence the bar offset can logically be directly used to do
the calculation.  However to make it strict and clear (rather than relying
on spec details), we still try to fetch the bar's physical addresses from
pci_dev.resource[].

[1] https://lore.kernel.org/linux-pci/20250529214414.1508155-1-amastro@fb.com/

Reported-by: Alex Mastro <amastro@fb.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/vfio/pci/vfio_pci.c      |  1 +
 drivers/vfio/pci/vfio_pci_core.c | 34 ++++++++++++++++++++++++++++++++
 include/linux/vfio_pci_core.h    |  3 +++
 3 files changed, 38 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5ba39f7623bb..32b570f17d0f 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -144,6 +144,7 @@ static const struct vfio_device_ops vfio_pci_ops = {
 	.detach_ioas	= vfio_iommufd_physical_detach_ioas,
 	.pasid_attach_ioas	= vfio_iommufd_physical_pasid_attach_ioas,
 	.pasid_detach_ioas	= vfio_iommufd_physical_pasid_detach_ioas,
+	.get_unmapped_area	= vfio_pci_core_get_unmapped_area,
 };
 
 static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index 6328c3a05bcd..5392bec4929a 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1641,6 +1641,40 @@ static unsigned long vma_to_pfn(struct vm_area_struct *vma)
 	return (pci_resource_start(vdev->pdev, index) >> PAGE_SHIFT) + pgoff;
 }
 
+/*
+ * Hint function to provide mmap() virtual address candidate so as to be
+ * able to map huge pfnmaps as much as possible.  It is done by aligning
+ * the VA to the PFN to be mapped in the specific bar.
+ *
+ * Note that this function does the minimum check on mmap() parameters to
+ * make the PFN calculation valid only. The majority of mmap() sanity check
+ * will be done later in mmap().
+ */
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+		struct file *file, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags)
+{
+	struct vfio_pci_core_device *vdev =
+		container_of(device, struct vfio_pci_core_device, vdev);
+	struct pci_dev *pdev = vdev->pdev;
+	unsigned int index = pgoff >> (VFIO_PCI_OFFSET_SHIFT - PAGE_SHIFT);
+	unsigned long req_start;
+
+	/* Currently, only bars 0-5 support huge pfnmap */
+	if (index >= VFIO_PCI_ROM_REGION_INDEX)
+		return 0;
+
+	/* Calculate the start of physical address to be mapped */
+	req_start = (pgoff << PAGE_SHIFT) & ((1UL << VFIO_PCI_OFFSET_SHIFT) - 1);
+	if (check_add_overflow(req_start, pci_resource_start(pdev, index),
+			       &req_start))
+		return 0;
+
+	return huge_mapping_get_va_aligned(file, addr, len, req_start >> PAGE_SHIFT,
+					   flags);
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_get_unmapped_area);
+
 static vm_fault_t vfio_pci_mmap_huge_fault(struct vm_fault *vmf,
 					   unsigned int order)
 {
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index fbb472dd99b3..d97c920b4dbf 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -119,6 +119,9 @@ ssize_t vfio_pci_core_read(struct vfio_device *core_vdev, char __user *buf,
 		size_t count, loff_t *ppos);
 ssize_t vfio_pci_core_write(struct vfio_device *core_vdev, const char __user *buf,
 		size_t count, loff_t *ppos);
+unsigned long vfio_pci_core_get_unmapped_area(struct vfio_device *device,
+		struct file *file, unsigned long addr, unsigned long len,
+		unsigned long pgoff, unsigned long flags);
 int vfio_pci_core_mmap(struct vfio_device *core_vdev, struct vm_area_struct *vma);
 void vfio_pci_core_request(struct vfio_device *core_vdev, unsigned int count);
 int vfio_pci_core_match(struct vfio_device *core_vdev, char *buf);
-- 
2.49.0
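
For illustration only (not part of the patch): the pgoff decoding done in
vfio_pci_core_get_unmapped_area() above can be tried out in userspace.
The sketch assumes 4 KiB pages and a region offset shift of 40, which is
meant to mirror VFIO_PCI_OFFSET_SHIFT; the function names and the BAR
address used in any call are made up.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT   12	/* assumed 4 KiB pages */
#define DEMO_OFFSET_SHIFT 40	/* assumed to mirror VFIO_PCI_OFFSET_SHIFT */

/* The region (BAR) index lives in the high bits of the mmap() page offset. */
static unsigned int demo_region_index(uint64_t pgoff)
{
	return (unsigned int)(pgoff >> (DEMO_OFFSET_SHIFT - DEMO_PAGE_SHIFT));
}

/*
 * Compute the physical address the mapping would start at: the offset
 * within the region plus the BAR's physical start.
 * Returns 0 on success, -1 if the addition would overflow.
 */
static int demo_phys_start(uint64_t pgoff, uint64_t bar_start, uint64_t *out)
{
	uint64_t in_region = (pgoff << DEMO_PAGE_SHIFT) &
			     ((UINT64_C(1) << DEMO_OFFSET_SHIFT) - 1);

	assert(out != 0);

	if (in_region > UINT64_MAX - bar_start)
		return -1;
	*out = bar_start + in_region;
	return 0;
}
```

The resulting physical address, shifted right by the page shift, is what
gets handed to the alignment helper as the page offset, so that the VA ends
up aligned with the PFN rather than with the raw file offset.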


-- 
Peter Xu


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range()
  2025-06-14  4:11   ` Liam R. Howlett
@ 2025-06-17 21:07     ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-17 21:07 UTC (permalink / raw)
  To: Liam R. Howlett, linux-kernel, linux-mm, kvm, Andrew Morton,
	Alex Williamson, Zi Yan, Jason Gunthorpe, Alex Mastro,
	David Hildenbrand, Nico Pache, Huacai Chen, Thomas Bogendoerfer,
	Muchun Song, Oscar Salvador, loongarch, linux-mips

On Sat, Jun 14, 2025 at 12:11:22AM -0400, Liam R. Howlett wrote:
> * Peter Xu <peterx@redhat.com> [691231 23:00]:
> > Only mips and loongarch implemented this API, however what it does was
> > checking against stack overflow for either len or addr.  That's already
> > done in arch's arch_get_unmapped_area*() functions, hence not needed.
> 
> I'm not as confident..
> 
> > 
> > It means the whole API is pretty much obsolete at least now, remove it
> > completely.
> > 
> > Cc: Huacai Chen <chenhuacai@kernel.org>
> > Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> > Cc: Muchun Song <muchun.song@linux.dev>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Cc: loongarch@lists.linux.dev
> > Cc: linux-mips@vger.kernel.org
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  arch/loongarch/include/asm/hugetlb.h | 14 --------------
> >  arch/mips/include/asm/hugetlb.h      | 14 --------------
> >  fs/hugetlbfs/inode.c                 |  8 ++------
> >  include/asm-generic/hugetlb.h        |  8 --------
> >  include/linux/hugetlb.h              |  6 ------
> >  5 files changed, 2 insertions(+), 48 deletions(-)
> > 
> > diff --git a/arch/loongarch/include/asm/hugetlb.h b/arch/loongarch/include/asm/hugetlb.h
> > index 4dc4b3e04225..ab68b594f889 100644
> > --- a/arch/loongarch/include/asm/hugetlb.h
> > +++ b/arch/loongarch/include/asm/hugetlb.h
> > @@ -10,20 +10,6 @@
> >  
> >  uint64_t pmd_to_entrylo(unsigned long pmd_val);
> >  
> > -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> > -static inline int prepare_hugepage_range(struct file *file,
> > -					 unsigned long addr,
> > -					 unsigned long len)
> > -{
> > -	unsigned long task_size = STACK_TOP;
> > -
> > -	if (len > task_size)
> > -		return -ENOMEM;
> > -	if (task_size - len < addr)
> > -		return -EINVAL;
> > -	return 0;
> > -}
> > -
> >  #define __HAVE_ARCH_HUGE_PTE_CLEAR
> >  static inline void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
> >  				  pte_t *ptep, unsigned long sz)
> > diff --git a/arch/mips/include/asm/hugetlb.h b/arch/mips/include/asm/hugetlb.h
> > index fbc71ddcf0f6..8c460ce01ffe 100644
> > --- a/arch/mips/include/asm/hugetlb.h
> > +++ b/arch/mips/include/asm/hugetlb.h
> > @@ -11,20 +11,6 @@
> >  
> >  #include <asm/page.h>
> >  
> > -#define __HAVE_ARCH_PREPARE_HUGEPAGE_RANGE
> > -static inline int prepare_hugepage_range(struct file *file,
> > -					 unsigned long addr,
> > -					 unsigned long len)
> > -{
> > -	unsigned long task_size = STACK_TOP;
> 
> arch/mips/include/asm/processor.h:#define STACK_TOP             mips_stack_top()
> 
> 
> unsigned long mips_stack_top(void)
> {
>         unsigned long top = TASK_SIZE & PAGE_MASK;
>
>         if (IS_ENABLED(CONFIG_MIPS_FP_SUPPORT)) {
>                 /* One page for branch delay slot "emulation" */
>                 top -= PAGE_SIZE;
>         }
>
>         /* Space for the VDSO, data page & GIC user page */
>         top -= PAGE_ALIGN(current->thread.abi->vdso->size);
>         top -= PAGE_SIZE;
>         top -= mips_gic_present() ? PAGE_SIZE : 0;
>
>         /* Space for cache colour alignment */
>         if (cpu_has_dc_aliases)
>                 top -= shm_align_mask + 1;
>
>         /* Space to randomize the VDSO base */
>         if (current->flags & PF_RANDOMIZE)
>                 top -= VDSO_RANDOMIZE_SIZE;
>
>         return top;
> }
> 
> This seems different than TASK_SIZE.
> 
> Code is from:
> commit ea7e0480a4b695d0aa6b3fa99bd658a003122113
> Author: Paul Burton <paulburton@kernel.org>
> Date:   Tue Sep 25 15:51:26 2018 -0700
> 
> 
> > -	if (len > task_size)
> > -		return -ENOMEM;
> > -	if (task_size - len < addr)
> > -		return -EINVAL;
> > -	return 0;
> > -}
> > -
> 
> Unfortunately, the commit message for the addition of this code is not
> helpful.
> 
> commit 50a41ff292fafe1e937102be23464b54fed8b78c
> Author: David Daney <ddaney@caviumnetworks.com>
> Date:   Wed May 27 17:47:42 2009 -0700
> 
> ... But the dates are helpful.  This code used to use:
> #define STACK_TOP      ((TASK_SIZE & PAGE_MASK) - PAGE_SIZE)
> 
> It's not exactly task size either.
> 
> I don't think this is an issue to remove this check because the overflow
> should be caught later (or trigger the opposite search).  But it's not
> clear why STACK_TOP was done in the first place.. Maybe just because we
> know the overflow here would be an issue later, but then we'd avoid the
> opposite search - and maybe that's the point?
> 
> Either way, your comment about the same check existing doesn't seem
> correct.

I will fix up the commit message to mention both archs:

  Only mips and loongarch implemented this API, and all it did was check
  len and addr against overflowing the stack top.  The arch's
  arch_get_unmapped_area*() functions already perform such checks, even
  though they may not be 100% identical.

  For example, on both architectures there is a trivial difference in how
  the stack top is defined.  The old code uses STACK_TOP, which may be
  slightly smaller than TASK_SIZE on either of them, but the hope is that
  this shouldn't be a problem.

  It means the whole API is pretty much obsolete by now; remove it
  completely.

> 
> I haven't checked loong arch, but I'd be willing to wager this was just
> cloned mips code... because this happens so much.

They define STACK_TOP differently, but AFAIU there is some duplication in
the patterns of the two archs.

Please let me know if the fixed commit message works for you above, thanks.
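
To illustrate the difference discussed above, here is a minimal userspace
sketch of the removed check.  It is not kernel code:
model_prepare_hugepage_range() and its stack_top parameter are hypothetical
stand-ins (the real helper reads STACK_TOP directly), and the limits used in
the note below are made-up numbers.

```c
#include <errno.h>

/*
 * Userspace model of the prepare_hugepage_range() check being removed.
 * 'stack_top' is a hypothetical parameter standing in for the arch's
 * STACK_TOP, so that different limits can be compared side by side.
 */
static int model_prepare_hugepage_range(unsigned long stack_top,
					unsigned long addr,
					unsigned long len)
{
	/* The request is larger than the whole usable range. */
	if (len > stack_top)
		return -ENOMEM;
	/* The range [addr, addr + len) would end above the limit. */
	if (stack_top - len < addr)
		return -EINVAL;
	return 0;
}
```

With 0x80000000 standing in for TASK_SIZE and 0x7ffff000 for a STACK_TOP one
page lower, a 2M range starting at 0x7fe00000 passes the former limit and
gets -EINVAL from the latter.  That is the kind of trivial difference the
commit message refers to.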

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area()
  2025-06-16  8:01   ` David Laight
@ 2025-06-17 21:13     ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-17 21:13 UTC (permalink / raw)
  To: David Laight
  Cc: linux-kernel, linux-mm, kvm, Andrew Morton, Alex Williamson,
	Zi Yan, Jason Gunthorpe, Alex Mastro, David Hildenbrand,
	Nico Pache

On Mon, Jun 16, 2025 at 09:01:34AM +0100, David Laight wrote:
> On Fri, 13 Jun 2025 09:41:07 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > Essentially it sets vm_flags==0 for mm_get_unmapped_area_vmflags().  Use
> > the helper instead to dedup the lines.
> 
> Would it make more sense to make it an inline wrapper?
> Moving the EXPORT_SYMBOL to mm_get_unmapped_area_vmflags.

Yes, makes sense to me. However that seems to be better justified as a
separate patch.

If you wouldn't mind, I hope we can land the minimum version of the series
first without expanding too much of what it touches.  I'm already starting
to regret having the first two patches, but since I've posted them, I'll
carry them for now.  Please let me know if you have strong feelings.

Thanks a lot for taking a look,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 20:01               ` Peter Xu
@ 2025-06-17 23:00                 ` Jason Gunthorpe
  2025-06-17 23:26                   ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 23:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 04:01:11PM -0400, Peter Xu wrote:

> > So what is VFIO doing that requires CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?
> 
> It's the fallback part for vfio device, not vfio_pci device.  vfio_pci
> device doesn't need this special treatment after moving to the new helper
> because that hides everything.  vfio_device still needs it.
> 
> So, we have two ops that need to be touched to support this:
> 
>         vfio_device_fops
>         vfio_pci_ops 
> 
> For the 1st one's vfio_device_fops.get_unmapped_area(), it'll need its own
> fallback which must be mm_get_unmapped_area() to keep the old behavior, and
> that was defined only if CONFIG_MMU.

OK, CONFIG_MMU makes a little bit of sense

> IOW, if one day file_operations.get_unmapped_area() would allow some other
> retval to be able to fallback to the default (mm_get_unmapped_area()), then
> we don't need this special ifdef.  But now it's not ready for that..

That can't be fixed with a config, the logic in vfio_device_fops has
to be 

if (!device->ops->get_unmapped_area)
   return .. do_default thing..

return device->ops->get_unmapped_area()

Has nothing to do with CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP, there are
more device->ops than just PCI.

If you do the API with an align/order argument then the default
behavior should happen when passing PAGE_SIZE.
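
As a rough sketch of the dispatch described above (hook if present,
default otherwise), the following userspace model may help.  All names
here (struct model_device, model_default_area(), model_huge_hook() and so
on) are hypothetical stand-ins and not the VFIO structures; the default
path simply echoes the hint where the kernel would call its regular
allocator.

```c
#include <stddef.h>

#define MODEL_HUGE_SIZE 0x200000UL	/* 2 MiB, picked for illustration */

struct model_device;

/* Hypothetical ops table; a NULL hook means "use the default". */
struct model_device_ops {
	unsigned long (*get_unmapped_area)(struct model_device *dev,
					   unsigned long addr,
					   unsigned long len,
					   unsigned long pgoff,
					   unsigned long flags);
};

struct model_device {
	const struct model_device_ops *ops;
};

/* Stand-in for the default path; the model just echoes the hint. */
static unsigned long model_default_area(unsigned long addr, unsigned long len,
					unsigned long pgoff,
					unsigned long flags)
{
	(void)len; (void)pgoff; (void)flags;
	return addr;
}

/* Example driver hook: round the hint up to MODEL_HUGE_SIZE. */
static unsigned long model_huge_hook(struct model_device *dev,
				     unsigned long addr, unsigned long len,
				     unsigned long pgoff, unsigned long flags)
{
	(void)dev; (void)len; (void)pgoff; (void)flags;
	return (addr + MODEL_HUGE_SIZE - 1) & ~(MODEL_HUGE_SIZE - 1);
}

/* The fops-level entry point: no config option involved, only a NULL test. */
static unsigned long model_get_unmapped_area(struct model_device *dev,
					     unsigned long addr,
					     unsigned long len,
					     unsigned long pgoff,
					     unsigned long flags)
{
	if (!dev->ops->get_unmapped_area)
		return model_default_area(addr, len, pgoff, flags);

	return dev->ops->get_unmapped_area(dev, addr, len, pgoff, flags);
}

/* Two example devices: one without the hook, one with it. */
static const struct model_device_ops model_plain_ops = {
	.get_unmapped_area = NULL,
};
static const struct model_device_ops model_huge_ops = {
	.get_unmapped_area = model_huge_hook,
};
static struct model_device model_plain_dev = { .ops = &model_plain_ops };
static struct model_device model_huge_dev = { .ops = &model_huge_ops };
```

A device whose ops leave the hook unset takes the default path unchanged,
which is the property that keeps the old behavior for non-PCI drivers.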

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-17 20:56                 ` Peter Xu
@ 2025-06-17 23:18                   ` Jason Gunthorpe
  2025-06-17 23:36                     ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 23:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > 
> > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > constant "align", so that core mm would try to allocate from the largest
> > > size to smaller until it finds some working VA to use?
> > 
> > I don't think you need a bitmask.
> > 
> > Split the concerns, the caller knows what is inside its FD. It only
> > needs to provide the highest pgoff aligned folio/pfn within the FD.
> 
> Ultimately I even dropped this hint.  I found that it's not really
> get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> So I decided to avoid this parameter as of now.

Well, the point of the pgoff is only what you said earlier, to adjust
the starting alignment so the pgoff aligned high order folios/pfns
line up properly.

> > The mm knows what leaf page tables options exist. It should try to
> > align to the closest leaf page table size that is <= the FD's max
> > aligned folio.
> 
> So again IMHO this is also not per-FD information, but needs to be passed
> over from the driver for each call.

It is per-FD in the sense that each FD is unique and each range of
pgoff could have a unique maximum.
 
> Likely the "order" parameter appeared in other discussions to imply a
> maximum supported size from the driver side (or, for a folio, but that is
> definitely another user after this series can land).

Yes, it is the only information the driver can actually provide and
comes directly from what it will install in the VMA.

> So far I didn't yet add the "order", because currently VFIO definitely
> supports all max orders the system supports.  Maybe we can add the order
> when there's a real need, but maybe it won't happen in the near
> future?

The purpose of the order is to prevent over alignment and waste of
VMA. Your technique to use the length to limit alignment instead is
good enough for VFIO but not very general.

The VFIO part looks pretty good, I still don't really understand why
you'd have CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP though. The inline
fallback you have for it seems good enough and we don't care if things
are overaligned for ioremap.

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook
  2025-06-17 23:00                 ` Jason Gunthorpe
@ 2025-06-17 23:26                   ` Peter Xu
  0 siblings, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-17 23:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: kernel test robot, linux-kernel, linux-mm, kvm, oe-kbuild-all,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 08:00:30PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 04:01:11PM -0400, Peter Xu wrote:
> 
> > > So what is VFIO doing that requires CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP?
> > 
> > It's the fallback part for vfio device, not vfio_pci device.  vfio_pci
> > device doesn't need this special treatment after moving to the new helper
> > because that hides everything.  vfio_device still needs it.
> > 
> > So, we have two ops that need to be touched to support this:
> > 
> >         vfio_device_fops
> >         vfio_pci_ops 
> > 
> > For the 1st one's vfio_device_fops.get_unmapped_area(), it'll need its own
> > fallback which must be mm_get_unmapped_area() to keep the old behavior, and
> > that was defined only if CONFIG_MMU.
> 
> OK, CONFIG_MMU makes a little bit of sense
> 
> > IOW, if one day file_operations.get_unmapped_area() would allow some other
> > retval to be able to fallback to the default (mm_get_unmapped_area()), then
> > we don't need this special ifdef.  But now it's not ready for that..
> 
> That can't be fixed with a config, the logic in vfio_device_fops has
> to be 
> 
> if (!device->ops->get_unmapped_area)
>    return .. do_default thing..
> 
> return device->ops->get_unmapped_area()
> 
> Has nothing to do with CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP, there are
> more device->ops than just PCI.

IMHO CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP doesn't imply anything PCI specific
either, it only says an arch supports PFNMAP at sizes larger than PAGE_SIZE.
IIUC it doesn't necessarily need to be PCI.

So here in this case, get_unmapped_area() will only be customized if the
kernel is compiled with any possible huge mapping on pfnmaps.  Otherwise
the customized hook isn't needed.

> 
> If you do the API with an align/order argument then the default
> behavior should happen when passing PAGE_SIZE.

This should indeed also work.

I'll wait for comments in the other threads.  So far I haven't added the
"order" parameter or anything like it.  If we would like to have the
parameter, I can use it here to avoid the ifdef with PAGE_SIZE / PAGE_SHIFT
/ .... when I repost.

That said, I don't think I understand at all the use of get_unmapped_area()
for the !MMU use case.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-17 23:18                   ` Jason Gunthorpe
@ 2025-06-17 23:36                     ` Peter Xu
  2025-06-18 16:56                       ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-17 23:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> > On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > > 
> > > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > > constant "align", so that core mm would try to allocate from the largest
> > > > size to smaller until it finds some working VA to use?
> > > 
> > > I don't think you need a bitmask.
> > > 
> > > Split the concerns, the caller knows what is inside its FD. It only
> > > needs to provide the highest pgoff aligned folio/pfn within the FD.
> > 
> > Ultimately I even dropped this hint.  I found that it's not really
> > get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> > So I decided to avoid this parameter as of now.
> 
> Well, the point of the pgoff is only what you said earlier, to adjust
> the starting alignment so the pgoff aligned high order folios/pfns
> line up properly.

I meant "highest pgoff" that I dropped.

We definitely need the pgoff to make it work.  So here I dropped "highest
pgoff" passed from the caller because I decided to leave such check to the
mmap() hook later.
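
As a rough illustration of why the pgoff is needed, the sketch below shows
the pad-and-shift arithmetic in plain userspace C.  It assumes the
allocator was asked for len plus align bytes and returned raw;
model_align_to_offset() is a hypothetical name, not a function from this
series, and the 2M alignment in the note below is only an example.

```c
/*
 * Given 'raw', the start of a free range of at least len + align bytes,
 * return the lowest address >= raw whose offset within an 'align'-sized
 * block equals that of the file offset 'off'.  'align' must be a power of
 * two.  The shift is always smaller than 'align', so [result, result + len)
 * still fits inside the padded range.
 */
static unsigned long model_align_to_offset(unsigned long raw,
					   unsigned long off,
					   unsigned long align)
{
	/* Unsigned wraparound makes this the distance to the next match. */
	unsigned long delta = (off - raw) & (align - 1);

	return raw + delta;
}
```

With align = 0x200000 and raw = 0x40123000, a file offset of 0x200000
yields 0x40200000, and an offset of 0x201000 yields 0x40201000: in both
cases the VA and the offset agree in their low 21 bits, so any huge-size
aligned chunk of the offset space starts at a huge-size aligned VA.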

> 
> > > The mm knows what leaf page tables options exist. It should try to
> > > align to the closest leaf page table size that is <= the FD's max
> > > aligned folio.
> > 
> > So again IMHO this is also not per-FD information, but needs to be passed
> > over from the driver for each call.
> 
> It is per-FD in the sense that each FD is unique and each range of
> pgoff could have a unique maximum.
>  
> > Likely the "order" parameter appeared in other discussions to imply a
> > maximum supported size from the driver side (or, for a folio, but that is
> > definitely another user after this series can land).
> 
> Yes, it is the only information the driver can actually provide and
> comes directly from what it will install in the VMA.
> 
> > So far I didn't yet add the "order", because currently VFIO definitely
> > supports all max orders the system supports.  Maybe we can add the order
> > when there's a real need, but maybe it won't happen in the near
> > future?
> 
> The purpose of the order is to prevent over alignment and waste of
> VMA. Your technique to use the length to limit alignment instead is
> good enough for VFIO but not very general.

Yes, that's also something I didn't like.  I think I'll just go ahead and
add the order parameter, then use it in the previous patch too.

I'll wait for some more time though for others' input before a respin.

Thanks,

> 
> The VFIO part looks pretty good, I still don't really understand why
> you'd have CONFIG_ARCH_SUPPORTS_HUGE_PFNMAP though. The inline
> fallback you have for it seems good enough and we don't care if things
> are overaligned for ioremap.
> 
> Jason
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-17 23:36                     ` Peter Xu
@ 2025-06-18 16:56                       ` Peter Xu
  2025-06-18 17:46                         ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-18 16:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 17, 2025 at 07:36:08PM -0400, Peter Xu wrote:
> On Tue, Jun 17, 2025 at 08:18:07PM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 17, 2025 at 04:56:13PM -0400, Peter Xu wrote:
> > > On Mon, Jun 16, 2025 at 08:00:11PM -0300, Jason Gunthorpe wrote:
> > > > On Mon, Jun 16, 2025 at 06:06:23PM -0400, Peter Xu wrote:
> > > > 
> > > > > Can I understand it as a suggestion to pass in a bitmask into the core mm
> > > > > API (e.g. keep the name of mm_get_unmapped_area_aligned()), instead of a
> > > > > constant "align", so that core mm would try to allocate from the largest
> > > > > size to smaller until it finds some working VA to use?
> > > > 
> > > > I don't think you need a bitmask.
> > > > 
> > > > Split the concerns, the caller knows what is inside its FD. It only
> > > > needs to provide the highest pgoff aligned folio/pfn within the FD.
> > > 
> > > Ultimately I even dropped this hint.  I found that it's not really
> > > get_unmapped_area()'s job to detect over-sized pgoffs.  It's mmap()'s job.
> > > So I decided to avoid this parameter as of now.
> > 
> > Well, the point of the pgoff is only what you said earlier, to adjust
> > the starting alignment so the pgoff aligned high order folios/pfns
> > line up properly.
> 
> I meant "highest pgoff" that I dropped.
> 
> We definitely need the pgoff to make it work.  So here I dropped "highest
> pgoff" passed from the caller because I decided to leave such check to the
> mmap() hook later.
> 
> > 
> > > > The mm knows what leaf page tables options exist. It should try to
> > > > align to the closest leaf page table size that is <= the FD's max
> > > > aligned folio.
> > > 
> > > So again IMHO this is also not per-FD information, but needs to be passed
> > > over from the driver for each call.
> > 
> > It is per-FD in the sense that each FD is unique and each range of
> > pgoff could have a unique maximum.
> >  
> > > Likely the "order" parameter appeared in other discussions to imply a
> > > maximum supported size from the driver side (or, for a folio, but that is
> > > definitely another user after this series can land).
> > 
> > Yes, it is the only information the driver can actually provide and
> > comes directly from what it will install in the VMA.
> > 
> > > So far I didn't yet add the "order", because currently VFIO definitely
> > > supports all max orders the system supports.  Maybe we can add the order
> > > when there's a real need, but maybe it won't happen in the near
> > > future?
> > 
> > The purpose of the order is to prevent over alignment and waste of
> > VMA. Your technique to use the length to limit alignment instead is
> > good enough for VFIO but not very general.
> 
> Yes that's also something I didn't like.  I think I'll just go ahead and
> add the order parameter, then use it in previous patch too.

So I changed my mind, slightly.  I can still have the "order" parameter to
make the API cleaner (even if it'll be pure overhead.. because all
existing callers will pass in PUD_SIZE as of now), but I think I'll still
stick with the ifdef in patch 4, as I mentioned here:

https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/

The problem is I just noticed yet again that exporting
huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
it'll need something like this to make !MMU compile for VFIO, while this is
definitely some ugliness I also want to avoid..

===8<===
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 59fdafb1034b..f40a8fb64eaa 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -548,7 +548,11 @@ static inline unsigned long
 huge_mapping_get_va_aligned(struct file *filp, unsigned long addr,
                unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+#ifdef CONFIG_MMU
        return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
+#else
+       return 0;
+#endif
 }

 static inline bool
===8<===

The issue is still mm_get_unmapped_area() is only exported on CONFIG_MMU,
so we need to special case that for huge_mapping_get_va_aligned(), and here
for !THP && !MMU.

Besides the ugliness, it's also about how to choose a default value to
return when mm_get_unmapped_area() isn't available.

I gave it a default value (0) as an example, but I don't even think that 0
makes sense.  If ever triggerable from any caller on !MMU, it would be
returned directly to __get_unmapped_area(), and then do_mmap() (of the !MMU
code, reached from ksys_mmap_pgoff() in nommu.c) would take that addr=0 as
the address to mmap.. that sounds wrong.

There's just no way to provide a sane default value for !MMU.

So going one step back: huge_mapping_get_va_aligned() (or whatever name we
prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
It should follow the same way we treat mm_get_unmapped_area().

Here it also goes back to the question on why !MMU even support mmap():

https://www.kernel.org/doc/Documentation/nommu-mmap.txt

So, for the case of the v4l driver (v4l2_m2m_get_unmapped_area, which I used
to quote; it is only defined for !MMU and I used to misread it..), for
example, it's really minimal mmap() support on ucLinux and that's all there
is to it.  My gut feeling is that the noMMU use case more or less abused the
current get_unmapped_area() hook to provide the physical addresses, so as to
make mmap() work even on ucLinux.

It's for sure not proof that we should have huge_mapping_get_va_aligned()
or mm_get_unmapped_area() available even for !MMU.  Those are all about VAs,
which do not exist in !MMU as a concept.

Thanks,

-- 
Peter Xu

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-18 16:56                       ` Peter Xu
@ 2025-06-18 17:46                         ` Jason Gunthorpe
  2025-06-18 19:15                           ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 17:46 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 18, 2025 at 12:56:01PM -0400, Peter Xu wrote:
> So I changed my mind, slightly.  I can still have the "order" parameter to
> make the API cleaner (even if it'll be a pure overhead.. because all
> existing caller will pass in PUD_SIZE as of now), 

That doesn't seem right, the callers should report the real value not
artificially cap it.. Like ARM does have page sizes greater than PUD
that might be interesting to enable someday for PFN users.

> but I think I'll still
> stick with the ifdef in patch 4, as I mentioned here:

> https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/
> 
> The problem is I just noticed yet again that exporting
> huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
> it'll need something like this to make !MMU compile for VFIO, while this is
> definitely some ugliness I also want to avoid..

IMHO this ugliness should certainly be contained to the mm code and not
leak into drivers.

> There's just no way to provide a sane default value for !MMU.

So all this mess seems to say that get_unmapped_area() is just the
wrong fop to have here. It can't be implemented sanely for !MMU and
has these weird conditions, like can't fail.

I again suggest to just simplify and add a new fop

size_t get_best_mapping_order(struct file *filp, pgoff_t pgoff,
                              size_t length);

Which will return the largest pgoff aligned order within pgoff/length
that the FD could try to install. Very simple for the driver
side. vfio pci will just return ilog2(bar_size).

PAGE_SHIFT can be a safe default.
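
One possible reading of the proposal, sketched as userspace C.  Everything
here is an assumption made for illustration: the model_* names, the byte
offset argument (instead of a pgoff), the 4K page size, and the leaf sizes
of 4K/2M/1G that the mm-side helper caps to.  The driver-side helper starts
from the largest order the FD could offer (e.g. ilog2(bar_size)) and walks
down until the offset is aligned and the length covers one block.

```c
#include <limits.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12U	/* assumed 4K base pages */

/*
 * Driver side: largest order <= max_order such that 'off' is aligned to
 * (1 << order) and 'len' covers at least one such block.  Never returns
 * less than MODEL_PAGE_SHIFT, the safe default.
 */
static unsigned int model_best_order(unsigned long off, unsigned long len,
				     unsigned int max_order)
{
	unsigned int bits = sizeof(unsigned long) * CHAR_BIT;
	unsigned int order = max_order;

	if (order >= bits)
		order = bits - 1;	/* keep the shifts below well defined */

	while (order > MODEL_PAGE_SHIFT) {
		unsigned long size = 1UL << order;

		if (!(off & (size - 1)) && len >= size)
			break;
		order--;
	}

	return order < MODEL_PAGE_SHIFT ? MODEL_PAGE_SHIFT : order;
}

/*
 * mm side: cap the reported order to the closest leaf page table size
 * that is <= the order.  The leaf sizes listed are an assumption.
 */
static unsigned int model_cap_to_leaf(unsigned int order)
{
	static const unsigned int leaf_orders[] = { 30, 21, MODEL_PAGE_SHIFT };
	size_t i;

	for (i = 0; i < sizeof(leaf_orders) / sizeof(leaf_orders[0]); i++) {
		if (leaf_orders[i] <= order)
			return leaf_orders[i];
	}

	return MODEL_PAGE_SHIFT;
}
```

For a 16M BAR mapped from offset 0 the driver side reports order 24 and
the mm side caps it to 21, i.e. a 2M alignment.  A single page at offset
0x1000 stays at order 12.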

Then put all this maze of conditionals in the mm side replacing the
call to fops->get_unmapped_area() and don't export anything new. The
mm will automatically cap the alignment based on what the architecture
can do and what 

!MMU would simply entirely ignore this new stuff.

> So going one step back: huge_mapping_get_va_aligned() (or whatever name we
> prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
> It should follow the same way we treat mm_get_unmapped_area().

We just deleted !SMP, I really wonder if it is time for !MMU to go
away too..

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-18 17:46                         ` Jason Gunthorpe
@ 2025-06-18 19:15                           ` Peter Xu
  2025-06-19 13:58                             ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-18 19:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 18, 2025 at 02:46:41PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 12:56:01PM -0400, Peter Xu wrote:
> > So I changed my mind, slightly.  I can still have the "order" parameter to
> > make the API cleaner (even if it'll be a pure overhead.. because all
> > existing caller will pass in PUD_SIZE as of now), 
> 
> That doesn't seem right, the callers should report the real value not
> artificially cap it.. Like ARM does have page sizes greater than PUD
> that might be interesting to enable someday for PFN users.

It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
its huge_fault().

> 
> > but I think I'll still
> > stick with the ifdef in patch 4, as I mentioned here:
> 
> > https://lore.kernel.org/all/aFGMG3763eSv9l8b@x1.local/
> > 
> > The problem is I just noticed yet again that exporting
> > huge_mapping_get_va_aligned() for all configs doesn't make sense.  At least
> > it'll need something like this to make !MMU compile for VFIO, while this is
> > definitely some ugliness I also want to avoid..
> 
> IMHO this ugliness should certainly be contained to the mm code and not
> leak into drivers.
> 
> > There's just no way to provide a sane default value for !MMU.
> 
> So all this mess seems to say that get_unmapped_area() is just the
> wrong fop to have here. It can't be implemented sanely for !MMU and
> has these weird conditions, like can't fail.
> 
> I again suggest to just simplify and add a new fop
> 
> size_t get_best_mapping_order(struct file *filp, pgoff_t pgoff,
>                               size_t length);
> 
> Which will return the largest pgoff aligned order within pgoff/length
> that the FD could try to install. Very simple for the driver
> side. vfio pci will just return ilog2(bar_size).
> 
> PAGE_SHIFT can be a safe default.

I agree this is a better way.  We can make the default PAGE_SHIFT or
just 0, because it doesn't sound necessary to me to support anything
smaller than PAGE_SIZE.. maybe an "int" retval would suffice to also cover
errors.

So this will introduce a new file operation that will only be used so far
in VFIO, playing a similar role until we start to convert many
get_unmapped_area() users to this one.

> 
> Then put all this maze of conditionals in the mm side replacing the
> call to fops->get_unmapped_area() and don't export anything new. The
> mm will automatically cap the alignment based on what the architecture
> can do and what 
> 
> !MMU would simply entirely ignore this new stuff.

For the long term, we should move all get_unmapped_area() users to the new
API.  For old !MMU users, we should rename get_unmapped_area() to something
better, like get_mmap_addr().  For those cases it's really not about
looking for something not mapped, but normally exactly what is requested.

> 
> > So going one step back: huge_mapping_get_va_aligned() (or whatever name we
> > prefer) doesn't make sense to be exported always, but only when CONFIG_MMU.
> > It should follow the same way we treat mm_get_unmapped_area().
> 
> We just deleted !SMP, I really wonder if it is time for !MMU to go
> away too..

Yes, if this comes earlier, we can completely drop get_unmapped_area()
after all existing MMU users are converted to the new one.

Any early objections / concerns / comments from anyone else, before I go
and introduce it?

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-18 19:15                           ` Peter Xu
@ 2025-06-19 13:58                             ` Jason Gunthorpe
  2025-06-19 14:55                               ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-19 13:58 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > existing caller will pass in PUD_SIZE as of now), 
> > 
> > That doesn't seem right, the callers should report the real value not
> > artifically cap it.. Like ARM does have page sizes greater than PUD
> > that might be interesting to enable someday for PFN users.
> 
> It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> its huge_fault().

Hm, OK that does make sense. I would add a small comment though as it
is not so intuitive and may not apply to something using ioremap..

> So this will introduce a new file operation that will only be used so far
> in VFIO, playing similar role until we start to convert many
> get_unmapped_area() to this one.

Yes, if someone wants to do a project here you can mark up
memfds/shmem/hugetlbfs/etc to define their internal folio orders
and hopefully ultimately remove some of that alignment logic from the
arch code.

Jason


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-19 13:58                             ` Jason Gunthorpe
@ 2025-06-19 14:55                               ` Peter Xu
  2025-06-19 18:40                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-19 14:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Thu, Jun 19, 2025 at 10:58:52AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > > existing caller will pass in PUD_SIZE as of now), 
> > > 
> > > That doesn't seem right, the callers should report the real value not
> > > artifically cap it.. Like ARM does have page sizes greater than PUD
> > > that might be interesting to enable someday for PFN users.
> > 
> > It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> > its huge_fault().
> 
> Hm, OK that does make sense. I would add a small comment though as it
> is not so intuitive and may not apply to something using ioremap..

Sure, I'll remember to add a comment if I go back to the old
interface.  I hope that won't happen..

> 
> > So this will introduce a new file operation that will only be used so far
> > in VFIO, playing similar role until we start to convert many
> > get_unmapped_area() to this one.
> 
> Yes, if someone wants to do a project here you can markup
> memfds/shmem/hugetlbfs/etc/etc to define their internal folio orders
> and hopefully ultimately remove some of that alignment logic from the
> arch code.

I'm a bit reluctant to touch all of the files just for this, but I can
definitely add a very verbose explanation to the commit log when I
introduce the new API, covering not only its relationship to the old
APIs but also possible future work.

Besides the get_unmapped_area() -> NEW API conversions, which are arch
independent in most cases, it would indeed be great to reduce the per-arch
alignment requirements as much as possible.  At least that should apply to
hugetlbfs, which shouldn't be arch-dependent.  I am not sure about the
rest, though.  For example, I see archs may treat PF_RANDOMIZE differently.
There might be a lot of trivial details to look at.

OTOH, one other thought (which may not need to touch all archs) is that it
does look confusing to have two layers of alignment operations, which is at
least the case for THP right now.  So it might be good to at least punch it
through to use vm_unmapped_area_info.align_mask / etc. if possible, to
avoid double-padding: after all, unmapped_area() also does alignment
padding.  It smells like something we overlooked when initially adding THP
support.

Thanks,

-- 
Peter Xu


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-19 14:55                               ` Peter Xu
@ 2025-06-19 18:40                                 ` Jason Gunthorpe
  2025-06-24 20:37                                   ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-19 18:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Thu, Jun 19, 2025 at 10:55:02AM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 10:58:52AM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 18, 2025 at 03:15:50PM -0400, Peter Xu wrote:
> > > > > So I changed my mind, slightly.  I can still have the "order" parameter to
> > > > > make the API cleaner (even if it'll be a pure overhead.. because all
> > > > > existing caller will pass in PUD_SIZE as of now), 
> > > > 
> > > > That doesn't seem right, the callers should report the real value not
> > > > artifically cap it.. Like ARM does have page sizes greater than PUD
> > > > that might be interesting to enable someday for PFN users.
> > > 
> > > It needs to pass in PUD_SIZE to match what vfio-pci currently supports in
> > > its huge_fault().
> > 
> > Hm, OK that does make sense. I would add a small comment though as it
> > is not so intuitive and may not apply to something using ioremap..
> 
> Sure, I'll remember to add some comment if I'll go back to the old
> interface.  I hope it won't happen..

Even with this new version you have to decide whether to return PUD_SIZE
or bar_size in pci, and your same reasoning that PUD_SIZE makes sense
applies (though I would probably return bar_size and just let the core
code cap it to PUD_SIZE).
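
A minimal sketch of that split, with the driver reporting its real size
and the core doing the capping.  All names are hypothetical, the 1G
PUD_SHIFT is an assumption (x86-64 with 4K pages), and sketch_ilog2() is a
userspace stand-in for the kernel's ilog2():

```c
#include <stdint.h>

#define SKETCH_PUD_SHIFT 30	/* assumption: 1G PUD mappings */

/* floor(log2(v)) for v > 0; stand-in for the kernel's ilog2(). */
static unsigned int sketch_ilog2(uint64_t v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Driver side: report the real BAR order, without any artificial cap. */
static unsigned int sketch_driver_order(uint64_t bar_size)
{
	return sketch_ilog2(bar_size);
}

/* Core side: cap to what the page tables can currently install. */
static unsigned int sketch_core_cap_order(unsigned int order)
{
	return order > SKETCH_PUD_SHIFT ? SKETCH_PUD_SHIFT : order;
}
```

An 8G BAR would report order 33 from the driver and be capped to 30 by the
core, so a later raise of the core limit would need no driver change.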

> I'm a bit refrained to touch all of the files just for this, but I can
> definitely add very verbose explanation into the commit log when I'll
> introduce the new API, on not only the relationship of that and the old
> APIs, also possible future works.

Yeah, I wouldn't grow this work any more. It does highlight there is
a lot of room to improve the arch interface though.

> OTOH, one other thought (which may not need to monitor all archs) is it
> does look confusing to have two layers of alignment operation, which is at
> least the case of THP right now.  So it might be good to at least punch it
> through to use vm_unmapped_area_info.align_mask / etc. if possible, to
> avoid double-padding: after all, unmapped_area() also did align paddings.
> It smells like something we overlooked when initially support THP.

I would not address that in this series, THP has been abusing this for
a long time, may as well keep it for now.

Either the arch code should return the info struct or the order should
be passed down to arch code. This would give enough information to the
maple tree algorithm to be able to do one operation.

Jason


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-19 18:40                                 ` Jason Gunthorpe
@ 2025-06-24 20:37                                   ` Peter Xu
  2025-06-24 20:51                                     ` Peter Xu
  2025-06-24 23:40                                     ` Jason Gunthorpe
  0 siblings, 2 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-24 20:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> Even with this new version you have to decide to return PUD_SIZE or
> bar_size in pci and your same reasoning that PUD_SIZE make sense
> applies (though I would probably return bar_size and just let the core
> code cap it to PUD_SIZE)

Yes.

Today I went back to look at this, I was trying to introduce this for
file_operations:

	int (*get_mapping_order)(struct file *, unsigned long, size_t);

It looks almost good, except that it so far has no way to return the
physical address for further calculation on the alignment.

For THP, the VA alignment is always calculated against pgoff, not the
physical address.  I think it's OK for THP, because every 2M THP folio will
be naturally 2M aligned on the physical address, so it fits when e.g.
pgoff=0 in the calculation of thp_get_unmapped_area_vmflags().

Logically it should also work for vfio-pci, as long as VFIO keeps using
the lower 40 bits of the device fd offset to represent the bar offset;
it also relies on the PCIe spec requiring PCI bars to be mapped aligned
to their bar sizes.

But from an API POV, get_mapping_order() logically should return something
for further calculation of the alignment to get the VA.  pgoff here may not
always be the right thing to align the VA against: after all, a pgtable
mapping is about VA -> PA, so the only reasonable and reliable way is to
align the VA to the PA to be mapped, and as an API we shouldn't assume
pgoff is always aligned with the PA address space.
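
To show what "align VA to the PA" means in practice, a userspace sketch
follows.  The function name is hypothetical, and it assumes the caller has
already reserved a free range that is one huge-page size longer than the
mapping, so that stepping forward by less than 2^order stays inside it.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the start of a free VA range ("base"),
 * return the first address at or above it whose low "order" bits match
 * those of the physical address.  The result is less than 2^order bytes
 * past base.  order is expected to be below 64.
 */
static uint64_t sketch_align_va_to_pa(uint64_t base, uint64_t pa,
				      unsigned int order)
{
	uint64_t mask = ((uint64_t)1 << order) - 1;

	/* (pa - base) modulo 2^order is the forward distance needed. */
	return base + ((pa - base) & mask);
}
```

When the PA itself is huge-size aligned this degenerates to a plain round
up of the VA, which is the case a pgoff-based calculation also handles.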

Any thoughts?

-- 
Peter Xu


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-24 20:37                                   ` Peter Xu
@ 2025-06-24 20:51                                     ` Peter Xu
  2025-06-24 23:40                                     ` Jason Gunthorpe
  1 sibling, 0 replies; 77+ messages in thread
From: Peter Xu @ 2025-06-24 20:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > Even with this new version you have to decide to return PUD_SIZE or
> > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > applies (though I would probably return bar_size and just let the core
> > code cap it to PUD_SIZE)
> 
> Yes.
> 
> Today I went back to look at this, I was trying to introduce this for
> file_operations:
> 
> 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> 
> It looks almost good, except that it so far has no way to return the
> physical address for further calculation on the alignment.
> 
> For THP, VA is always calculated against pgoff not physical address on the
> alignment.  I think it's OK for THP, because every 2M THP folio will be
> naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> in the calculation of thp_get_unmapped_area_vmflags().
> 
> Logically it should even also work for vfio-pci, as long as VFIO keeps
> using the lower 40 bits of the device_fd to represent the bar offset,
> meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> aligned with bar sizes.
> 
> But from an API POV, get_mapping_order() logically should return something
> for further calculation of the alignment to get the VA.  pgoff here may not
> always be the right thing to use to align to the VA: after all, pgtable
> mapping is about VA -> PA, the only reasonable and reliable way is to align
> VA to the PA to be mappped, and as an API we shouldn't assume pgoff is
> always aligned to PA address space.
> 
> Any thoughts?

I should have listed current viable next steps..  We have at least these
options:

(a) Ignore this issue, keep the get_mapping_order() interface like above,
    as long as it works for vfio-pci

    I don't like this option.  I prefer the API (if we're going to
    introduce one) to be applicable no matter how pgoff is mapped to
    PAs.  I don't like the API relying on a specific driver following a
    specific spec (in this case, PCI).

(b) I can make the new API like this instead:

    int (*get_mapping_order)(struct file *, unsigned long, unsigned long *, size_t);

    where I can return a *phys_pgoff alongside the order to map, which is
    returned in retval.  But that's not pretty, if not outright ugly.

(c) Go back to what I did with the current v1, addressing comments and keep
    using get_unmapped_area() until we know a better way.

I'll vote for (c), but I'm open to suggestions.

Thanks,

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-24 20:37                                   ` Peter Xu
  2025-06-24 20:51                                     ` Peter Xu
@ 2025-06-24 23:40                                     ` Jason Gunthorpe
  2025-06-25  0:48                                       ` Peter Xu
  1 sibling, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-24 23:40 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > Even with this new version you have to decide to return PUD_SIZE or
> > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > applies (though I would probably return bar_size and just let the core
> > code cap it to PUD_SIZE)
> 
> Yes.
> 
> Today I went back to look at this, I was trying to introduce this for
> file_operations:
> 
> 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> 
> It looks almost good, except that it so far has no way to return the
> physical address for further calculation on the alignment.
> 
> For THP, VA is always calculated against pgoff not physical address on the
> alignment.  I think it's OK for THP, because every 2M THP folio will be
> naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> in the calculation of thp_get_unmapped_area_vmflags().
> 
> Logically it should even also work for vfio-pci, as long as VFIO keeps
> using the lower 40 bits of the device_fd to represent the bar offset,
> meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> aligned with bar sizes.
> 
> But from an API POV, get_mapping_order() logically should return something
> for further calculation of the alignment to get the VA.  pgoff here may not
> always be the right thing to use to align to the VA: after all, pgtable
> mapping is about VA -> PA, the only reasonable and reliable way is to align
> VA to the PA to be mappped, and as an API we shouldn't assume pgoff is
> always aligned to PA address space.

My feeling, and the reason I used the phrase "pgoff aligned address",
is that the owner of the file should already ensure that for the large
PTEs/folios:
 pgoff % 2**order == 0
 physical % 2**order == 0

So, things like VFIO do need to hand out high alignment pgoffs to make
this work - which it already does.
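
Spelled out with explicit units, the two conditions could be checked as in
the sketch below.  The helper name and the 4K page shift are assumptions
for illustration; "order" is taken here as log2 of a size in bytes, and
pgoff is converted from pages to bytes before the comparison.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K base pages */

/*
 * Hypothetical check: a mapping of 2^order bytes is only usable if both
 * the file offset and the physical address are aligned to that size.
 * order is expected to be below 64.
 */
static bool sketch_order_usable(uint64_t pgoff, uint64_t phys,
				unsigned int order)
{
	uint64_t mask = ((uint64_t)1 << order) - 1;
	uint64_t offset = pgoff << SKETCH_PAGE_SHIFT;

	return (offset & mask) == 0 && (phys & mask) == 0;
}
```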

To me this just keeps things simpler. I guess if someone comes up with
a case where they really can't get a pgoff alignment and really need a
high order mapping then maybe we can add a new return field of some
kind (pgoff adjustment?) but that is so weird I'd leave it to the
future person to come and justify it.

Jason


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-24 23:40                                     ` Jason Gunthorpe
@ 2025-06-25  0:48                                       ` Peter Xu
  2025-06-25 13:07                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-25  0:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 24, 2025 at 08:40:32PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 04:37:26PM -0400, Peter Xu wrote:
> > On Thu, Jun 19, 2025 at 03:40:41PM -0300, Jason Gunthorpe wrote:
> > > Even with this new version you have to decide to return PUD_SIZE or
> > > bar_size in pci and your same reasoning that PUD_SIZE make sense
> > > applies (though I would probably return bar_size and just let the core
> > > code cap it to PUD_SIZE)
> > 
> > Yes.
> > 
> > Today I went back to look at this, I was trying to introduce this for
> > file_operations:
> > 
> > 	int (*get_mapping_order)(struct file *, unsigned long, size_t);
> > 
> > It looks almost good, except that it so far has no way to return the
> > physical address for further calculation on the alignment.
> > 
> > For THP, VA is always calculated against pgoff not physical address on the
> > alignment.  I think it's OK for THP, because every 2M THP folio will be
> > naturally 2M aligned on the physical address, so it fits when e.g. pgoff=0
> > in the calculation of thp_get_unmapped_area_vmflags().
> > 
> > Logically it should even also work for vfio-pci, as long as VFIO keeps
> > using the lower 40 bits of the device_fd to represent the bar offset,
> > meanwhile it'll also require PCIe spec asking the PCI bars to be mapped
> > aligned with bar sizes.
> > 
> > But from an API POV, get_mapping_order() logically should return something
> > for further calculation of the alignment to get the VA.  pgoff here may not
> > always be the right thing to use to align to the VA: after all, pgtable
> > mapping is about VA -> PA, the only reasonable and reliable way is to align
> > VA to the PA to be mappped, and as an API we shouldn't assume pgoff is
> > always aligned to PA address space.
> 
> My feeling, and the reason I used the phrase "pgoff aligned address",
> is that the owner of the file should already ensure that for the large
> PTEs/folios:
>  pgoff % 2**order == 0
>  physical % 2**order == 0

IMHO there shouldn't really be any hard requirement in mm that pgoff and
physical address space need to be aligned.. but I confess I don't have an
example driver in the linux tree that doesn't do that.

> 
> So, things like VFIO do need to hand out high alignment pgoffs to make
> this work - which it already does.
> 
> To me this just keeps thing simpler. I guess if someone comes up with
> a case where they really can't get a pgoff alignment and really need a
> high order mapping then maybe we can add a new return field of some
> kind (pgoff adjustment?) but that is so weird I'd leave it to the
> future person to come and justfiy it.

When looking more, I also found some special-cased get_unmapped_area()
implementations that may not be trivially converted into the new API even
for CONFIG_MMU, namely:

- io_uring_get_unmapped_area
- arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)

I'll need to take a closer look tomorrow.  If any of them cannot be 100%
safely converted to the new API, I'd also think we should not introduce the
new API, but reuse get_unmapped_area() until we know a way out.

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-25  0:48                                       ` Peter Xu
@ 2025-06-25 13:07                                         ` Jason Gunthorpe
  2025-06-25 17:12                                           ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-25 13:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Tue, Jun 24, 2025 at 08:48:45PM -0400, Peter Xu wrote:
> > My feeling, and the reason I used the phrase "pgoff aligned address",
> > is that the owner of the file should already ensure that for the large
> > PTEs/folios:
> >  pgoff % 2**order == 0
> >  physical % 2**order == 0
> 
> IMHO there shouldn't really be any hard requirement in mm that pgoff and
> physical address space need to be aligned.. but I confess I don't have an
> example driver that didn't do that in the linux tree.

Well, maybe, but right now there does seem to be for
THP/hugetlbfs/etc. It is a nice simple solution that exposes the
alignment requirements to userspace if it wants to use MAP_FIXED.

> > To me this just keeps thing simpler. I guess if someone comes up with
> > a case where they really can't get a pgoff alignment and really need a
> > high order mapping then maybe we can add a new return field of some
> > kind (pgoff adjustment?) but that is so weird I'd leave it to the
> > future person to come and justfiy it.
> 
> When looking more, I also found some special cased get_unmapped_area() that
> may not be trivially converted into the new API even for CONFIG_MMU, namely:
> 
> - io_uring_get_unmapped_area
> - arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)
> 
> I'll need to have some closer look tomorrow.  If any of them cannot be 100%
> safely converted to the new API, I'd also think we should not introduce the
> new API, but reuse get_unmapped_area() until we know a way out.

Oh yuk. It is trying to avoid the dcache flush on some kernel paths
for virtually tagged cache systems.

Arguably this fixup should not be in io_uring, but conveying the right
information to the core code, and requesting a special flush
avoidance mapping is not so easy.

But again I suspect the pgoff is the right solution.

IIRC this is handled by forcing a few low virtual address bits to
always match across all user mappings (the colour) via the pgoff. This
way the userspace always uses the same cache tag and doesn't become
cache incoherent. ie:

   user_addr % (PAGE_SIZE*N) == pgoff % (PAGE_SIZE*N)

The issue is now the kernel is using the direct map and we can't force
a random jumble of pages to have the right colours to match
userspace. So the kernel has all those dcache flushes sprinkled about
before it touches user mapped memory through the direct map as the
kernel will use a different colour and cache tag.

So.. if io_uring selects a pgoff that automatically gives the right
colour for the userspace mapping to also match the kernel mapping's
colour then things should just work.
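
A userspace sketch of that colouring rule, for illustration only.  The
helper names are made up, the 4K page shift is an assumption, and the 4M
alias size is assumed here to stand in for parisc's SHM_COLOUR; other
aliasing architectures would use a different value.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12			/* assumption: 4K pages */
#define SKETCH_ALIAS_SIZE ((uint64_t)4 << 20)	/* assumption: 4M colour span */

/* Two addresses share a colour when their low alias bits are equal. */
static bool sketch_same_colour(uint64_t a, uint64_t b)
{
	uint64_t mask = SKETCH_ALIAS_SIZE - 1;

	return (a & mask) == (b & mask);
}

/*
 * Round the hint up to an alias boundary, then add the colour implied
 * by pgoff, so the returned user address has the same colour as the
 * byte offset (pgoff << SKETCH_PAGE_SHIFT).
 */
static uint64_t sketch_colour_align(uint64_t hint, uint64_t pgoff)
{
	uint64_t mask = SKETCH_ALIAS_SIZE - 1;
	uint64_t base = (hint + mask) & ~mask;
	uint64_t off = (pgoff << SKETCH_PAGE_SHIFT) & mask;

	return base + off;
}
```

If pgoff is derived from the kernel-side address of the same memory (kva
>> page shift), the user address returned by sketch_colour_align() lands
on the same colour as the kernel mapping.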

Frankly I'm shocked that someone invested time in trying to make this
work - the commit log says it was for parisc and only 2 years ago :(

d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")

I thought such physically tagged cache systems were long ago dead and
buried..

Shouldn't this entirely reject MAP_FIXED too?

Jason


* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-25 13:07                                         ` Jason Gunthorpe
@ 2025-06-25 17:12                                           ` Peter Xu
  2025-06-25 18:41                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-25 17:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 25, 2025 at 10:07:11AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 24, 2025 at 08:48:45PM -0400, Peter Xu wrote:
> > > My feeling, and the reason I used the phrase "pgoff aligned address",
> > > is that the owner of the file should already ensure that for the large
> > > PTEs/folios:
> > >  pgoff % 2**order == 0
> > >  physical % 2**order == 0
> > 
> > IMHO there shouldn't really be any hard requirement in mm that pgoff and
> > physical address space need to be aligned.. but I confess I don't have an
> > example driver that didn't do that in the linux tree.
> 
> Well, maybe, but right now there does seem to be for
> THP/hugetlbfs/etc. It is a nice simple solution that exposes the
> alignment requirements to userspace if it wants to use MAP_FIXED.
> 
> > > To me this just keeps thing simpler. I guess if someone comes up with
> > > a case where they really can't get a pgoff alignment and really need a
> > > high order mapping then maybe we can add a new return field of some
> > > kind (pgoff adjustment?) but that is so weird I'd leave it to the
> > > future person to come and justfiy it.
> > 
> > When looking more, I also found some special cased get_unmapped_area() that
> > may not be trivially converted into the new API even for CONFIG_MMU, namely:
> > 
> > - io_uring_get_unmapped_area
> > - arena_get_unmapped_area (from bpf_map->ops->map_get_unmapped_area)
> > 
> > I'll need to have some closer look tomorrow.  If any of them cannot be 100%
> > safely converted to the new API, I'd also think we should not introduce the
> > new API, but reuse get_unmapped_area() until we know a way out.
> 
> Oh yuk. It is trying to avoid the dcache flush on some kernel paths
> for virtually tagged cache systems.
> 
> Arguably this fixup should not be in io_uring, but conveying the right
> information to the core code, and requesting a special flush
> avoidance mapping is not so easy.

IIUC it still makes sense to be with io_uring, because only io_uring
subsystem knows what to align against.  I don't yet understand how generic
mm can do this, after all generic mm doesn't know the address that io_uring
is using (from io_region_get_ptr()).

> 
> But again I suspect the pgoff is the right solution.
> 
> IIRC this is handled by forcing a few low virtual address bits to
> always match across all user mappings (the colour) via the pgoff. This
> way the userspace always uses the same cache tag and doesn't become
> cache incoherent. ie:
> 
>    user_addr % PAGE_SIZE*N == pgoff % PAGE_SIZE*N
> 
> The issue is now the kernel is using the direct map and we can't force

After I read the two use cases, I mostly agree.  Just one trivial thing to
mention, it may not be direct map but vmap() (see io_region_init_ptr()).

> a random jumble of pages to have the right colours to match
> userspace. So the kernel has all those dcache flushes sprinkled about
> before it touches user mapped memory through the direct map as the
> kernel will use a different colour and cache tag.
> 
> So.. if iouring selects a pgoff that automatically gives the right
> colour for the userspace mapping to also match the kernel mapping's
> colour then things should just work.
> 
> Frankly I'm shocked that someone invested time in trying to make this
> work - the commit log says it was for parisc and only 2 years ago :(
> 
> d808459b2e31 ("io_uring: Adjust mapping wrt architecture aliasing requirements")
> 
> I thought such physically tagged cache systems were long ago dead and
> buried..

Yeah.. internet says parisc stopped shipping in 2005.  Obviously there
are still people running io_uring on parisc systems, more or less.
This change seems to be required to make io_uring work on parisc or any
VIPT system.

> 
> Shouldn't this entirely reject MAP_FIXED too?

It already does; see io_uring_get_unmapped_area() (added for parisc):

	/*
	 * Do not allow to map to user-provided address to avoid breaking the
	 * aliasing rules. Userspace is not able to guess the offset address of
	 * kernel kmalloc()ed memory area.
	 */
	if (addr)
		return -EINVAL;

I don't know of anyone who would use MAP_FIXED with addr=0, so failing
addr!=0 should already stop almost all MAP_FIXED users.

Side topic, but... logically speaking this should really be fine when
!SHM_COLOUR.  This commit should break MAP_FIXED for everyone on io_uring,
but I guess nobody really uses MAP_FIXED for io_uring fds..

It's also utterly confusing to set addr=ptr for parisc: fundamentally addr
here is necessarily a kernel va, not a user va, so it'll (AFAIU) 100% fail
later with the STACK_SIZE checks..  IMHO we should really change this to:

diff --git a/io_uring/memmap.c b/io_uring/memmap.c
index 725dc0bec24c..1225a9393dc5 100644
--- a/io_uring/memmap.c
+++ b/io_uring/memmap.c
@@ -380,12 +380,10 @@ unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr,
         */
        filp = NULL;
        flags |= MAP_SHARED;
-       pgoff = 0;      /* has been translated to ptr above */
 #ifdef SHM_COLOUR
-       addr = (uintptr_t) ptr;
-       pgoff = addr >> PAGE_SHIFT;
+       pgoff = (uintptr_t)ptr >> PAGE_SHIFT;
 #else
-       addr = 0UL;
+       pgoff = 0;      /* has been translated to ptr above */
 #endif
        return mm_get_unmapped_area(current->mm, filp, addr, len, pgoff, flags);
 }

And avoid the confusing "addr=ptr" setup.  This might be too off-topic,
though.

Then I also looked at the other bpf arena use case, which doubles the len
when requesting VA and rounds the result up to 4G:

arena_get_unmapped_area():
	ret = mm_get_unmapped_area(current->mm, filp, addr, len * 2, 0, flags);
        ...
	return round_up(ret, SZ_4G);

AFAIU, this is buggy.. at least we should check that "round_up(ret, SZ_4G)"
still falls into the [ret, ret+2*len) region... or AFAIU we can return some
address that might be used by other VMAs already..
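
A sketch of such a bounds check, in userspace C with made-up helper names.
It only expresses the arithmetic of the condition; whether the code elided
by "..." above already excludes the failing case is not something this
sketch decides.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SZ_4G ((uint64_t)1 << 32)

/* Round v up to the next 4G boundary (v itself if already aligned). */
static uint64_t sketch_round_up_4g(uint64_t v)
{
	return (v + SKETCH_SZ_4G - 1) & ~(SKETCH_SZ_4G - 1);
}

/*
 * True if [round_up(ret, 4G), +len) still lies inside the reserved
 * range [ret, ret + 2*len).  Assumes ret + 2*len does not overflow.
 */
static bool sketch_rounded_fits(uint64_t ret, uint64_t len)
{
	uint64_t start = sketch_round_up_4g(ret);

	return start + len <= ret + 2 * len;
}
```

For len=1G, a reservation starting 256M below a 4G boundary passes the
check, while one starting 3.75G below the boundary does not.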

But in general that smells like a similar alignment issue, IIUC.  So might
be applicable for the new API.

Going back to the topic of this series - I think the new API would work for
io_uring and parisc too if I can return phys_pgoff, here what parisc would
need is:

#ifdef SHM_COLOUR
        *phys_pgoff = io_region_get_ptr(..) >> PAGE_SHIFT;
#else
        *phys_pgoff = pgoff;
#endif

Here *phys_pgoff (or a rename) would be required to fetch the kernel VA
offset (whether direct mapping or vmap()), to avoid the aliasing issue.

Should I go and introduce the API with *phys_pgoff returned together, then?
I'll still need to scratch my head on how to properly define it, but it
will at least also decouple the vfio use case from the spec dependency.

Thanks,

-- 
Peter Xu



* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-25 17:12                                           ` Peter Xu
@ 2025-06-25 18:41                                             ` Jason Gunthorpe
  2025-06-25 19:26                                               ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-25 18:41 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:

> After I read the two use cases, I mostly agree.  Just one trivial thing to
> mention, it may not be direct map but vmap() (see io_region_init_ptr()).

If it is vmapped then this is all silly, you should vmap and mmap
using the same cache colouring and, AFAIK, pgoff is how this works for
purely userspace.

Once vmap'd it should determine the cache colour and set the pgoff
properly, then everything should already work no?

> It already does, see (io_uring_get_unmapped_area(), of parisc):
> 
> 	/*
> 	 * Do not allow to map to user-provided address to avoid breaking the
> 	 * aliasing rules. Userspace is not able to guess the offset address of
> 	 * kernel kmalloc()ed memory area.
> 	 */
> 	if (addr)
> 		return -EINVAL;
> 
> I do not know whoever would use MAP_FIXED but with addr=0.  So failing
> addr!=0 should literally stop almost all MAP_FIXED already.

Maybe but also it is not right to not check MAP_FIXED directly.. And
addr is supposed to be a hint for non-fixed mode so it is weird to
-EINVAL when you can ignore the hint??
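
A small userspace sketch of that policy, purely as an illustration: reject
MAP_FIXED explicitly and otherwise drop the hint instead of failing.  The
helper name ex_filter_addr is hypothetical and this is not the io_uring
code:

```c
#include <errno.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: a fixed address cannot be honoured under the
 * aliasing rules, so refuse MAP_FIXED.  For non-fixed requests the
 * address is only a hint, so clear it and let the allocator choose.
 */
static int ex_filter_addr(unsigned long *addr, unsigned long flags)
{
	if (flags & MAP_FIXED)
		return -EINVAL;

	*addr = 0;
	return 0;
}
```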

> Going back to the topic of this series - I think the new API would work for
> io_uring and parisc too if I can return phys_pgoff, here what parisc would
> need is:

The best solution is to fix the selection of normal pgoff so it has
consistent colouring of user VMAs and kernel vmaps. Either compute a
pgoff that matches the vmap (hopefully easy if it is not uABI) or
teach the kernel vmap how to respect a "pgoff" to set the cache
colouring just like the user VMAs do (AFAICR).

But I think this is getting maybe too big and I'd just introduce the
new API and not try to convert this hard stuff. The above explanation
of how it could be fixed should be enough??

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-25 18:41                                             ` Jason Gunthorpe
@ 2025-06-25 19:26                                               ` Peter Xu
  2025-06-30 14:05                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-06-25 19:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> 
> > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> 
> If it is vmapped then this is all silly, you should vmap and mmap
> using the same cache colouring and, AFAIK, pgoff is how this works for
> purely userspace.
> 
> Once vmap'd it should determine the cache colour and set the pgoff
> properly, then everything should already work no?

I don't yet see how to set the pgoff.  Here pgoff is passed from the
userspace, which follows io_uring's definition (per io_uring_mmap).

For example, on parisc one could map the completion queue with
pgoff=IORING_OFF_CQ_RING (0x8000000), but then the VA alignment needs to be
adjusted to the address that vmap() returned for the completion queue's
io_mapped_region.ptr.

> 
> > It already does, see (io_uring_get_unmapped_area(), of parisc):
> > 
> > 	/*
> > 	 * Do not allow to map to user-provided address to avoid breaking the
> > 	 * aliasing rules. Userspace is not able to guess the offset address of
> > 	 * kernel kmalloc()ed memory area.
> > 	 */
> > 	if (addr)
> > 		return -EINVAL;
> > 
> > I do not know whoever would use MAP_FIXED but with addr=0.  So failing
> > addr!=0 should literally stop almost all MAP_FIXED already.
> 
> Maybe but also it is not right to not check MAP_FIXED directly.. And
> addr is supposed to be a hint for non-fixed mode so it is weird to
> -EINVAL when you can ignore the hint??

I agree on both points here.

> 
> > Going back to the topic of this series - I think the new API would work for
> > io_uring and parisc too if I can return phys_pgoff, here what parisc would
> > need is:
> 
> The best solution is to fix the selection of normal pgoff so it has
> consistent colouring of user VMAs and kernel vmaps. Either compute a
> pgoff that matches the vmap (hopefully easy if it is not uABI) or
> teach the kernel vmap how to respect a "pgoff" to set the cache
> colouring just like the user VMAs do (AFAICR).
> 
> But I think this is getting maybe too big and I'd just introduce the
> new API and not try to convert this hard stuff. The above explanation
> how it could be fixed should be enough??

I never planned to do it myself.  However if I'm going to sign off on and
propose an API, I want to be crystal clear about the goal of the API, and
the feasibility of that goal, even if I'm not going to work on it..

We don't want to introduce something, then find it won't work even for some
MMU use cases, and end up maintaining both, or reverting. I wish we
could have stuck with get_unmapped_area() for now and left the API
for later.

So if we want the new API to be proposed here, with VFIO as its first user
(while considering it applicable to at least all existing MMU users,
all of which I have now checked), I think this would be proper:

    int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);

The changes compared to the previous version:

    (1) merged the pgoff and *phys_pgoff parameters into a single unsigned
    long *pgoff, so the hook can adjust the pgoff that the VA allocator
    uses.  The adjustment will not be visible to the later mmap() when the
    VMA is created.

    (2) renamed it to mmap_va_hint(): because *pgoff can now be updated,
    it is no longer only about ordering, but provides both "order" and
    "pgoff adjustment" hints that the core mm will use when calculating
    the VA.
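
To make the intended flow concrete, here is a rough userspace mock of how
such a hook and its consumer could fit together.  Everything in it is an
assumption for illustration: struct ex_file stands in for struct file, the
hook returns the order as its return value, the pgoff adjustment shown
(adding a BAR's physical page offset) is just one possible policy, and
ex_pick_va is a toy stand-in for the core mm VA search:

```c
#include <stddef.h>

#define EX_PAGE_SHIFT	12

/* Stand-in for the driver state reachable from struct file. */
struct ex_file {
	unsigned long bar_phys_pgoff;	/* physical page offset of the BAR */
	unsigned int max_order;		/* largest mapping order, in pages */
};

/*
 * Mock hook: returns the order hint and may rewrite *pgoff so that the
 * allocator aligns the VA against the physical layout.
 */
static int ex_mmap_va_hint(struct ex_file *file, unsigned long *pgoff,
			   size_t len)
{
	if (len < ((size_t)1 << (file->max_order + EX_PAGE_SHIFT)))
		return 0;	/* too small to benefit, keep pgoff as is */

	*pgoff += file->bar_phys_pgoff;
	return (int)file->max_order;
}

/*
 * Mock consumer: lowest VA >= base whose offset inside a
 * (PAGE_SIZE << order) window matches that of pgoff.
 */
static unsigned long ex_pick_va(unsigned long base, unsigned long pgoff,
				int order)
{
	unsigned long align = 1UL << (order + EX_PAGE_SHIFT);
	unsigned long off = (pgoff << EX_PAGE_SHIFT) & (align - 1);
	unsigned long va = (base & ~(align - 1)) + off;

	if (va < base)
		va += align;
	return va;
}
```

The adjusted pgoff here lives only in a local variable of the caller, which
matches point (1): the value later stored in the VMA stays the one that
userspace passed in.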

Does it look ok to you?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-25 19:26                                               ` Peter Xu
@ 2025-06-30 14:05                                                 ` Jason Gunthorpe
  2025-07-02 20:58                                                   ` Peter Xu
  0 siblings, 1 reply; 77+ messages in thread
From: Jason Gunthorpe @ 2025-06-30 14:05 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jun 25, 2025 at 03:26:44PM -0400, Peter Xu wrote:
> On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> > 
> > > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> > 
> > If it is vmapped then this is all silly, you should vmap and mmap
> > using the same cache colouring and, AFAIK, pgoff is how this works for
> > purely userspace.
> > 
> > Once vmap'd it should determine the cache colour and set the pgoff
> > properly, then everything should already work no?
> 
> I don't yet see how to set the pgoff.  Here pgoff is passed from the
> userspace, which follows io_uring's definition (per io_uring_mmap).

That's too bad

So you have to do it the other way and pass the pgoff to the vmap so
the vmap ends up with the same colouring as a user VMA holding the
same pages..

> So if we want the new API to be proposed here, and make VFIO use it first
> (while consider it to be applicable to all existing MMU users at least,
> which I checked all of them so far now), I'd think this proper:
> 
>     int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);
> 
> The changes comparing to previous:
> 
>     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
>     the hook can adjust the pgoff for the va allocator to be used.  The
>     adjustment will not be visible to future mmap() when VMA is created.

It seems functional, but the above is better, IMHO.

>     (2) I renamed it to mmap_va_hint(), because *pgoff will be able to be
>     updated, so it's not only about ordering, but "order" and "pgoff
>     adjustment" hints that the core mm will use when calculating the VA.

Where does order come back though? Returns order?

It seems viable

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-06-30 14:05                                                 ` Jason Gunthorpe
@ 2025-07-02 20:58                                                   ` Peter Xu
  2025-07-02 23:32                                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 77+ messages in thread
From: Peter Xu @ 2025-07-02 20:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Mon, Jun 30, 2025 at 11:05:37AM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 25, 2025 at 03:26:44PM -0400, Peter Xu wrote:
> > On Wed, Jun 25, 2025 at 03:41:54PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 25, 2025 at 01:12:11PM -0400, Peter Xu wrote:
> > > 
> > > > After I read the two use cases, I mostly agree.  Just one trivial thing to
> > > > mention, it may not be direct map but vmap() (see io_region_init_ptr()).
> > > 
> > > If it is vmapped then this is all silly, you should vmap and mmap
> > > using the same cache colouring and, AFAIK, pgoff is how this works for
> > > purely userspace.
> > > 
> > > Once vmap'd it should determine the cache colour and set the pgoff
> > > properly, then everything should already work no?
> > 
> > I don't yet see how to set the pgoff.  Here pgoff is passed from the
> > userspace, which follows io_uring's definition (per io_uring_mmap).
> 
> That's too bad
> 
> So you have to do it the other way and pass the pgoff to the vmap so
> the vmap ends up with the same colouring as a user VMA holding the
> same pages..

Not sure if I get that point, but.. it'll be hard to achieve at least.

The vmap() happens when the io_uring instance is created (that is when the
submission/completion queues are initialized).  The mmap() happens later,
and it can also happen multiple times, so all of the mmap()ed VAs need to
share the same colouring as the vmap()..  In this case it sounds reasonable
to me to have the alignment done at mmap(), against the vmap() results.

> 
> > So if we want the new API to be proposed here, and make VFIO use it first
> > (while consider it to be applicable to all existing MMU users at least,
> > which I checked all of them so far now), I'd think this proper:
> > 
> >     int (*mmap_va_hint)(struct file *file, unsigned long *pgoff, size_t len);
> > 
> > The changes comparing to previous:
> > 
> >     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
> >     the hook can adjust the pgoff for the va allocator to be used.  The
> >     adjustment will not be visible to future mmap() when VMA is created.
> 
> It seems functional, but the above is better, IMHO.

Do you mean we can start with no modification allowed on *pgoff?  I'd
prefer having *pgoff modifiable from the start, as it'll not only work for
io_uring / parisc above from day one (so we don't need to introduce it
on top later, modifying existing users..), but it'll also be cleaner for
the current VFIO use case.

> 
> >     (2) I renamed it to mmap_va_hint(), because *pgoff will be able to be
> >     updated, so it's not only about ordering, but "order" and "pgoff
> >     adjustment" hints that the core mm will use when calculating the VA.
> 
> Where does order come back though? Returns order?

Yes.

> 
> It seems viable

After I double check with the API above, I can go and prepare a new version.

Thanks a lot, Jason.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings
  2025-07-02 20:58                                                   ` Peter Xu
@ 2025-07-02 23:32                                                     ` Jason Gunthorpe
  0 siblings, 0 replies; 77+ messages in thread
From: Jason Gunthorpe @ 2025-07-02 23:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: Liam R. Howlett, Lorenzo Stoakes, linux-kernel, linux-mm, kvm,
	Andrew Morton, Alex Williamson, Zi Yan, Alex Mastro,
	David Hildenbrand, Nico Pache

On Wed, Jul 02, 2025 at 04:58:46PM -0400, Peter Xu wrote:
> > So you have to do it the other way and pass the pgoff to the vmap so
> > the vmap ends up with the same colouring as a user VMA holding the
> > same pages..
> 
> Not sure if I get that point, but.. it'll be hard to achieve at least.
> 
> The vmap() happens (submit/complete queues initializes) when io_uring
> instance is created.  The mmap() happens later, and it can also happen
> multiple times, so that all of the VAs got mmap()ed need to share the same
> colouring with the vmap()..  In this case it sounds reasonable to me to
> have the alignment done at mmap(), against the vmap() results.

The way this usually works is the memory is bound to a mmap "cookie"
- the pgoff - which userspace can use as many times as it likes.

Usually you know the thing being allocated will be mmap'd and what
its pgoff will be because it is 1:1 with the cookie/pgoff.

Didn't try to guess what io_uring has done here, but, IMHO, it would
be weird if the pgoffs are not 1:1 with the vmaps.

Since you said the pgoff was constant and not exchanged user/kernel
then presumably the vmap just needs to use that constant pgoff for its
colouring.

> > > The changes comparing to previous:
> > > 
> > >     (1) merged pgoff and *phys_pgoff parameters into one unsigned long, so
> > >     the hook can adjust the pgoff for the va allocator to be used.  The
> > >     adjustment will not be visible to future mmap() when VMA is created.
> > 
> > It seems functional, but the above is better, IMHO.
> 
> Do you mean we can start with no modification allowed on *pgoff?  I'd
> prefer having *pgoff modifiable from the start, as it'll not only work for
> io_uring / parisc above since the 1st day (so we don't need to introduce it
> on top, modifying existing users..), but it'll also be cleaner to be used
> in the current VFIO's use case.

I think a modifiable pgoff is really a weird concept... Especially if it
is only modified for the alignment calculation.

But if it is the only way so be it

Jason

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2025-07-02 23:32 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-13 13:41 [PATCH 0/5] mm/vfio: huge pfnmaps with !MAP_FIXED mappings Peter Xu
2025-06-13 13:41 ` [PATCH 1/5] mm: Deduplicate mm_get_unmapped_area() Peter Xu
2025-06-13 14:12   ` Jason Gunthorpe
2025-06-13 14:55   ` Oscar Salvador
2025-06-13 14:58   ` Zi Yan
2025-06-13 15:57   ` Lorenzo Stoakes
2025-06-13 17:00     ` Pedro Falcato
2025-06-13 18:00   ` David Hildenbrand
2025-06-16  8:01   ` David Laight
2025-06-17 21:13     ` Peter Xu
2025-06-13 13:41 ` [PATCH 2/5] mm/hugetlb: Remove prepare_hugepage_range() Peter Xu
2025-06-13 14:12   ` Jason Gunthorpe
2025-06-13 14:59   ` Oscar Salvador
2025-06-13 15:13   ` Zi Yan
2025-06-13 16:24     ` Peter Xu
2025-06-13 18:01       ` David Hildenbrand
2025-06-14  4:11   ` Liam R. Howlett
2025-06-17 21:07     ` Peter Xu
2025-06-13 13:41 ` [PATCH 3/5] mm: Rename __thp_get_unmapped_area to mm_get_unmapped_area_aligned Peter Xu
2025-06-13 14:17   ` Jason Gunthorpe
2025-06-13 15:13     ` Peter Xu
2025-06-13 16:00       ` Jason Gunthorpe
2025-06-13 18:31         ` Peter Xu
2025-06-13 15:19   ` Zi Yan
2025-06-13 18:33     ` Peter Xu
2025-06-13 15:36   ` Lorenzo Stoakes
2025-06-13 18:45     ` Peter Xu
2025-06-13 19:18       ` Lorenzo Stoakes
2025-06-13 20:34         ` Peter Xu
2025-06-14  5:58           ` Lorenzo Stoakes
2025-06-14  5:23   ` Liam R. Howlett
2025-06-16 12:14     ` Jason Gunthorpe
2025-06-16 12:20       ` Lorenzo Stoakes
2025-06-16 12:26         ` Jason Gunthorpe
2025-06-13 13:41 ` [PATCH 4/5] vfio: Introduce vfio_device_ops.get_unmapped_area hook Peter Xu
2025-06-13 14:18   ` Jason Gunthorpe
2025-06-13 18:03   ` David Hildenbrand
2025-06-14 14:46   ` kernel test robot
2025-06-17 15:39     ` Peter Xu
2025-06-17 15:41       ` Jason Gunthorpe
2025-06-17 16:47         ` Peter Xu
2025-06-17 19:39           ` Peter Xu
2025-06-17 19:46             ` Jason Gunthorpe
2025-06-17 20:01               ` Peter Xu
2025-06-17 23:00                 ` Jason Gunthorpe
2025-06-17 23:26                   ` Peter Xu
2025-06-13 13:41 ` [PATCH 5/5] vfio-pci: Best-effort huge pfnmaps with !MAP_FIXED mappings Peter Xu
2025-06-13 14:29   ` Jason Gunthorpe
2025-06-13 15:26     ` Peter Xu
2025-06-13 16:09       ` Jason Gunthorpe
2025-06-13 19:15         ` Peter Xu
2025-06-13 23:16           ` Jason Gunthorpe
2025-06-16 22:06             ` Peter Xu
2025-06-16 23:00               ` Jason Gunthorpe
2025-06-17 20:56                 ` Peter Xu
2025-06-17 23:18                   ` Jason Gunthorpe
2025-06-17 23:36                     ` Peter Xu
2025-06-18 16:56                       ` Peter Xu
2025-06-18 17:46                         ` Jason Gunthorpe
2025-06-18 19:15                           ` Peter Xu
2025-06-19 13:58                             ` Jason Gunthorpe
2025-06-19 14:55                               ` Peter Xu
2025-06-19 18:40                                 ` Jason Gunthorpe
2025-06-24 20:37                                   ` Peter Xu
2025-06-24 20:51                                     ` Peter Xu
2025-06-24 23:40                                     ` Jason Gunthorpe
2025-06-25  0:48                                       ` Peter Xu
2025-06-25 13:07                                         ` Jason Gunthorpe
2025-06-25 17:12                                           ` Peter Xu
2025-06-25 18:41                                             ` Jason Gunthorpe
2025-06-25 19:26                                               ` Peter Xu
2025-06-30 14:05                                                 ` Jason Gunthorpe
2025-07-02 20:58                                                   ` Peter Xu
2025-07-02 23:32                                                     ` Jason Gunthorpe
2025-06-13 18:09   ` David Hildenbrand
2025-06-13 19:21     ` Peter Xu
     [not found]   ` <20250613174442.1589882-1-amastro@fb.com>
2025-06-13 18:53     ` Peter Xu
