public inbox for kvm@vger.kernel.org
* [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
@ 2025-02-18 22:22 Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 1/6] vfio/type1: Catch zero from pin_user_pages_remote() Alex Williamson
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm, david, willy

v2:
 - Rewrapped comment block in 3/6
 - Added 4/6 to use consistent types (Jason)
 - Renamed s/pgmask/addr_mask/ (David)
 - Updated 6/6 with proposed epfn algorithm (Jason)
 - Applied and retained sign-offs for all but 6/6 where the epfn
   calculation changed

v1: https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/

As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
become a significant bottleneck for VMs making use of device assignment.
Not only does each mapping require upwards of a few seconds, but BARs
are mapped in and out of the VM address space multiple times during
guest boot.  Factor in that multi-GPU configurations are increasingly
commonplace and BAR sizes continue to grow, and guest boot can already
be delayed by minutes today.

We've taken steps to make Linux a better guest by batching PCI BAR
sizing operations[1], but that only provides an incremental improvement.

This series attempts to fully address the issue by leveraging the huge
pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
mappings, we can later take advantage of the knowledge of the mapping
level page mask to iterate on the relevant mapping stride.  In the
commonly achieved optimal case, this results in a reduction of pfn
lookups by a factor of 256k.  For a local test system, an overhead of
~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
page-sized operations reduced to 32 pud-sized operations).

Please review, test, and provide feedback.  I hope that mm folks can
ack the trivial follow_pfnmap_args update to provide the mapping level
page mask.  Naming is hard, so any preference other than addr_mask is
welcome.  Thanks,

Alex

[1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/


Alex Williamson (6):
  vfio/type1: Catch zero from pin_user_pages_remote()
  vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
  vfio/type1: Use vfio_batch for vaddr_get_pfns()
  vfio/type1: Use consistent types for page counts
  mm: Provide address mask in struct follow_pfnmap_args
  vfio/type1: Use mapping page mask for pfnmaps

 drivers/vfio/vfio_iommu_type1.c | 123 ++++++++++++++++++++------------
 include/linux/mm.h              |   2 +
 mm/memory.c                     |   1 +
 3 files changed, 80 insertions(+), 46 deletions(-)

-- 
2.48.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v2 1/6] vfio/type1: Catch zero from pin_user_pages_remote()
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 2/6] vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch Alex Williamson
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg

pin_user_pages_remote() can currently return zero for invalid args
or a zero nr_pages, neither of which should ever happen.  However,
vaddr_get_pfns() indicates it should only ever return a positive
value or -errno, and there's a theoretical case where a zero could
slip through and be unhandled by callers.  Therefore convert zero to
-EFAULT.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 50ebc9593c9d..119cf886d8c0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -564,6 +564,8 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	if (ret > 0) {
 		*pfn = page_to_pfn(pages[0]);
 		goto done;
+	} else if (!ret) {
+		ret = -EFAULT;
 	}
 
 	vaddr = untagged_addr_remote(mm, vaddr);
-- 
2.48.1



* [PATCH v2 2/6] vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 1/6] vfio/type1: Catch zero from pin_user_pages_remote() Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 3/6] vfio/type1: Use vfio_batch for vaddr_get_pfns() Alex Williamson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg

This is a step towards passing the structure to vaddr_get_pfns()
directly in order to provide greater distinction between page backed
pfns and pfnmaps.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 119cf886d8c0..2e95f5f4d881 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -471,12 +471,12 @@ static int put_pfn(unsigned long pfn, int prot)
 
 #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
 
-static void vfio_batch_init(struct vfio_batch *batch)
+static void __vfio_batch_init(struct vfio_batch *batch, bool single)
 {
 	batch->size = 0;
 	batch->offset = 0;
 
-	if (unlikely(disable_hugepages))
+	if (single || unlikely(disable_hugepages))
 		goto fallback;
 
 	batch->pages = (struct page **) __get_free_page(GFP_KERNEL);
@@ -491,6 +491,16 @@ static void vfio_batch_init(struct vfio_batch *batch)
 	batch->capacity = 1;
 }
 
+static void vfio_batch_init(struct vfio_batch *batch)
+{
+	__vfio_batch_init(batch, false);
+}
+
+static void vfio_batch_init_single(struct vfio_batch *batch)
+{
+	__vfio_batch_init(batch, true);
+}
+
 static void vfio_batch_unpin(struct vfio_batch *batch, struct vfio_dma *dma)
 {
 	while (batch->size) {
@@ -730,7 +740,7 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 				  unsigned long *pfn_base, bool do_accounting)
 {
-	struct page *pages[1];
+	struct vfio_batch batch;
 	struct mm_struct *mm;
 	int ret;
 
@@ -738,7 +748,9 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	if (!mmget_not_zero(mm))
 		return -ENODEV;
 
-	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, pages);
+	vfio_batch_init_single(&batch);
+
+	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, batch.pages);
 	if (ret != 1)
 		goto out;
 
@@ -757,6 +769,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 	}
 
 out:
+	vfio_batch_fini(&batch);
 	mmput(mm);
 	return ret;
 }
-- 
2.48.1



* [PATCH v2 3/6] vfio/type1: Use vfio_batch for vaddr_get_pfns()
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 1/6] vfio/type1: Catch zero from pin_user_pages_remote() Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 2/6] vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-18 22:22 ` [PATCH v2 4/6] vfio/type1: Use consistent types for page counts Alex Williamson
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg

Passing the vfio_batch to vaddr_get_pfns() allows for greater
distinction between page backed pfns and pfnmaps.  In the case of page
backed pfns, vfio_batch.size is set to a positive value matching the
number of pages filled in vfio_batch.pages.  For a pfnmap,
vfio_batch.size remains zero as vfio_batch.pages are not used.  In both
cases the return value continues to indicate the number of pfns and the
provided pfn arg is set to the initial pfn value.

This allows us to shortcut the pfnmap case, which is detected by the
zero vfio_batch.size.  pfnmaps do not contribute to locked memory
accounting, so we can update counters and continue directly.  This
also enables a future where vaddr_get_pfns() can return a value
greater than one for consecutive pfnmaps.

NB. Now that we're not guessing whether the initial pfn is page backed
or pfnmap, we no longer need to special case the put_pfn() and batch
size reset.  It's safe for vfio_batch_unpin() to handle this case.

Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 63 ++++++++++++++++++---------------
 1 file changed, 35 insertions(+), 28 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 2e95f5f4d881..fafd8af125c7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -555,12 +555,16 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 
 /*
  * Returns the positive number of pfns successfully obtained or a negative
- * error code.
+ * error code.  The initial pfn is stored in the pfn arg.  For page-backed
+ * pfns, the provided batch is also updated to indicate the filled pages and
+ * initial offset.  For VM_PFNMAP pfns, only the returned number of pfns and
+ * returned initial pfn are provided; subsequent pfns are contiguous.
  */
 static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 			  long npages, int prot, unsigned long *pfn,
-			  struct page **pages)
+			  struct vfio_batch *batch)
 {
+	long pin_pages = min_t(long, npages, batch->capacity);
 	struct vm_area_struct *vma;
 	unsigned int flags = 0;
 	int ret;
@@ -569,10 +573,12 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 		flags |= FOLL_WRITE;
 
 	mmap_read_lock(mm);
-	ret = pin_user_pages_remote(mm, vaddr, npages, flags | FOLL_LONGTERM,
-				    pages, NULL);
+	ret = pin_user_pages_remote(mm, vaddr, pin_pages, flags | FOLL_LONGTERM,
+				    batch->pages, NULL);
 	if (ret > 0) {
-		*pfn = page_to_pfn(pages[0]);
+		*pfn = page_to_pfn(batch->pages[0]);
+		batch->size = ret;
+		batch->offset = 0;
 		goto done;
 	} else if (!ret) {
 		ret = -EFAULT;
@@ -628,32 +634,42 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 		*pfn_base = 0;
 	}
 
+	if (unlikely(disable_hugepages))
+		npage = 1;
+
 	while (npage) {
 		if (!batch->size) {
 			/* Empty batch, so refill it. */
-			long req_pages = min_t(long, npage, batch->capacity);
-
-			ret = vaddr_get_pfns(mm, vaddr, req_pages, dma->prot,
-					     &pfn, batch->pages);
+			ret = vaddr_get_pfns(mm, vaddr, npage, dma->prot,
+					     &pfn, batch);
 			if (ret < 0)
 				goto unpin_out;
 
-			batch->size = ret;
-			batch->offset = 0;
-
 			if (!*pfn_base) {
 				*pfn_base = pfn;
 				rsvd = is_invalid_reserved_pfn(*pfn_base);
 			}
+
+			/* Handle pfnmap */
+			if (!batch->size) {
+				if (pfn != *pfn_base + pinned || !rsvd)
+					goto out;
+
+				pinned += ret;
+				npage -= ret;
+				vaddr += (PAGE_SIZE * ret);
+				iova += (PAGE_SIZE * ret);
+				continue;
+			}
 		}
 
 		/*
-		 * pfn is preset for the first iteration of this inner loop and
-		 * updated at the end to handle a VM_PFNMAP pfn.  In that case,
-		 * batch->pages isn't valid (there's no struct page), so allow
-		 * batch->pages to be touched only when there's more than one
-		 * pfn to check, which guarantees the pfns are from a
-		 * !VM_PFNMAP vma.
+		 * pfn is preset for the first iteration of this inner loop
+		 * due to the fact that vaddr_get_pfns() needs to provide the
+		 * initial pfn for pfnmaps.  Therefore to reduce redundancy,
+		 * the next pfn is fetched at the end of the loop.
+		 * A PageReserved() page could still qualify as page backed
+		 * and rsvd here, and therefore continues to use the batch.
 		 */
 		while (true) {
 			if (pfn != *pfn_base + pinned ||
@@ -688,21 +704,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 
 			pfn = page_to_pfn(batch->pages[batch->offset]);
 		}
-
-		if (unlikely(disable_hugepages))
-			break;
 	}
 
 out:
 	ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
-	if (batch->size == 1 && !batch->offset) {
-		/* May be a VM_PFNMAP pfn, which the batch can't remember. */
-		put_pfn(pfn, dma->prot);
-		batch->size = 0;
-	}
-
 	if (ret < 0) {
 		if (pinned && !rsvd) {
 			for (pfn = *pfn_base ; pinned ; pfn++, pinned--)
@@ -750,7 +757,7 @@ static int vfio_pin_page_external(struct vfio_dma *dma, unsigned long vaddr,
 
 	vfio_batch_init_single(&batch);
 
-	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, batch.pages);
+	ret = vaddr_get_pfns(mm, vaddr, 1, dma->prot, pfn_base, &batch);
 	if (ret != 1)
 		goto out;
 
-- 
2.48.1



* [PATCH v2 4/6] vfio/type1: Use consistent types for page counts
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (2 preceding siblings ...)
  2025-02-18 22:22 ` [PATCH v2 3/6] vfio/type1: Use vfio_batch for vaddr_get_pfns() Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-18 22:46   ` Peter Xu
  2025-02-25  0:17   ` Jason Gunthorpe
  2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg

Page count should more consistently be an unsigned long when passed as
an argument, while functions returning a number of pages should use a
signed long to allow for -errno.

vaddr_get_pfns() can therefore be upgraded to return long, though in
practice it's currently limited by the batch capacity.  In fact, the
batch indexes are noted to never hold negative values, so while it
doesn't make sense to bloat the structure with unsigned longs in this
case, it does make sense to specify these as unsigned.

No change in behavior expected.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index fafd8af125c7..ce661f03f139 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -103,9 +103,9 @@ struct vfio_dma {
 struct vfio_batch {
 	struct page		**pages;	/* for pin_user_pages_remote */
 	struct page		*fallback_page; /* if pages alloc fails */
-	int			capacity;	/* length of pages array */
-	int			size;		/* of batch currently */
-	int			offset;		/* of next entry in pages */
+	unsigned int		capacity;	/* length of pages array */
+	unsigned int		size;		/* of batch currently */
+	unsigned int		offset;		/* of next entry in pages */
 };
 
 struct vfio_iommu_group {
@@ -560,14 +560,14 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
  * initial offset.  For VM_PFNMAP pfns, only the returned number of pfns and
  * returned initial pfn are provided; subsequent pfns are contiguous.
  */
-static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
-			  long npages, int prot, unsigned long *pfn,
-			  struct vfio_batch *batch)
+static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
+			   unsigned long npages, int prot, unsigned long *pfn,
+			   struct vfio_batch *batch)
 {
-	long pin_pages = min_t(long, npages, batch->capacity);
+	unsigned long pin_pages = min_t(unsigned long, npages, batch->capacity);
 	struct vm_area_struct *vma;
 	unsigned int flags = 0;
-	int ret;
+	long ret;
 
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
@@ -612,7 +612,7 @@ static int vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
  * first page and all consecutive pages with the same locking.
  */
 static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
-				  long npage, unsigned long *pfn_base,
+				  unsigned long npage, unsigned long *pfn_base,
 				  unsigned long limit, struct vfio_batch *batch)
 {
 	unsigned long pfn;
@@ -724,7 +724,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 }
 
 static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
-				    unsigned long pfn, long npage,
+				    unsigned long pfn, unsigned long npage,
 				    bool do_accounting)
 {
 	long unlocked = 0, locked = 0;
-- 
2.48.1



* [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (3 preceding siblings ...)
  2025-02-18 22:22 ` [PATCH v2 4/6] vfio/type1: Use consistent types for page counts Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-19  8:31   ` David Hildenbrand
  2025-02-18 22:22 ` [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm, david

follow_pfnmap_start() walks the page table for a given address and
fills out the struct follow_pfnmap_args in pfnmap_args_setup().
The address mask of the page table level is already provided to this
latter function for calculating the pfn.  This address mask can also
be useful for the caller to determine the extent of the contiguous
mapping.

For example, vfio-pci now supports huge_fault for pfnmaps and is able
to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
for a contiguous pfn range.  Providing the mapping address mask allows
us to skip ahead by the extent of the mapping level.  Assuming a 1GB pud
level
and 4KB page size, iterations are reduced by a factor of 256K.  In wall
clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: linux-mm@kvack.org
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 include/linux/mm.h | 2 ++
 mm/memory.c        | 1 +
 2 files changed, 3 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1068ddcbb7..92b30dba7e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2417,11 +2417,13 @@ struct follow_pfnmap_args {
 	 * Outputs:
 	 *
 	 * @pfn: the PFN of the address
+	 * @addr_mask: address mask covering pfn
 	 * @pgprot: the pgprot_t of the mapping
 	 * @writable: whether the mapping is writable
 	 * @special: whether the mapping is a special mapping (real PFN maps)
 	 */
 	unsigned long pfn;
+	unsigned long addr_mask;
 	pgprot_t pgprot;
 	bool writable;
 	bool special;
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..8f0969f132fe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -6477,6 +6477,7 @@ static inline void pfnmap_args_setup(struct follow_pfnmap_args *args,
 	args->lock = lock;
 	args->ptep = ptep;
 	args->pfn = pfn_base + ((args->address & ~addr_mask) >> PAGE_SHIFT);
+	args->addr_mask = addr_mask;
 	args->pgprot = pgprot;
 	args->writable = writable;
 	args->special = special;
-- 
2.48.1



* [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (4 preceding siblings ...)
  2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-18 22:22 ` Alex Williamson
  2025-02-18 22:47   ` Peter Xu
  2025-02-25  0:17   ` Jason Gunthorpe
  2025-02-19  2:37 ` [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Mitchell Augustin
  2025-02-28 16:32 ` Alex Williamson
  7 siblings, 2 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 22:22 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, willy

vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
page table and therefore knows the page mask of the level where the
address is found and returns this through follow_pfnmap_args.addr_mask.
Subsequent pfns from this address until the end of the mapping page are
necessarily consecutive.  Use this information to retrieve a range of
pfnmap pfns in a single pass.

With optimal mappings and alignment on systems with 1GB pud and 4KB
page size, this reduces iterations for DMA mapping PCI BARs by a
factor of 256K.  In real-world testing, the overhead of iterating
pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
sub-millisecond.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index ce661f03f139..0ac56072af9f 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
 
 static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 			    unsigned long vaddr, unsigned long *pfn,
-			    bool write_fault)
+			    unsigned long *addr_mask, bool write_fault)
 {
 	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
 	int ret;
@@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
 			return ret;
 	}
 
-	if (write_fault && !args.writable)
+	if (write_fault && !args.writable) {
 		ret = -EFAULT;
-	else
+	} else {
 		*pfn = args.pfn;
+		*addr_mask = args.addr_mask;
+	}
 
 	follow_pfnmap_end(&args);
 	return ret;
@@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	vma = vma_lookup(mm, vaddr);
 
 	if (vma && vma->vm_flags & VM_PFNMAP) {
-		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
+		unsigned long addr_mask;
+
+		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
+				       prot & IOMMU_WRITE);
 		if (ret == -EAGAIN)
 			goto retry;
 
 		if (!ret) {
-			if (is_invalid_reserved_pfn(*pfn))
-				ret = 1;
-			else
+			if (is_invalid_reserved_pfn(*pfn)) {
+				unsigned long epfn;
+
+				epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
+				ret = min_t(long, npages, epfn - *pfn);
+			} else {
 				ret = -EFAULT;
+			}
 		}
 	}
 done:
-- 
2.48.1



* Re: [PATCH v2 4/6] vfio/type1: Use consistent types for page counts
  2025-02-18 22:22 ` [PATCH v2 4/6] vfio/type1: Use consistent types for page counts Alex Williamson
@ 2025-02-18 22:46   ` Peter Xu
  2025-02-19  0:55     ` Mitchell Augustin
  2025-02-25  0:17   ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Xu @ 2025-02-18 22:46 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, mitchell.augustin, clg, jgg

On Tue, Feb 18, 2025 at 03:22:04PM -0700, Alex Williamson wrote:
> Page count should more consistently be an unsigned long when passed as
> an argument while functions returning a number of pages should use a
> signed long to allow for -errno.
> 
> vaddr_get_pfns() can therefore be upgraded to return long, though in
> practice it's currently limited by the batch capacity.  In fact, the
> batch indexes are noted to never hold negative values, so while it
> doesn't make sense to bloat the structure with unsigned longs in this
> case, it does make sense to specify these as unsigned.
> 
> No change in behavior expected.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>

-- 
Peter Xu



* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-18 22:22 ` [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
@ 2025-02-18 22:47   ` Peter Xu
  2025-02-18 23:14     ` Alex Williamson
  2025-02-25  0:17   ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Peter Xu @ 2025-02-18 22:47 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, mitchell.augustin, clg, jgg, willy

On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:
> vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> page table and therefore knows the page mask of the level where the
> address is found and returns this through follow_pfnmap_args.addr_mask.
> Subsequent pfns from this address until the end of the mapping page are
> necessarily consecutive.  Use this information to retrieve a range of
> pfnmap pfns in a single pass.
> 
> With optimal mappings and alignment on systems with 1GB pud and 4KB
> page size, this reduces iterations for DMA mapping PCI BARs by a
> factor of 256K.  In real world testing, the overhead of iterating
> pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> sub-millisecond overhead.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index ce661f03f139..0ac56072af9f 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
>  
>  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>  			    unsigned long vaddr, unsigned long *pfn,
> -			    bool write_fault)
> +			    unsigned long *addr_mask, bool write_fault)
>  {
>  	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
>  	int ret;
> @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
>  			return ret;
>  	}
>  
> -	if (write_fault && !args.writable)
> +	if (write_fault && !args.writable) {
>  		ret = -EFAULT;
> -	else
> +	} else {
>  		*pfn = args.pfn;
> +		*addr_mask = args.addr_mask;
> +	}
>  
>  	follow_pfnmap_end(&args);
>  	return ret;
> @@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	vma = vma_lookup(mm, vaddr);
>  
>  	if (vma && vma->vm_flags & VM_PFNMAP) {
> -		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> +		unsigned long addr_mask;
> +
> +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
> +				       prot & IOMMU_WRITE);
>  		if (ret == -EAGAIN)
>  			goto retry;
>  
>  		if (!ret) {
> -			if (is_invalid_reserved_pfn(*pfn))
> -				ret = 1;
> -			else
> +			if (is_invalid_reserved_pfn(*pfn)) {
> +				unsigned long epfn;
> +
> +				epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
> +				ret = min_t(long, npages, epfn - *pfn);

s/long/unsigned long/?

Reviewed-by: Peter Xu <peterx@redhat.com>

> +			} else {
>  				ret = -EFAULT;
> +			}
>  		}
>  	}
>  done:
> -- 
> 2.48.1
> 

-- 
Peter Xu



* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-18 22:47   ` Peter Xu
@ 2025-02-18 23:14     ` Alex Williamson
  2025-02-19  2:36       ` Mitchell Augustin
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2025-02-18 23:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: kvm, linux-kernel, mitchell.augustin, clg, jgg, willy

On Tue, 18 Feb 2025 17:47:46 -0500
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:
> > vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> > pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> > page table and therefore knows the page mask of the level where the
> > address is found and returns this through follow_pfnmap_args.addr_mask.
> > Subsequent pfns from this address until the end of the mapping page are
> > necessarily consecutive.  Use this information to retrieve a range of
> > pfnmap pfns in a single pass.
> > 
> > With optimal mappings and alignment on systems with 1GB pud and 4KB
> > page size, this reduces iterations for DMA mapping PCI BARs by a
> > factor of 256K.  In real world testing, the overhead of iterating
> > pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> > sub-millisecond overhead.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
> >  1 file changed, 16 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index ce661f03f139..0ac56072af9f 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
> >  
> >  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> >  			    unsigned long vaddr, unsigned long *pfn,
> > -			    bool write_fault)
> > +			    unsigned long *addr_mask, bool write_fault)
> >  {
> >  	struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
> >  	int ret;
> > @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> >  			return ret;
> >  	}
> >  
> > -	if (write_fault && !args.writable)
> > +	if (write_fault && !args.writable) {
> >  		ret = -EFAULT;
> > -	else
> > +	} else {
> >  		*pfn = args.pfn;
> > +		*addr_mask = args.addr_mask;
> > +	}
> >  
> >  	follow_pfnmap_end(&args);
> >  	return ret;
> > @@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >  	vma = vma_lookup(mm, vaddr);
> >  
> >  	if (vma && vma->vm_flags & VM_PFNMAP) {
> > -		ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > +		unsigned long addr_mask;
> > +
> > +		ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
> > +				       prot & IOMMU_WRITE);
> >  		if (ret == -EAGAIN)
> >  			goto retry;
> >  
> >  		if (!ret) {
> > -			if (is_invalid_reserved_pfn(*pfn))
> > -				ret = 1;
> > -			else
> > +			if (is_invalid_reserved_pfn(*pfn)) {
> > +				unsigned long epfn;
> > +
> > +				epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
> > +				ret = min_t(long, npages, epfn - *pfn);  
> 
> s/long/unsigned long/?

ret is signed long since it's the function return and needs to be able
to return -errno, so long was the intention here.  Thanks,

Alex
 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> > +			} else {
> >  				ret = -EFAULT;
> > +			}
> >  		}
> >  	}
> >  done:
> > -- 
> > 2.48.1
> >   
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 4/6] vfio/type1: Use consistent types for page counts
  2025-02-18 22:46   ` Peter Xu
@ 2025-02-19  0:55     ` Mitchell Augustin
  0 siblings, 0 replies; 23+ messages in thread
From: Mitchell Augustin @ 2025-02-19  0:55 UTC (permalink / raw)
  To: Peter Xu; +Cc: Alex Williamson, kvm, linux-kernel, clg, jgg

No change in behavior observed from v1 on my config (DGX H100). Thanks!

Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>

On Tue, Feb 18, 2025 at 4:46 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Feb 18, 2025 at 03:22:04PM -0700, Alex Williamson wrote:
> > Page count should more consistently be an unsigned long when passed as
> > an argument while functions returning a number of pages should use a
> > signed long to allow for -errno.
> >
> > vaddr_get_pfns() can therefore be upgraded to return long, though in
> > practice it's currently limited by the batch capacity.  In fact, the
> > batch indexes are noted to never hold negative values, so while it
> > doesn't make sense to bloat the structure with unsigned longs in this
> > case, it does make sense to specify these as unsigned.
> >
> > No change in behavior expected.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>
> Reviewed-by: Peter Xu <peterx@redhat.com>
>
> --
> Peter Xu
>


-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering


* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-18 23:14     ` Alex Williamson
@ 2025-02-19  2:36       ` Mitchell Augustin
  2025-02-19 15:08         ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: Mitchell Augustin @ 2025-02-19  2:36 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Xu, kvm, linux-kernel, clg, jgg, willy

/s/follow_pfnmap_args.pgmask/follow_pfnmap_args.addr_mask/ in v2
commit log. Aside from that, it works as expected.

Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>


On Tue, Feb 18, 2025 at 5:14 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Tue, 18 Feb 2025 17:47:46 -0500
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:
> > > vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> > > pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> > > page table and therefore knows the page mask of the level where the
> > > address is found and returns this through follow_pfnmap_args.pgmask.
> > > Subsequent pfns from this address until the end of the mapping page are
> > > necessarily consecutive.  Use this information to retrieve a range of
> > > pfnmap pfns in a single pass.
> > >
> > > With optimal mappings and alignment on systems with 1GB pud and 4KB
> > > page size, this reduces iterations for DMA mapping PCI BARs by a
> > > factor of 256K.  In real world testing, the overhead of iterating
> > > pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> > > sub-millisecond overhead.
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
> > >  1 file changed, 16 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index ce661f03f139..0ac56072af9f 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
> > >
> > >  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > >                         unsigned long vaddr, unsigned long *pfn,
> > > -                       bool write_fault)
> > > +                       unsigned long *addr_mask, bool write_fault)
> > >  {
> > >     struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
> > >     int ret;
> > > @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > >                     return ret;
> > >     }
> > >
> > > -   if (write_fault && !args.writable)
> > > +   if (write_fault && !args.writable) {
> > >             ret = -EFAULT;
> > > -   else
> > > +   } else {
> > >             *pfn = args.pfn;
> > > +           *addr_mask = args.addr_mask;
> > > +   }
> > >
> > >     follow_pfnmap_end(&args);
> > >     return ret;
> > > @@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > >     vma = vma_lookup(mm, vaddr);
> > >
> > >     if (vma && vma->vm_flags & VM_PFNMAP) {
> > > -           ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > > +           unsigned long addr_mask;
> > > +
> > > +           ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
> > > +                                  prot & IOMMU_WRITE);
> > >             if (ret == -EAGAIN)
> > >                     goto retry;
> > >
> > >             if (!ret) {
> > > -                   if (is_invalid_reserved_pfn(*pfn))
> > > -                           ret = 1;
> > > -                   else
> > > +                   if (is_invalid_reserved_pfn(*pfn)) {
> > > +                           unsigned long epfn;
> > > +
> > > +                           epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
> > > +                           ret = min_t(long, npages, epfn - *pfn);
> >
> > s/long/unsigned long/?
>
> ret is signed long since it's the function return and needs to be able
> to return -errno, so long was the intention here.  Thanks,
>
> Alex
>
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> >
> > > +                   } else {
> > >                             ret = -EFAULT;
> > > +                   }
> > >             }
> > >     }
> > >  done:
> > > --
> > > 2.48.1
> > >
> >
>


--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering


* Re: [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (5 preceding siblings ...)
  2025-02-18 22:22 ` [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
@ 2025-02-19  2:37 ` Mitchell Augustin
  2025-02-28 16:32 ` Alex Williamson
  7 siblings, 0 replies; 23+ messages in thread
From: Mitchell Augustin @ 2025-02-19  2:37 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, clg, jgg, akpm, linux-mm, david, willy

No change in behavior observed from v1 on my config (DGX H100). Thanks!

Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>

On Tue, Feb 18, 2025 at 4:22 PM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> v2:
>  - Rewrapped comment block in 3/6
>  - Added 4/6 to use consistent types (Jason)
>  - Renamed s/pgmask/addr_mask/ (David)
>  - Updated 6/6 with proposed epfn algorithm (Jason)
>  - Applied and retained sign-offs for all but 6/6 where the epfn
>    calculation changed
>
> v1: https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/
>
> As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
> become a significant overhead for VMs making use of device assignment.
> Not only does each mapping require upwards of a few seconds, but BARs
> are mapped in and out of the VM address space multiple times during
> guest boot.  Also factor in that multi-GPU configurations are
> increasingly commonplace and BAR sizes are continuing to increase.
> Configurations today can already be delayed minutes during guest boot.
>
> We've taken steps to make Linux a better guest by batching PCI BAR
> sizing operations[1], but it only provides an incremental improvement.
>
> This series attempts to fully address the issue by leveraging the huge
> pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
> mappings, we can later take advantage of the knowledge of the mapping
> level page mask to iterate on the relevant mapping stride.  In the
> commonly achieved optimal case, this results in a reduction of pfn
> lookups by a factor of 256k.  For a local test system, an overhead of
> ~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
> page sized operations reduced to 32 pud sized operations).
>
> Please review, test, and provide feedback.  I hope that mm folks can
> ack the trivial follow_pfnmap_args update to provide the mapping level
> page mask.  Naming is hard, so any preference other than pgmask is
> welcome.  Thanks,
>
> Alex
>
> [1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/
>
>
> Alex Williamson (6):
>   vfio/type1: Catch zero from pin_user_pages_remote()
>   vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
>   vfio/type1: Use vfio_batch for vaddr_get_pfns()
>   vfio/type1: Use consistent types for page counts
>   mm: Provide address mask in struct follow_pfnmap_args
>   vfio/type1: Use mapping page mask for pfnmaps
>
>  drivers/vfio/vfio_iommu_type1.c | 123 ++++++++++++++++++++------------
>  include/linux/mm.h              |   2 +
>  mm/memory.c                     |   1 +
>  3 files changed, 80 insertions(+), 46 deletions(-)
>
> --
> 2.48.1
>


-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering
Email: mitchell.augustin@canonical.com
Location: United States of America


canonical.com
ubuntu.com


* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
  2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
@ 2025-02-19  8:31   ` David Hildenbrand
  2025-02-26 19:54     ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2025-02-19  8:31 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm

On 18.02.25 23:22, Alex Williamson wrote:
> follow_pfnmap_start() walks the page table for a given address and
> fills out the struct follow_pfnmap_args in pfnmap_args_setup().
> The address mask of the page table level is already provided to this
> latter function for calculating the pfn.  This address mask can also
> be useful for the caller to determine the extent of the contiguous
> mapping.
> 
> For example, vfio-pci now supports huge_fault for pfnmaps and is able
> to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
> PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
> for a contiguous pfn range.  Providing the mapping address mask allows
> us to skip the extent of the mapping level.  Assuming a 1GB pud level
> and 4KB page size, iterations are reduced by a factor of 256K.  In wall
> clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: linux-mm@kvack.org
> Reviewed-by: Peter Xu <peterx@redhat.com>
> Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-19  2:36       ` Mitchell Augustin
@ 2025-02-19 15:08         ` Alex Williamson
  2025-02-19 20:32           ` Mitchell Augustin
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2025-02-19 15:08 UTC (permalink / raw)
  To: Mitchell Augustin; +Cc: Peter Xu, kvm, linux-kernel, clg, jgg, willy

On Tue, 18 Feb 2025 20:36:22 -0600
Mitchell Augustin <mitchell.augustin@canonical.com> wrote:

> /s/follow_pfnmap_args.pgmask/follow_pfnmap_args.addr_mask/ in v2
> commit log.

Thanks for spotting that, if there's no other cause for a re-spin I'll
fix that on commit.  Thanks for the review and testing!

Alex

> Aside from that, it works as expected.
> 
> Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> 
> 
> On Tue, Feb 18, 2025 at 5:14 PM Alex Williamson
> <alex.williamson@redhat.com> wrote:
> >
> > On Tue, 18 Feb 2025 17:47:46 -0500
> > Peter Xu <peterx@redhat.com> wrote:
> >  
> > > On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:  
> > > > vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> > > > pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> > > > page table and therefore knows the page mask of the level where the
> > > > address is found and returns this through follow_pfnmap_args.pgmask.
> > > > Subsequent pfns from this address until the end of the mapping page are
> > > > necessarily consecutive.  Use this information to retrieve a range of
> > > > pfnmap pfns in a single pass.
> > > >
> > > > With optimal mappings and alignment on systems with 1GB pud and 4KB
> > > > page size, this reduces iterations for DMA mapping PCI BARs by a
> > > > factor of 256K.  In real world testing, the overhead of iterating
> > > > pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> > > > sub-millisecond overhead.
> > > >
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
> > > >  1 file changed, 16 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index ce661f03f139..0ac56072af9f 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
> > > >
> > > >  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > >                         unsigned long vaddr, unsigned long *pfn,
> > > > -                       bool write_fault)
> > > > +                       unsigned long *addr_mask, bool write_fault)
> > > >  {
> > > >     struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
> > > >     int ret;
> > > > @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > >                     return ret;
> > > >     }
> > > >
> > > > -   if (write_fault && !args.writable)
> > > > +   if (write_fault && !args.writable) {
> > > >             ret = -EFAULT;
> > > > -   else
> > > > +   } else {
> > > >             *pfn = args.pfn;
> > > > +           *addr_mask = args.addr_mask;
> > > > +   }
> > > >
> > > >     follow_pfnmap_end(&args);
> > > >     return ret;
> > > > @@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > > >     vma = vma_lookup(mm, vaddr);
> > > >
> > > >     if (vma && vma->vm_flags & VM_PFNMAP) {
> > > > -           ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > > > +           unsigned long addr_mask;
> > > > +
> > > > +           ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
> > > > +                                  prot & IOMMU_WRITE);
> > > >             if (ret == -EAGAIN)
> > > >                     goto retry;
> > > >
> > > >             if (!ret) {
> > > > -                   if (is_invalid_reserved_pfn(*pfn))
> > > > -                           ret = 1;
> > > > -                   else
> > > > +                   if (is_invalid_reserved_pfn(*pfn)) {
> > > > +                           unsigned long epfn;
> > > > +
> > > > +                           epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
> > > > +                           ret = min_t(long, npages, epfn - *pfn);  
> > >
> > > s/long/unsigned long/?  
> >
> > ret is signed long since it's the function return and needs to be able
> > to return -errno, so long was the intention here.  Thanks,
> >
> > Alex
> >  
> > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > >  
> > > > +                   } else {
> > > >                             ret = -EFAULT;
> > > > +                   }
> > > >             }
> > > >     }
> > > >  done:
> > > > --
> > > > 2.48.1
> > > >  
> > >  
> >  
> 
> 
> --
> Mitchell Augustin
> Software Engineer - Ubuntu Partner Engineering
> 



* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-19 15:08         ` Alex Williamson
@ 2025-02-19 20:32           ` Mitchell Augustin
  2025-02-26 17:55             ` Alex Williamson
  0 siblings, 1 reply; 23+ messages in thread
From: Mitchell Augustin @ 2025-02-19 20:32 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Xu, kvm, linux-kernel, clg, jgg, willy

> Thanks for the review and testing!

Sure thing, thanks for the patch set!

If you happen to have a few minutes, I'm struggling to understand the
epfn computation and would appreciate some insight.

My current understanding (very possibly incorrect):
- epfn is intended to be the last page frame number that can be
represented at the mapping level corresponding to addr_mask. (so, if
addr_mask == PUD_MASK, epfn would be the highest pfn still in PUD
level).
- ret should be == npages if all pfns in the requested vma are within
the memory hierarchy level denoted by addr_mask. If npages is more
than can be represented at that level, ret == the max number of page
frames representable at addr_mask level.
  - (if the second case is true, that means we were not able to obtain
all requested pages due to running out of PFNs at the current mapping
level)

If the above is all correct, what is confusing me is where the "(*pfn)
| " comes into this equation. If epfn is meant to be the last pfn
representable at addr_mask level of the hierarchy, wouldn't that be
represented by (~pgmask >> PAGE_SHIFT) alone?

Thanks in advance,
Mitchell Augustin

On Wed, Feb 19, 2025 at 9:08 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Tue, 18 Feb 2025 20:36:22 -0600
> Mitchell Augustin <mitchell.augustin@canonical.com> wrote:
>
> > /s/follow_pfnmap_args.pgmask/follow_pfnmap_args.addr_mask/ in v2
> > commit log.
>
> Thanks for spotting that, if there's no other cause for a re-spin I'll
> fix that on commit.  Thanks for the review and testing!
>
> Alex
>
> > Aside from that, it works as expected.
> >
> > Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> > Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> >
> >
> > On Tue, Feb 18, 2025 at 5:14 PM Alex Williamson
> > <alex.williamson@redhat.com> wrote:
> > >
> > > On Tue, 18 Feb 2025 17:47:46 -0500
> > > Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:
> > > > > vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> > > > > pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> > > > > page table and therefore knows the page mask of the level where the
> > > > > address is found and returns this through follow_pfnmap_args.pgmask.
> > > > > Subsequent pfns from this address until the end of the mapping page are
> > > > > necessarily consecutive.  Use this information to retrieve a range of
> > > > > pfnmap pfns in a single pass.
> > > > >
> > > > > With optimal mappings and alignment on systems with 1GB pud and 4KB
> > > > > page size, this reduces iterations for DMA mapping PCI BARs by a
> > > > > factor of 256K.  In real world testing, the overhead of iterating
> > > > > pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> > > > > sub-millisecond overhead.
> > > > >
> > > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > > ---
> > > > >  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
> > > > >  1 file changed, 16 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > > index ce661f03f139..0ac56072af9f 100644
> > > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > > @@ -520,7 +520,7 @@ static void vfio_batch_fini(struct vfio_batch *batch)
> > > > >
> > > > >  static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > >                         unsigned long vaddr, unsigned long *pfn,
> > > > > -                       bool write_fault)
> > > > > +                       unsigned long *addr_mask, bool write_fault)
> > > > >  {
> > > > >     struct follow_pfnmap_args args = { .vma = vma, .address = vaddr };
> > > > >     int ret;
> > > > > @@ -544,10 +544,12 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm,
> > > > >                     return ret;
> > > > >     }
> > > > >
> > > > > -   if (write_fault && !args.writable)
> > > > > +   if (write_fault && !args.writable) {
> > > > >             ret = -EFAULT;
> > > > > -   else
> > > > > +   } else {
> > > > >             *pfn = args.pfn;
> > > > > +           *addr_mask = args.addr_mask;
> > > > > +   }
> > > > >
> > > > >     follow_pfnmap_end(&args);
> > > > >     return ret;
> > > > > @@ -590,15 +592,22 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > > > >     vma = vma_lookup(mm, vaddr);
> > > > >
> > > > >     if (vma && vma->vm_flags & VM_PFNMAP) {
> > > > > -           ret = follow_fault_pfn(vma, mm, vaddr, pfn, prot & IOMMU_WRITE);
> > > > > +           unsigned long addr_mask;
> > > > > +
> > > > > +           ret = follow_fault_pfn(vma, mm, vaddr, pfn, &addr_mask,
> > > > > +                                  prot & IOMMU_WRITE);
> > > > >             if (ret == -EAGAIN)
> > > > >                     goto retry;
> > > > >
> > > > >             if (!ret) {
> > > > > -                   if (is_invalid_reserved_pfn(*pfn))
> > > > > -                           ret = 1;
> > > > > -                   else
> > > > > +                   if (is_invalid_reserved_pfn(*pfn)) {
> > > > > +                           unsigned long epfn;
> > > > > +
> > > > > +                           epfn = (*pfn | (~addr_mask >> PAGE_SHIFT)) + 1;
> > > > > +                           ret = min_t(long, npages, epfn - *pfn);
> > > >
> > > > s/long/unsigned long/?
> > >
> > > ret is signed long since it's the function return and needs to be able
> > > to return -errno, so long was the intention here.  Thanks,
> > >
> > > Alex
> > >
> > > > Reviewed-by: Peter Xu <peterx@redhat.com>
> > > >
> > > > > +                   } else {
> > > > >                             ret = -EFAULT;
> > > > > +                   }
> > > > >             }
> > > > >     }
> > > > >  done:
> > > > > --
> > > > > 2.48.1
> > > > >
> > > >
> > >
> >
> >
> > --
> > Mitchell Augustin
> > Software Engineer - Ubuntu Partner Engineering
> >
>


-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering


* Re: [PATCH v2 4/6] vfio/type1: Use consistent types for page counts
  2025-02-18 22:22 ` [PATCH v2 4/6] vfio/type1: Use consistent types for page counts Alex Williamson
  2025-02-18 22:46   ` Peter Xu
@ 2025-02-25  0:17   ` Jason Gunthorpe
  1 sibling, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-02-25  0:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg

On Tue, Feb 18, 2025 at 03:22:04PM -0700, Alex Williamson wrote:
> Page count should more consistently be an unsigned long when passed as
> an argument while functions returning a number of pages should use a
> signed long to allow for -errno.
> 
> vaddr_get_pfns() can therefore be upgraded to return long, though in
> practice it's currently limited by the batch capacity.  In fact, the
> batch indexes are noted to never hold negative values, so while it
> doesn't make sense to bloat the structure with unsigned longs in this
> case, it does make sense to specify these as unsigned.
> 
> No change in behavior expected.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-18 22:22 ` [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
  2025-02-18 22:47   ` Peter Xu
@ 2025-02-25  0:17   ` Jason Gunthorpe
  1 sibling, 0 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2025-02-25  0:17 UTC (permalink / raw)
  To: Alex Williamson; +Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, willy

On Tue, Feb 18, 2025 at 03:22:06PM -0700, Alex Williamson wrote:
> vfio-pci supports huge_fault for PCI MMIO BARs and will insert pud and
> pmd mappings for well aligned mappings.  follow_pfnmap_start() walks the
> page table and therefore knows the page mask of the level where the
> address is found and returns this through follow_pfnmap_args.pgmask.
> Subsequent pfns from this address until the end of the mapping page are
> necessarily consecutive.  Use this information to retrieve a range of
> pfnmap pfns in a single pass.
> 
> With optimal mappings and alignment on systems with 1GB pud and 4KB
> page size, this reduces iterations for DMA mapping PCI BARs by a
> factor of 256K.  In real world testing, the overhead of iterating
> pfns for a VM DMA mapping a 32GB PCI BAR is reduced from ~1s to
> sub-millisecond overhead.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason


* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-19 20:32           ` Mitchell Augustin
@ 2025-02-26 17:55             ` Alex Williamson
  2025-02-27 20:22               ` Mitchell Augustin
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2025-02-26 17:55 UTC (permalink / raw)
  To: Mitchell Augustin; +Cc: Peter Xu, kvm, linux-kernel, clg, jgg, willy

On Wed, 19 Feb 2025 14:32:35 -0600
Mitchell Augustin <mitchell.augustin@canonical.com> wrote:

> > Thanks for the review and testing!  
> 
> Sure thing, thanks for the patch set!
> 
> If you happen to have a few minutes, I'm struggling to understand the
> epfn computation and would appreciate some insight.

Sorry, this slipped off my todo list for a few days.

> My current understanding (very possibly incorrect):
> - epfn is intended to be the last page frame number that can be
> represented at the mapping level corresponding to addr_mask. (so, if
> addr_mask == PUD_MASK, epfn would be the highest pfn still in PUD
> level).

Actually epfn is the first pfn of the next addr_mask level page.  The
value in the parens (*pfn | (~addr_mask >> PAGE_SHIFT)) is the last pfn
within the same level page.  We could do it either way, it's just a
matter of where the +1 gets added.

> - ret should be == npages if all pfns in the requested vma are within
> the memory hierarchy level denoted by addr_mask. If npages is more
> than can be represented at that level, ret == the max number of page
> frames representable at addr_mask level.

Yes.

> - - (if the second case is true, that means we were not able to obtain
> all requested pages due to running out of PFNs at the current mapping
> level)

vaddr_get_pfns() is called again if we haven't reached npage.
Specifically, from vfio_pin_pages_remote() we hit the added continue
under the !batch->size branch.  If the pfnmaps are fully PUD aligned,
we'll call vaddr_get_pfns() once per PUD_SIZE, vfio_pin_pages_remote()
will only return with the full requested npage value, and we'll only
call vfio_iommu_map() once.  The latter has always been true, the
difference is the number of times we iterate calling vaddr_get_pfns().

> If the above is all correct, what is confusing me is where the "(*pfn)
> | " comes into this equation. If epfn is meant to be the last pfn
> representable at addr_mask level of the hierarchy, wouldn't that be
> represented by (~pgmask >> PAGE_SHIFT) alone?

(~addr_mask >> PAGE_SHIFT) gives us the last pfn relative to zero.  We
want the last pfn relative to *pfn, therefore we OR in *pfn.  The OR
handles any offset that *pfn might have within the addr_mask page, so
this operation always provides the last pfn of the addr_mask page
relative to *pfn.  +1 because we want to calculate the number of pfns
until the next page.  Thanks,

Alex



* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
  2025-02-19  8:31   ` David Hildenbrand
@ 2025-02-26 19:54     ` Alex Williamson
  2025-02-26 20:05       ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: Alex Williamson @ 2025-02-26 19:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm

On Wed, 19 Feb 2025 09:31:48 +0100
David Hildenbrand <david@redhat.com> wrote:

> On 18.02.25 23:22, Alex Williamson wrote:
> > follow_pfnmap_start() walks the page table for a given address and
> > fills out the struct follow_pfnmap_args in pfnmap_args_setup().
> > The address mask of the page table level is already provided to this
> > latter function for calculating the pfn.  This address mask can also
> > be useful for the caller to determine the extent of the contiguous
> > mapping.
> > 
> > For example, vfio-pci now supports huge_fault for pfnmaps and is able
> > to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
> > PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
> > for a contiguous pfn range.  Providing the mapping address mask allows
> > us to skip the extent of the mapping level.  Assuming a 1GB pud level
> > and 4KB page size, iterations are reduced by a factor of 256K.  In wall
> > clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
> > 
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: linux-mm@kvack.org
> > Reviewed-by: Peter Xu <peterx@redhat.com>
> > Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> > Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---  
> 
> Acked-by: David Hildenbrand <david@redhat.com>

Thanks, David!

Is there any objection from mm folks to bring this in through the vfio
tree?

Patch: https://lore.kernel.org/all/20250218222209.1382449-6-alex.williamson@redhat.com/
Series: https://lore.kernel.org/all/20250218222209.1382449-1-alex.williamson@redhat.com/

Thanks,
Alex


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args
  2025-02-26 19:54     ` Alex Williamson
@ 2025-02-26 20:05       ` David Hildenbrand
  0 siblings, 0 replies; 23+ messages in thread
From: David Hildenbrand @ 2025-02-26 20:05 UTC (permalink / raw)
  To: Alex Williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm

On 26.02.25 20:54, Alex Williamson wrote:
> On Wed, 19 Feb 2025 09:31:48 +0100
> David Hildenbrand <david@redhat.com> wrote:
> 
>> On 18.02.25 23:22, Alex Williamson wrote:
>>> follow_pfnmap_start() walks the page table for a given address and
>>> fills out the struct follow_pfnmap_args in pfnmap_args_setup().
>>> The address mask of the page table level is already provided to this
>>> latter function for calculating the pfn.  This address mask can also
>>> be useful for the caller to determine the extent of the contiguous
>>> mapping.
>>>
>>> For example, vfio-pci now supports huge_fault for pfnmaps and is able
>>> to insert pud and pmd mappings.  When we DMA map these pfnmaps, ex.
>>> PCI MMIO BARs, we iterate follow_pfnmap_start() to get each pfn to test
>>> for a contiguous pfn range.  Providing the mapping address mask allows
>>> us to skip the extent of the mapping level.  Assuming a 1GB pud level
>>> and 4KB page size, iterations are reduced by a factor of 256K.  In wall
>>> clock time, mapping a 32GB PCI BAR is reduced from ~1s to <1ms.
>>>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: linux-mm@kvack.org
>>> Reviewed-by: Peter Xu <peterx@redhat.com>
>>> Reviewed-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
>>> Tested-by: "Mitchell Augustin" <mitchell.augustin@canonical.com>
>>> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>
>> Acked-by: David Hildenbrand <david@redhat.com>
> 
> Thanks, David!
> 
> Is there any objection from mm folks to bring this in through the vfio
> tree?

I assume it's fine. Andrew is on CC, so he should be aware of it. I'm 
not aware of possible clashes.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps
  2025-02-26 17:55             ` Alex Williamson
@ 2025-02-27 20:22               ` Mitchell Augustin
  0 siblings, 0 replies; 23+ messages in thread
From: Mitchell Augustin @ 2025-02-27 20:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Peter Xu, kvm, linux-kernel, clg, jgg, willy

Thanks Alex, that's super helpful and makes sense to me now!

-Mitchell

On Wed, Feb 26, 2025 at 11:55 AM Alex Williamson
<alex.williamson@redhat.com> wrote:
>
> On Wed, 19 Feb 2025 14:32:35 -0600
> Mitchell Augustin <mitchell.augustin@canonical.com> wrote:
>
> > > Thanks for the review and testing!
> >
> > Sure thing, thanks for the patch set!
> >
> > If you happen to have a few minutes, I'm struggling to understand the
> > epfn computation and would appreciate some insight.
>
> Sorry, this slipped off my todo list for a few days.
>
> > My current understanding (very possibly incorrect):
> > - epfn is intended to be the last page frame number that can be
> > represented at the mapping level corresponding to addr_mask. (so, if
> > addr_mask == PUD_MASK, epfn would be the highest pfn still in PUD
> > level).
>
> Actually epfn is the first pfn of the next addr_mask level page.  The
> value in the parens (*pfn | (~addr_mask >> PAGE_SHIFT)) is the last pfn
> within the same level page.  We could do it either way, it's just a
> matter of where the +1 gets added.
>
> > - ret should be == npages if all pfns in the requested vma are within
> > the memory hierarchy level denoted by addr_mask. If npages is more
> > than can be represented at that level, ret == the max number of page
> > frames representable at addr_mask level.
>
> Yes.
>
> > - - (if the second case is true, that means we were not able to obtain
> > all requested pages due to running out of PFNs at the current mapping
> > level)
>
> vaddr_get_pfns() is called again if we haven't reached npage.
> Specifically, from vfio_pin_pages_remote() we hit the added continue
> under the !batch->size branch.  If the pfnmaps are fully PUD aligned,
> we'll call vaddr_get_pfns() once per PUD_SIZE, vfio_pin_pages_remote()
> will only return with the full requested npage value, and we'll only
> call vfio_iommu_map() once.  The latter has always been true, the
> difference is the number of times we iterate calling vaddr_get_pfns().
>
> > If the above is all correct, what is confusing me is where the "(*pfn)
> > | " comes into this equation. If epfn is meant to be the last pfn
> > representable at addr_mask level of the hierarchy, wouldn't that be
> > represented by (~pgmask >> PAGE_SHIFT) alone?
>
> (~addr_mask >> PAGE_SHIFT) gives us the last pfn relative to zero.  We
> want the last pfn relative to *pfn, therefore we OR in *pfn.  The OR
> handles any offset that *pfn might have within the addr_mask page, so
> this operation always provides the last pfn of the addr_mask page
> relative to *pfn.  +1 because we want to calculate the number of pfns
> until the next page.  Thanks,
>
> Alex
>


-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps
  2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
                   ` (6 preceding siblings ...)
  2025-02-19  2:37 ` [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Mitchell Augustin
@ 2025-02-28 16:32 ` Alex Williamson
  7 siblings, 0 replies; 23+ messages in thread
From: Alex Williamson @ 2025-02-28 16:32 UTC (permalink / raw)
  To: alex.williamson
  Cc: kvm, linux-kernel, peterx, mitchell.augustin, clg, jgg, akpm,
	linux-mm, david, willy

On Tue, 18 Feb 2025 15:22:00 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> v2:
>  - Rewrapped comment block in 3/6
>  - Added 4/6 to use consistent types (Jason)
>  - Renamed s/pgmask/addr_mask/ (David)
>  - Updated 6/6 with proposed epfn algorithm (Jason)
>  - Applied and retained sign-offs for all but 6/6 where the epfn
>    calculation changed
> 
> v1: https://lore.kernel.org/all/20250205231728.2527186-1-alex.williamson@redhat.com/
> 
> As GPU BAR sizes increase, the overhead of DMA mapping pfnmap ranges has
> become a significant overhead for VMs making use of device assignment.
> Not only does each mapping require upwards of a few seconds, but BARs
> are mapped in and out of the VM address space multiple times during
> guest boot.  Also factor in that multi-GPU configurations are
> increasingly commonplace and BAR sizes are continuing to increase.
> Configurations today can already be delayed minutes during guest boot.
> 
> We've taken steps to make Linux a better guest by batching PCI BAR
> sizing operations[1], but it only provides an incremental improvement.
> 
> This series attempts to fully address the issue by leveraging the huge
> pfnmap support added in v6.12.  When we insert pfnmaps using pud and pmd
> mappings, we can later take advantage of the knowledge of the mapping
> level page mask to iterate on the relevant mapping stride.  In the
> commonly achieved optimal case, this results in a reduction of pfn
> lookups by a factor of 256k.  For a local test system, an overhead of
> ~1s for DMA mapping a 32GB PCI BAR is reduced to sub-millisecond (8M
> page sized operations reduced to 32 pud sized operations).
> 
> Please review, test, and provide feedback.  I hope that mm folks can
> ack the trivial follow_pfnmap_args update to provide the mapping level
> page mask.  Naming is hard, so any preference other than pgmask is
> welcome.  Thanks,
> 
> Alex
> 
> [1]https://lore.kernel.org/all/20250120182202.1878581-1-alex.williamson@redhat.com/
> 
> 
> Alex Williamson (6):
>   vfio/type1: Catch zero from pin_user_pages_remote()
>   vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch
>   vfio/type1: Use vfio_batch for vaddr_get_pfns()
>   vfio/type1: Use consistent types for page counts
>   mm: Provide address mask in struct follow_pfnmap_args
>   vfio/type1: Use mapping page mask for pfnmaps
> 
>  drivers/vfio/vfio_iommu_type1.c | 123 ++++++++++++++++++++------------
>  include/linux/mm.h              |   2 +
>  mm/memory.c                     |   1 +
>  3 files changed, 80 insertions(+), 46 deletions(-)

With David's blessing relative to mm, applied to vfio next branch for
v6.15.  Thanks all for the reviews and testing!

Alex


^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2025-02-28 16:32 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-18 22:22 [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Alex Williamson
2025-02-18 22:22 ` [PATCH v2 1/6] vfio/type1: Catch zero from pin_user_pages_remote() Alex Williamson
2025-02-18 22:22 ` [PATCH v2 2/6] vfio/type1: Convert all vaddr_get_pfns() callers to use vfio_batch Alex Williamson
2025-02-18 22:22 ` [PATCH v2 3/6] vfio/type1: Use vfio_batch for vaddr_get_pfns() Alex Williamson
2025-02-18 22:22 ` [PATCH v2 4/6] vfio/type1: Use consistent types for page counts Alex Williamson
2025-02-18 22:46   ` Peter Xu
2025-02-19  0:55     ` Mitchell Augustin
2025-02-25  0:17   ` Jason Gunthorpe
2025-02-18 22:22 ` [PATCH v2 5/6] mm: Provide address mask in struct follow_pfnmap_args Alex Williamson
2025-02-19  8:31   ` David Hildenbrand
2025-02-26 19:54     ` Alex Williamson
2025-02-26 20:05       ` David Hildenbrand
2025-02-18 22:22 ` [PATCH v2 6/6] vfio/type1: Use mapping page mask for pfnmaps Alex Williamson
2025-02-18 22:47   ` Peter Xu
2025-02-18 23:14     ` Alex Williamson
2025-02-19  2:36       ` Mitchell Augustin
2025-02-19 15:08         ` Alex Williamson
2025-02-19 20:32           ` Mitchell Augustin
2025-02-26 17:55             ` Alex Williamson
2025-02-27 20:22               ` Mitchell Augustin
2025-02-25  0:17   ` Jason Gunthorpe
2025-02-19  2:37 ` [PATCH v2 0/6] vfio: Improve DMA mapping performance for huge pfnmaps Mitchell Augustin
2025-02-28 16:32 ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox