* [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote()
@ 2025-08-14  6:47 lizhe.67
  2025-08-14  6:47 ` [PATCH v5 1/5] mm: introduce num_pages_contiguous() lizhe.67
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg; +Cc: torvalds, kvm, lizhe.67, linux-mm, farman

From: Li Zhe <lizhe.67@bytedance.com>

This patchset is an integration of the two previous patchsets[1][2].

When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs the statistics
counting operations individually for each page. This can lead to
significant performance overhead, especially when dealing with large
ranges of pages.

The function vfio_unpin_pages_remote() has a similar issue, where executing
put_pfn() individually for each pfn incurs considerable overhead.

This patchset optimizes the performance of these functions by batching
the inefficient operations mentioned above.

The first two patches optimize the performance of vfio_pin_pages_remote(),
while the remaining patches optimize the performance of
vfio_unpin_pages_remote().

The performance test results, based on v6.16, for completing a 16G VFIO
MAP/UNMAP DMA operation, obtained through the unit test[3] with slight
modifications[4], are as follows.
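
For context, a rough sketch of the timing methodology looks like the
following; it is not the actual harness in [3]/[4], and it assumes a VFIO
container fd that has already been set up (group attach and VFIO_SET_IOMMU
elided). The MADV_HUGEPAGE / MAP_POPULATE / HUGETLBFS rows differ only in
how the 16G buffer is allocated and populated.

#include <linux/vfio.h>
#include <sys/ioctl.h>
#include <time.h>

#define MAP_SIZE	(16UL << 30)	/* 16G, matching the results below */

/* Time one VFIO MAP DMA of a pre-populated buffer; UNMAP is timed alike. */
static double time_vfio_map_dma(int container_fd, void *buf)
{
	struct vfio_iommu_type1_dma_map map = {
		.argsz = sizeof(map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (unsigned long)buf,
		.iova  = 0,			/* illustrative iova */
		.size  = MAP_SIZE,
	};
	struct vfio_iommu_type1_dma_unmap unmap = {
		.argsz = sizeof(unmap),
		.iova  = 0,
		.size  = MAP_SIZE,
	};
	struct timespec t0, t1;
	double elapsed;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

	/* the UNMAP path is measured the same way around this ioctl */
	ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);
	return elapsed;
}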

Base(6.16):
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.049 s (328.5 GB/s)
VFIO UNMAP DMA in 0.141 s (113.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.268 s (59.6 GB/s)
VFIO UNMAP DMA in 0.307 s (52.2 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.051 s (310.9 GB/s)
VFIO UNMAP DMA in 0.135 s (118.6 GB/s)

With this patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.025 s (633.1 GB/s)
VFIO UNMAP DMA in 0.044 s (363.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.249 s (64.2 GB/s)
VFIO UNMAP DMA in 0.289 s (55.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.030 s (533.2 GB/s)
VFIO UNMAP DMA in 0.044 s (361.3 GB/s)

For large folios, we achieve an over 40% performance improvement for VFIO
MAP DMA and an over 67% performance improvement for VFIO UNMAP DMA. For
small folios, the performance test results show a slight improvement over
the performance before optimization.
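
(For reference, those figures follow from the timings above: MAP DMA with
MADV_HUGEPAGE drops from 0.049 s to 0.025 s, i.e. (0.049 - 0.025) / 0.049 ~
49% less time, and UNMAP DMA drops from 0.141 s to 0.044 s, i.e.
(0.141 - 0.044) / 0.141 ~ 69% less time.)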

[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://lore.kernel.org/all/20250620032344.13382-1-lizhe.67@bytedance.com/#t
[3]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[4]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/

Li Zhe (5):
  mm: introduce num_pages_contiguous()
  vfio/type1: optimize vfio_pin_pages_remote()
  vfio/type1: batch vfio_find_vpfn() in function
    vfio_unpin_pages_remote()
  vfio/type1: introduce a new member has_rsvd for struct vfio_dma
  vfio/type1: optimize vfio_unpin_pages_remote()

 drivers/vfio/vfio_iommu_type1.c | 112 ++++++++++++++++++++++++++------
 include/linux/mm.h              |   7 +-
 include/linux/mm_inline.h       |  35 ++++++++++
 3 files changed, 132 insertions(+), 22 deletions(-)

---
Changelogs:

v4->v5:
- Update the performance test results based on v6.16.
- Re-implement num_pages_contiguous() without relying on nth_page(),
  and relocate it into mm_inline.h.
- Merge the fixup patch into the original patch (patch #2).

v3->v4:
- Fix an indentation issue in patch #2.

v2->v3:
- Add a "Suggested-by" and a "Reviewed-by" tag.
- Address the compilation errors introduced by patch #1.
- Resolve several variable type issues.
- Add clarification for function num_pages_contiguous().

v1->v2:
- Update the performance test results.
- The function num_pages_contiguous() is extracted and placed in a
  separate commit.
- The phrase 'for large folio' has been removed from the patchset title.

v4: https://lore.kernel.org/all/20250710085355.54208-1-lizhe.67@bytedance.com/
v3: https://lore.kernel.org/all/20250707064950.72048-1-lizhe.67@bytedance.com/
v2: https://lore.kernel.org/all/20250704062602.33500-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20250630072518.31846-1-lizhe.67@bytedance.com/
-- 
2.20.1




* [PATCH v5 1/5] mm: introduce num_pages_contiguous()
  2025-08-14  6:47 [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote() lizhe.67
@ 2025-08-14  6:47 ` lizhe.67
  2025-08-14  6:54   ` David Hildenbrand
  2025-08-27 18:10   ` Alex Williamson
  2025-08-14  6:47 ` [PATCH v5 2/5] vfio/type1: optimize vfio_pin_pages_remote() lizhe.67
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg
  Cc: torvalds, kvm, lizhe.67, linux-mm, farman, Jason Gunthorpe

From: Li Zhe <lizhe.67@bytedance.com>

Let's add a simple helper for determining the number of contiguous pages
that represent contiguous PFNs.

In an ideal world, this helper would be simpler or not even required.
Unfortunately, on some configs we still have to maintain (SPARSEMEM
without VMEMMAP), the memmap is allocated per memory section, and we might
run into weird corner cases of false positives when blindly testing for
contiguous pages only.

One example of such false positives would be a memory section-sized hole
that does not have a memmap. The surrounding memory sections might get
"struct pages" that are contiguous, but the PFNs are actually not.

This helper will, for example, be useful for determining contiguous PFNs
in a GUP result, to batch further operations across returned "struct
page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
process.
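
For illustration, a hypothetical caller could consume such a GUP result in
PFN-contiguous runs roughly as follows (do_something_with_pfn_range() is a
placeholder for whatever per-run operation is being batched):

/* Sketch only: 'pages'/'npages' are assumed to come from a prior GUP call. */
static void process_pinned_range(struct page **pages, size_t npages)
{
	size_t done = 0;

	while (done < npages) {
		size_t nr = num_pages_contiguous(&pages[done], npages - done);
		unsigned long pfn = page_to_pfn(pages[done]);

		/* operate once on the PFN-contiguous run [pfn, pfn + nr) */
		do_something_with_pfn_range(pfn, nr);
		done += nr;
	}
}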

Implementation based on Linus' suggestions to avoid new usage of
nth_page() where avoidable.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h        |  7 ++++++-
 include/linux/mm_inline.h | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..ead6724972cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1763,7 +1763,12 @@ static inline unsigned long page_to_section(const struct page *page)
 {
 	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
 }
-#endif
+#else /* !SECTION_IN_PAGE_FLAGS */
+static inline unsigned long page_to_section(const struct page *page)
+{
+	return 0;
+}
+#endif /* SECTION_IN_PAGE_FLAGS */
 
 /**
  * folio_pfn - Return the Page Frame Number of a folio.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 89b518ff097e..5ea23891fe4c 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -616,4 +616,39 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
 	return true;
 }
 
+/**
+ * num_pages_contiguous() - determine the number of contiguous pages
+ *			    that represent contiguous PFNs
+ * @pages: an array of page pointers
+ * @nr_pages: length of the array, at least 1
+ *
+ * Determine the number of contiguous pages that represent contiguous PFNs
+ * in @pages, starting from the first page.
+ *
+ * In kernel configs where contiguous pages might not imply contiguous PFNs
+ * over memory section boundaries, this function will stop at the memory
+ * section boundary.
+ *
+ * Returns the number of contiguous pages.
+ */
+static inline size_t num_pages_contiguous(struct page **pages, size_t nr_pages)
+{
+	struct page *cur_page = pages[0];
+	unsigned long section = page_to_section(cur_page);
+	size_t i;
+
+	for (i = 1; i < nr_pages; i++) {
+		if (++cur_page != pages[i])
+			break;
+		/*
+		 * In unproblematic kernel configs, page_to_section() == 0 and
+		 * the whole check will get optimized out.
+		 */
+		if (page_to_section(cur_page) != section)
+			break;
+	}
+
+	return i;
+}
+
 #endif
-- 
2.20.1




* [PATCH v5 2/5] vfio/type1: optimize vfio_pin_pages_remote()
  2025-08-14  6:47 [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote() lizhe.67
  2025-08-14  6:47 ` [PATCH v5 1/5] mm: introduce num_pages_contiguous() lizhe.67
@ 2025-08-14  6:47 ` lizhe.67
  2025-08-14  6:47 ` [PATCH v5 3/5] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg; +Cc: torvalds, kvm, lizhe.67, linux-mm, farman

From: Li Zhe <lizhe.67@bytedance.com>

When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs the statistics
counting operations individually for each page. This can lead to
significant performance overhead, especially when dealing with large
ranges of pages. Batch processing of these accounting operations can
effectively enhance performance.

In addition, the pages obtained through longterm GUP are neither invalid
nor reserved. Therefore, we can reduce the overhead associated with some
calls to the function is_invalid_reserved_pfn().

The performance test results for completing the 16G VFIO IOMMU DMA mapping
are as follows.

Base(v6.16):
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.049 s (328.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.268 s (59.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.051 s (310.9 GB/s)

With this patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.025 s (629.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.253 s (63.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.030 s (530.5 GB/s)

For large folios, we achieve an over 40% performance improvement.
For small folios, the performance test results indicate a slight
improvement.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
---
 drivers/vfio/vfio_iommu_type1.c | 84 ++++++++++++++++++++++++++++-----
 1 file changed, 72 insertions(+), 12 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index f8d68fe77b41..7829b5e268c2 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -37,6 +37,7 @@
 #include <linux/vfio.h>
 #include <linux/workqueue.h>
 #include <linux/notifier.h>
+#include <linux/mm_inline.h>
 #include "vfio.h"
 
 #define DRIVER_VERSION  "0.2"
@@ -318,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
 /*
  * Helper Functions for host iova-pfn list
  */
-static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+
+/*
+ * Find the highest vfio_pfn that overlaps the range
+ * [iova_start, iova_end) in the rb tree.
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+		dma_addr_t iova_start, dma_addr_t iova_end)
 {
 	struct vfio_pfn *vpfn;
 	struct rb_node *node = dma->pfn_list.rb_node;
@@ -326,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	while (node) {
 		vpfn = rb_entry(node, struct vfio_pfn, node);
 
-		if (iova < vpfn->iova)
+		if (iova_end <= vpfn->iova)
 			node = node->rb_left;
-		else if (iova > vpfn->iova)
+		else if (iova_start > vpfn->iova)
 			node = node->rb_right;
 		else
 			return vpfn;
@@ -336,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	return NULL;
 }
 
+static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+{
+	return vfio_find_vpfn_range(dma, iova, iova + 1);
+}
+
 static void vfio_link_pfn(struct vfio_dma *dma,
 			  struct vfio_pfn *new)
 {
@@ -614,6 +626,39 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	return ret;
 }
 
+
+static long vpfn_pages(struct vfio_dma *dma,
+		dma_addr_t iova_start, long nr_pages)
+{
+	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
+	struct vfio_pfn *top = vfio_find_vpfn_range(dma, iova_start, iova_end);
+	long ret = 1;
+	struct vfio_pfn *vpfn;
+	struct rb_node *prev;
+	struct rb_node *next;
+
+	if (likely(!top))
+		return 0;
+
+	prev = next = &top->node;
+
+	while ((prev = rb_prev(prev))) {
+		vpfn = rb_entry(prev, struct vfio_pfn, node);
+		if (vpfn->iova < iova_start)
+			break;
+		ret++;
+	}
+
+	while ((next = rb_next(next))) {
+		vpfn = rb_entry(next, struct vfio_pfn, node);
+		if (vpfn->iova >= iova_end)
+			break;
+		ret++;
+	}
+
+	return ret;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -687,32 +732,47 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 		 * and rsvd here, and therefore continues to use the batch.
 		 */
 		while (true) {
+			long nr_pages, acct_pages = 0;
+
 			if (pfn != *pfn_base + pinned ||
 			    rsvd != is_invalid_reserved_pfn(pfn))
 				goto out;
 
+			/*
+			 * Using GUP with the FOLL_LONGTERM in
+			 * vaddr_get_pfns() will not return invalid
+			 * or reserved pages.
+			 */
+			nr_pages = num_pages_contiguous(
+					&batch->pages[batch->offset],
+					batch->size);
+			if (!rsvd) {
+				acct_pages = nr_pages;
+				acct_pages -= vpfn_pages(dma, iova, nr_pages);
+			}
+
 			/*
 			 * Reserved pages aren't counted against the user,
 			 * externally pinned pages are already counted against
 			 * the user.
 			 */
-			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
+			if (acct_pages) {
 				if (!dma->lock_cap &&
-				    mm->locked_vm + lock_acct + 1 > limit) {
+				    mm->locked_vm + lock_acct + acct_pages > limit) {
 					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 						__func__, limit << PAGE_SHIFT);
 					ret = -ENOMEM;
 					goto unpin_out;
 				}
-				lock_acct++;
+				lock_acct += acct_pages;
 			}
 
-			pinned++;
-			npage--;
-			vaddr += PAGE_SIZE;
-			iova += PAGE_SIZE;
-			batch->offset++;
-			batch->size--;
+			pinned += nr_pages;
+			npage -= nr_pages;
+			vaddr += PAGE_SIZE * nr_pages;
+			iova += PAGE_SIZE * nr_pages;
+			batch->offset += nr_pages;
+			batch->size -= nr_pages;
 
 			if (!batch->size)
 				break;
-- 
2.20.1




* [PATCH v5 3/5] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()
  2025-08-14  6:47 [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote() lizhe.67
  2025-08-14  6:47 ` [PATCH v5 1/5] mm: introduce num_pages_contiguous() lizhe.67
  2025-08-14  6:47 ` [PATCH v5 2/5] vfio/type1: optimize vfio_pin_pages_remote() lizhe.67
@ 2025-08-14  6:47 ` lizhe.67
  2025-08-14  6:47 ` [PATCH v5 4/5] vfio/type1: introduce a new member has_rsvd for struct vfio_dma lizhe.67
  2025-08-14  6:47 ` [PATCH v5 5/5] vfio/type1: optimize vfio_unpin_pages_remote() lizhe.67
  4 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg; +Cc: torvalds, kvm, lizhe.67, linux-mm, farman

From: Li Zhe <lizhe.67@bytedance.com>

The function vpfn_pages() determines the number of vpfn nodes in the
vpfn rb tree within a specified iova range. This allows us to avoid
searching for each vpfn individually in vfio_unpin_pages_remote().
This patch batches the vfio_find_vpfn() calls in that function.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 7829b5e268c2..dbacd852efae 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -802,16 +802,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    unsigned long pfn, unsigned long npage,
 				    bool do_accounting)
 {
-	long unlocked = 0, locked = 0;
+	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
 	long i;
 
-	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
-		if (put_pfn(pfn++, dma->prot)) {
+	for (i = 0; i < npage; i++)
+		if (put_pfn(pfn++, dma->prot))
 			unlocked++;
-			if (vfio_find_vpfn(dma, iova))
-				locked++;
-		}
-	}
 
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
-- 
2.20.1




* [PATCH v5 4/5] vfio/type1: introduce a new member has_rsvd for struct vfio_dma
  2025-08-14  6:47 [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote() lizhe.67
                   ` (2 preceding siblings ...)
  2025-08-14  6:47 ` [PATCH v5 3/5] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
@ 2025-08-14  6:47 ` lizhe.67
  2025-08-14  6:47 ` [PATCH v5 5/5] vfio/type1: optimize vfio_unpin_pages_remote() lizhe.67
  4 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg; +Cc: torvalds, kvm, lizhe.67, linux-mm, farman

From: Li Zhe <lizhe.67@bytedance.com>

Introduce a new member, has_rsvd, for struct vfio_dma. It indicates
whether the region represented by this vfio_dma contains any reserved
or invalid pfns: if it is true, at least one pfn in this region is
either reserved or invalid.

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index dbacd852efae..30e1b54f6c25 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -93,6 +93,7 @@ struct vfio_dma {
 	bool			iommu_mapped;
 	bool			lock_cap;	/* capable(CAP_IPC_LOCK) */
 	bool			vaddr_invalid;
+	bool			has_rsvd;	/* has 1 or more rsvd pfns */
 	struct task_struct	*task;
 	struct rb_root		pfn_list;	/* Ex-user pinned pfn list */
 	unsigned long		*bitmap;
@@ -782,6 +783,7 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	}
 
 out:
+	dma->has_rsvd |= rsvd;
 	ret = vfio_lock_acct(dma, lock_acct, false);
 
 unpin_out:
-- 
2.20.1




* [PATCH v5 5/5] vfio/type1: optimize vfio_unpin_pages_remote()
  2025-08-14  6:47 [PATCH v5 0/5] vfio/type1: optimize vfio_pin_pages_remote() and vfio_unpin_pages_remote() lizhe.67
                   ` (3 preceding siblings ...)
  2025-08-14  6:47 ` [PATCH v5 4/5] vfio/type1: introduce a new member has_rsvd for struct vfio_dma lizhe.67
@ 2025-08-14  6:47 ` lizhe.67
  4 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  6:47 UTC (permalink / raw)
  To: alex.williamson, david, jgg
  Cc: torvalds, kvm, lizhe.67, linux-mm, farman, Jason Gunthorpe

From: Li Zhe <lizhe.67@bytedance.com>

When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
put_pfn() operations for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

It would be very rare for reserved and non-reserved PFNs to be mixed
within the same range, so this patch uses the has_rsvd member introduced
in the previous patch to determine whether put_pfn() operations can be
batched. Moreover, compared to put_pfn(),
unpin_user_page_range_dirty_lock() handles large folio scenarios more
efficiently.

The performance test results for completing the 16G VFIO IOMMU DMA
unmapping are as follows.

Base(v6.16):
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO UNMAP DMA in 0.141 s (113.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO UNMAP DMA in 0.307 s (52.2 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO UNMAP DMA in 0.135 s (118.6 GB/s)

With this patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO UNMAP DMA in 0.044 s (363.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO UNMAP DMA in 0.289 s (55.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO UNMAP DMA in 0.044 s (361.3 GB/s)

For large folios, we achieve an over 67% performance improvement for
VFIO UNMAP DMA. For small folios, the performance test results show a
slight improvement.

Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 drivers/vfio/vfio_iommu_type1.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 30e1b54f6c25..916cad80941c 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -800,17 +800,29 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 	return pinned;
 }
 
+static inline void put_valid_unreserved_pfns(unsigned long start_pfn,
+		unsigned long npage, int prot)
+{
+	unpin_user_page_range_dirty_lock(pfn_to_page(start_pfn), npage,
+					 prot & IOMMU_WRITE);
+}
+
 static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    unsigned long pfn, unsigned long npage,
 				    bool do_accounting)
 {
 	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
-	long i;
 
-	for (i = 0; i < npage; i++)
-		if (put_pfn(pfn++, dma->prot))
-			unlocked++;
+	if (dma->has_rsvd) {
+		unsigned long i;
 
+		for (i = 0; i < npage; i++)
+			if (put_pfn(pfn++, dma->prot))
+				unlocked++;
+	} else {
+		put_valid_unreserved_pfns(pfn, npage, dma->prot);
+		unlocked = npage;
+	}
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
 
-- 
2.20.1




* Re: [PATCH v5 1/5] mm: introduce num_pages_contiguous()
  2025-08-14  6:47 ` [PATCH v5 1/5] mm: introduce num_pages_contiguous() lizhe.67
@ 2025-08-14  6:54   ` David Hildenbrand
  2025-08-14  7:58     ` lizhe.67
  2025-08-27 18:10   ` Alex Williamson
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-08-14  6:54 UTC (permalink / raw)
  To: lizhe.67, alex.williamson, jgg
  Cc: torvalds, kvm, linux-mm, farman, Jason Gunthorpe

On 14.08.25 08:47, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> Let's add a simple helper for determining the number of contiguous pages
> that represent contiguous PFNs.
> 
> In an ideal world, this helper would be simpler or not even required.
> Unfortunately, on some configs we still have to maintain (SPARSEMEM
> without VMEMMAP), the memmap is allocated per memory section, and we might
> run into weird corner cases of false positives when blindly testing for
> contiguous pages only.
> 
> One example of such false positives would be a memory section-sized hole
> that does not have a memmap. The surrounding memory sections might get
> "struct pages" that are contiguous, but the PFNs are actually not.
> 
> This helper will, for example, be useful for determining contiguous PFNs
> in a GUP result, to batch further operations across returned "struct
> page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
> process.
> 
> Implementation based on Linus' suggestions to avoid new usage of
> nth_page() where avoidable.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   include/linux/mm.h        |  7 ++++++-
>   include/linux/mm_inline.h | 35 +++++++++++++++++++++++++++++++++++
>   2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ae97a0b8ec7..ead6724972cf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1763,7 +1763,12 @@ static inline unsigned long page_to_section(const struct page *page)
>   {
>   	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
>   }
> -#endif
> +#else /* !SECTION_IN_PAGE_FLAGS */
> +static inline unsigned long page_to_section(const struct page *page)
> +{
> +	return 0;
> +}
> +#endif /* SECTION_IN_PAGE_FLAGS */
>   
>   /**
>    * folio_pfn - Return the Page Frame Number of a folio.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 89b518ff097e..5ea23891fe4c 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -616,4 +616,39 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
>   	return true;
>   }
>   
> +/**
> + * num_pages_contiguous() - determine the number of contiguous pages
> + *			    that represent contiguous PFNs
> + * @pages: an array of page pointers
> + * @nr_pages: length of the array, at least 1
> + *
> + * Determine the number of contiguous pages that represent contiguous PFNs
> + * in @pages, starting from the first page.
> + *
> + * In kernel configs where contiguous pages might not imply contiguous PFNs
> + * over memory section boundaries, this function will stop at the memory
 > + * section boundary.

Jason suggested here instead:

"
In some kernel configs contiguous PFNs will not have contiguous struct
pages. In these configurations num_pages_contiguous() will return a
smaller than ideal number. The caller should continue to check for pfn
contiguity after each call to num_pages_contiguous().
"

-- 
Cheers

David / dhildenb




* Re: [PATCH v5 1/5] mm: introduce num_pages_contiguous()
  2025-08-14  6:54   ` David Hildenbrand
@ 2025-08-14  7:58     ` lizhe.67
  0 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-08-14  7:58 UTC (permalink / raw)
  To: david; +Cc: alex.williamson, farman, jgg, jgg, kvm, linux-mm, lizhe.67,
	torvalds

On Thu, 14 Aug 2025 08:54:44 +0200, david@redhat.com wrote:

> On 14.08.25 08:47, lizhe.67@bytedance.com wrote:
> > From: Li Zhe <lizhe.67@bytedance.com>
> > 
> > Let's add a simple helper for determining the number of contiguous pages
> > that represent contiguous PFNs.
> > 
> > In an ideal world, this helper would be simpler or not even required.
> > Unfortunately, on some configs we still have to maintain (SPARSEMEM
> > without VMEMMAP), the memmap is allocated per memory section, and we might
> > run into weird corner cases of false positives when blindly testing for
> > contiguous pages only.
> > 
> > One example of such false positives would be a memory section-sized hole
> > that does not have a memmap. The surrounding memory sections might get
> > "struct pages" that are contiguous, but the PFNs are actually not.
> > 
> > This helper will, for example, be useful for determining contiguous PFNs
> > in a GUP result, to batch further operations across returned "struct
> > page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
> > process.
> > 
> > Implementation based on Linus' suggestions to avoid new usage of
> > nth_page() where avoidable.
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > Co-developed-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >   include/linux/mm.h        |  7 ++++++-
> >   include/linux/mm_inline.h | 35 +++++++++++++++++++++++++++++++++++
> >   2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1ae97a0b8ec7..ead6724972cf 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1763,7 +1763,12 @@ static inline unsigned long page_to_section(const struct page *page)
> >   {
> >   	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
> >   }
> > -#endif
> > +#else /* !SECTION_IN_PAGE_FLAGS */
> > +static inline unsigned long page_to_section(const struct page *page)
> > +{
> > +	return 0;
> > +}
> > +#endif /* SECTION_IN_PAGE_FLAGS */
> >   
> >   /**
> >    * folio_pfn - Return the Page Frame Number of a folio.
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index 89b518ff097e..5ea23891fe4c 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -616,4 +616,39 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
> >   	return true;
> >   }
> >   
> > +/**
> > + * num_pages_contiguous() - determine the number of contiguous pages
> > + *			    that represent contiguous PFNs
> > + * @pages: an array of page pointers
> > + * @nr_pages: length of the array, at least 1
> > + *
> > + * Determine the number of contiguous pages that represent contiguous PFNs
> > + * in @pages, starting from the first page.
> > + *
> > + * In kernel configs where contiguous pages might not imply contiguous PFNs
> > + * over memory section boundaries, this function will stop at the memory
>  > + * section boundary.
> 
> Jason suggested here instead:
> 
> "
> In some kernel configs contiguous PFNs will not have contiguous struct
> pages. In these configurations num_pages_contiguous() will return a
> smaller than ideal number. The caller should continue to check for pfn
> contiguity after each call to num_pages_contiguous().
> "

Thank you for the reminder! The comment here should be revised as
follows:

/**
 * num_pages_contiguous() - determine the number of contiguous pages
 *			    that represent contiguous PFNs
 * @pages: an array of page pointers
 * @nr_pages: length of the array, at least 1
 *
 * Determine the number of contiguous pages that represent contiguous PFNs
 * in @pages, starting from the first page.
 *
 * In some kernel configs contiguous PFNs will not have contiguous struct
 * pages. In these configurations num_pages_contiguous() will return a
 * smaller than ideal number. The caller should continue to check for pfn
 * contiguity after each call to num_pages_contiguous().
 *
 * Returns the number of contiguous pages.
 */

Thanks,
Zhe



* Re: [PATCH v5 1/5] mm: introduce num_pages_contiguous()
  2025-08-14  6:47 ` [PATCH v5 1/5] mm: introduce num_pages_contiguous() lizhe.67
  2025-08-14  6:54   ` David Hildenbrand
@ 2025-08-27 18:10   ` Alex Williamson
  2025-09-01  3:25     ` lizhe.67
  1 sibling, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2025-08-27 18:10 UTC (permalink / raw)
  To: lizhe.67
  Cc: david, jgg, torvalds, kvm, linux-mm, farman, Jason Gunthorpe,
	Matthew Wilcox (Oracle), Andrew Morton

On Thu, 14 Aug 2025 14:47:10 +0800
lizhe.67@bytedance.com wrote:

> From: Li Zhe <lizhe.67@bytedance.com>
> 
> Let's add a simple helper for determining the number of contiguous pages
> that represent contiguous PFNs.
> 
> In an ideal world, this helper would be simpler or not even required.
> Unfortunately, on some configs we still have to maintain (SPARSEMEM
> without VMEMMAP), the memmap is allocated per memory section, and we might
> run into weird corner cases of false positives when blindly testing for
> contiguous pages only.
> 
> One example of such false positives would be a memory section-sized hole
> that does not have a memmap. The surrounding memory sections might get
> "struct pages" that are contiguous, but the PFNs are actually not.
> 
> This helper will, for example, be useful for determining contiguous PFNs
> in a GUP result, to batch further operations across returned "struct
> page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
> process.
> 
> Implementation based on Linus' suggestions to avoid new usage of
> nth_page() where avoidable.
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm.h        |  7 ++++++-
>  include/linux/mm_inline.h | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 1 deletion(-)


Does this need any re-evaluation after Willy's series?[1]  Patch 2/
changes page_to_section() to memdesc_section() which takes a new
memdesc_flags_t, ie. page->flags.  The conversion appears trivial, but
mm has many subtleties.

Ideally we could also avoid merge-time fixups for linux-next and
mainline.

Andrew, are you open to topic branch for Willy's series that I could
merge into vfio/next when it's considered stable?  Thanks,

Alex

[1]https://lore.kernel.org/all/20250805172307.1302730-3-willy@infradead.org/T/#u

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1ae97a0b8ec7..ead6724972cf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1763,7 +1763,12 @@ static inline unsigned long page_to_section(const struct page *page)
>  {
>  	return (page->flags >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
>  }
> -#endif
> +#else /* !SECTION_IN_PAGE_FLAGS */
> +static inline unsigned long page_to_section(const struct page *page)
> +{
> +	return 0;
> +}
> +#endif /* SECTION_IN_PAGE_FLAGS */
>  
>  /**
>   * folio_pfn - Return the Page Frame Number of a folio.
> diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> index 89b518ff097e..5ea23891fe4c 100644
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -616,4 +616,39 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
>  	return true;
>  }
>  
> +/**
> + * num_pages_contiguous() - determine the number of contiguous pages
> + *			    that represent contiguous PFNs
> + * @pages: an array of page pointers
> + * @nr_pages: length of the array, at least 1
> + *
> + * Determine the number of contiguous pages that represent contiguous PFNs
> + * in @pages, starting from the first page.
> + *
> + * In kernel configs where contiguous pages might not imply contiguous PFNs
> + * over memory section boundaries, this function will stop at the memory
> + * section boundary.
> + *
> + * Returns the number of contiguous pages.
> + */
> +static inline size_t num_pages_contiguous(struct page **pages, size_t nr_pages)
> +{
> +	struct page *cur_page = pages[0];
> +	unsigned long section = page_to_section(cur_page);
> +	size_t i;
> +
> +	for (i = 1; i < nr_pages; i++) {
> +		if (++cur_page != pages[i])
> +			break;
> +		/*
> +		 * In unproblematic kernel configs, page_to_section() == 0 and
> +		 * the whole check will get optimized out.
> +		 */
> +		if (page_to_section(cur_page) != section)
> +			break;
> +	}
> +
> +	return i;
> +}
> +
>  #endif




* Re: [PATCH v5 1/5] mm: introduce num_pages_contiguous()
  2025-08-27 18:10   ` Alex Williamson
@ 2025-09-01  3:25     ` lizhe.67
  0 siblings, 0 replies; 10+ messages in thread
From: lizhe.67 @ 2025-09-01  3:25 UTC (permalink / raw)
  To: alex.williamson
  Cc: akpm, david, farman, jgg, jgg, kvm, linux-mm, lizhe.67, torvalds,
	willy

On Wed, 27 Aug 2025 12:10:55 -0600, alex.williamson@redhat.com wrote:

> On Thu, 14 Aug 2025 14:47:10 +0800
> lizhe.67@bytedance.com wrote:
> 
> > From: Li Zhe <lizhe.67@bytedance.com>
> > 
> > Let's add a simple helper for determining the number of contiguous pages
> > that represent contiguous PFNs.
> > 
> > In an ideal world, this helper would be simpler or not even required.
> > Unfortunately, on some configs we still have to maintain (SPARSEMEM
> > without VMEMMAP), the memmap is allocated per memory section, and we might
> > run into weird corner cases of false positives when blindly testing for
> > contiguous pages only.
> > 
> > One example of such false positives would be a memory section-sized hole
> > that does not have a memmap. The surrounding memory sections might get
> > "struct pages" that are contiguous, but the PFNs are actually not.
> > 
> > This helper will, for example, be useful for determining contiguous PFNs
> > in a GUP result, to batch further operations across returned "struct
> > page"s. VFIO will utilize this interface to accelerate the VFIO DMA map
> > process.
> > 
> > Implementation based on Linus' suggestions to avoid new usage of
> > nth_page() where avoidable.
> > 
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > Co-developed-by: David Hildenbrand <david@redhat.com>
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> >  include/linux/mm.h        |  7 ++++++-
> >  include/linux/mm_inline.h | 35 +++++++++++++++++++++++++++++++++++
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> 
> Does this need any re-evaluation after Willy's series?[1]  Patch 2/
> changes page_to_section() to memdesc_section() which takes a new
> memdesc_flags_t, ie. page->flags.  The conversion appears trivial, but
> mm has many subtleties.
> 
> Ideally we could also avoid merge-time fixups for linux-next and
> mainline.

Thank you for your reminder.

In my view, if Willy's series is integrated, this patch will need to
be revised as follows. Please correct me if I'm wrong.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab4d979f4eec..bad0373099ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1763,7 +1763,12 @@ static inline unsigned long memdesc_section(memdesc_flags_t mdf)
 {
 	return (mdf.f >> SECTIONS_PGSHIFT) & SECTIONS_MASK;
 }
-#endif
+#else /* !SECTION_IN_PAGE_FLAGS */
+static inline unsigned long memdesc_section(memdesc_flags_t mdf)
+{
+	return 0;
+}
+#endif /* SECTION_IN_PAGE_FLAGS */
 
 /**
  * folio_pfn - Return the Page Frame Number of a folio.
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 150302b4a905..bb23496d465b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -616,4 +616,40 @@ static inline bool vma_has_recency(struct vm_area_struct *vma)
 	return true;
 }
 
+/**
+ * num_pages_contiguous() - determine the number of contiguous pages
+ *			    that represent contiguous PFNs
+ * @pages: an array of page pointers
+ * @nr_pages: length of the array, at least 1
+ *
+ * Determine the number of contiguous pages that represent contiguous PFNs
+ * in @pages, starting from the first page.
+ *
+ * In some kernel configs contiguous PFNs will not have contiguous struct
+ * pages. In these configurations num_pages_contiguous() will return a
+ * smaller than ideal number. The caller should continue to check for pfn
+ * contiguity after each call to num_pages_contiguous().
+ *
+ * Returns the number of contiguous pages.
+ */
+static inline size_t num_pages_contiguous(struct page **pages, size_t nr_pages)
+{
+	struct page *cur_page = pages[0];
+	unsigned long section = memdesc_section(cur_page->flags);
+	size_t i;
+
+	for (i = 1; i < nr_pages; i++) {
+		if (++cur_page != pages[i])
+			break;
+		/*
+		 * In unproblematic kernel configs, memdesc_section() == 0 and
+		 * the whole check will get optimized out.
+		 */
+		if (memdesc_section(cur_page->flags) != section)
+			break;
+	}
+
+	return i;
+}
+
 #endif
---

Thanks,
Zhe


