* [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-05 12:49 UTC
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
for large folios'[1].
When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs an individual
put_pfn() operation for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.
This patch optimizes this process by batching the put_pfn() operations.
The performance test results for completing an 8G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained through trace-cmd, are as follows.
In this test, the 8G virtual address space is backed in two separate
runs: once by small folios and once by hugetlbfs with pagesize=2M. For
large folios, we achieve an approximate 66% performance improvement.
However, for small folios, there is an approximate 11% performance
degradation.
Before this patch:
hugetlbfs with pagesize=2M:
funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
small folio:
funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
After this patch:
hugetlbfs with pagesize=2M:
funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
small folio:
funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 58 ++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 11 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..9d3ee0f1b298 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -469,17 +469,24 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
return true;
}
-static int put_pfn(unsigned long pfn, int prot)
+/*
+ * The caller must ensure that these npages PFNs belong to the same folio.
+ */
+static int put_pfns(unsigned long pfn, int prot, int npages)
{
if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
-
- unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
- return 1;
+ unpin_user_page_range_dirty_lock(pfn_to_page(pfn),
+ npages, prot & IOMMU_WRITE);
+ return npages;
}
return 0;
}
+static int put_pfn(unsigned long pfn, int prot)
+{
+ return put_pfns(pfn, prot, 1);
+}
+
#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
static void __vfio_batch_init(struct vfio_batch *batch, bool single)
@@ -801,19 +808,48 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
return pinned;
}
+static long get_step(unsigned long pfn, unsigned long npage)
+{
+ struct folio *folio;
+ struct page *page;
+
+ if (is_invalid_reserved_pfn(pfn))
+ return 1;
+
+ page = pfn_to_page(pfn);
+ folio = page_folio(page);
+
+ if (!folio_test_large(folio))
+ return 1;
+
+ /*
+ * The precondition for doing this here is that pfn is contiguous
+ */
+ return min_t(long, npage,
+ folio_nr_pages(folio) - folio_page_idx(folio, page));
+}
+
static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
unsigned long pfn, unsigned long npage,
bool do_accounting)
{
long unlocked = 0, locked = 0;
- long i;
- for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
- if (put_pfn(pfn++, dma->prot)) {
- unlocked++;
- if (vfio_find_vpfn(dma, iova))
- locked++;
+ while (npage) {
+ long step = get_step(pfn, npage);
+
+ /*
+ * Although the third parameter of put_pfns() is of type int,
+ * the value of step here will not exceed the range that int
+ * can represent. Therefore, it is safe to pass step.
+ */
+ if (put_pfns(pfn, dma->prot, step)) {
+ unlocked += step;
+ locked += vpfn_pages(dma, iova, step);
}
+ pfn += step;
+ iova += PAGE_SIZE * step;
+ npage -= step;
}
if (do_accounting)
--
2.20.1
* Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: Alex Williamson @ 2025-06-05 16:09 UTC
To: lizhe.67; +Cc: kvm, linux-kernel
On Thu, 5 Jun 2025 20:49:23 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> put_pfn() operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimize this process by batching the put_pfn() operations.
>
> The performance test results, based on v6.15, for completing the 8G VFIO
> IOMMU DMA unmapping, obtained through trace-cmd, are as follows. In this
> case, the 8G virtual address space has been separately mapped to small
> folio and physical memory using hugetlbfs with pagesize=2M. For large
> folio, we achieve an approximate 66% performance improvement. However,
> for small folios, there is an approximate 11% performance degradation.
>
> Before this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
>
> After this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
I was just playing with a unit test[1] to validate your previous patch
and added this as well:
Test options:
vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
I'm running it once with device and size for the madvise and populate
tests, then again adding /dev/hugepages (1G) for the remaining test:
Base:
# vfio-pci-mem-dma-map 0000:0b:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.726 s (22.0 GB/s)
VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.071 s (224.0 GB/s)
VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.296 s (54.0 GB/s)
VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.741 s (21.6 GB/s)
VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.010 s (1542.9 GB/s)
VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
Map + Unmap patches:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.301 s (53.2 GB/s)
VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.735 s (21.8 GB/s)
VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.011 s (1434.7 GB/s)
VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
So overall the map optimization shows a nice improvement in hugetlbfs
mapping performance. I was hoping we'd see some improvement in THP,
but that doesn't appear to be the case. Will folio_nr_pages() ever be
more than 1 for THP? The degradation in the non-hugetlbfs case is small,
but notable.
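FWIW, one quick way to sanity check whether the test buffer is actually
THP-backed is to look at the AnonHugePages counter in the process smaps;
a rough sketch along these lines (assuming /proc/<pid>/smaps_rollup is
available) could be dropped into the test:

#include <stdio.h>
#include <string.h>

/* Print the AnonHugePages line from smaps_rollup; a non-zero value
 * means at least part of the anonymous memory is THP-backed. */
static void report_thp_usage(void)
{
	FILE *f = fopen("/proc/self/smaps_rollup", "r");
	char line[256];

	if (!f)
		return;

	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "AnonHugePages:", 14))
			fputs(line, stderr);

	fclose(f);
}

Calling that right after the populate loop would show whether the
MADV_HUGEPAGE case ever actually gets huge pages.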
The unmap optimization shows a pretty substantial decline in the
non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Alex
[1] https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
* Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-10 3:10 UTC
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67
On Thu, 5 Jun 2025 10:09:58 -0600, alex.williamson@redhat.com wrote:
> On Thu, 5 Jun 2025 20:49:23 +0800
> lizhe.67@bytedance.com wrote:
>
> > From: Li Zhe <lizhe.67@bytedance.com>
> >
> > This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> > for large folios'[1].
> >
> > When vfio_unpin_pages_remote() is called with a range of addresses that
> > includes large folios, the function currently performs individual
> > put_pfn() operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> >
> > This patch optimize this process by batching the put_pfn() operations.
> >
> > The performance test results, based on v6.15, for completing the 8G VFIO
> > IOMMU DMA unmapping, obtained through trace-cmd, are as follows. In this
> > case, the 8G virtual address space has been separately mapped to small
> > folio and physical memory using hugetlbfs with pagesize=2M. For large
> > folio, we achieve an approximate 66% performance improvement. However,
> > for small folios, there is an approximate 11% performance degradation.
> >
> > Before this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
> >
> > After this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
>
> I was just playing with a unit test[1] to validate your previous patch
> and added this as well:
>
> Test options:
>
> vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
>
> I'm running it once with device and size for the madvise and populate
> tests, then again adding /dev/hugepages (1G) for the remaining test:
>
> Base:
> # vfio-pci-mem-dma-map 0000:0b:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.294 s (54.4 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.726 s (22.0 GB/s)
> VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.071 s (224.0 GB/s)
> VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
>
> Map patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.296 s (54.0 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.741 s (21.6 GB/s)
> VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.010 s (1542.9 GB/s)
> VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
>
> Map + Unmap patches:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.301 s (53.2 GB/s)
> VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.735 s (21.8 GB/s)
> VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.011 s (1434.7 GB/s)
> VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
>
> So overall the map optimization shows a nice improvement in hugetlbfs
> mapping performance. I was hoping we'd see some improvement in THP,
> but that doesn't appear to be the case. Will folio_nr_pages() ever be
> more than 1 for THP? The degradation in non-hugetlbfs case is small,
> but notable.
After I made the following modifications to the unit test, the
performance test results met expectations.
diff --git a/vfio-pci-mem-dma-map.c b/vfio-pci-mem-dma-map.c
index 6fd3e83..58ea363 100644
--- a/vfio-pci-mem-dma-map.c
+++ b/vfio-pci-mem-dma-map.c
@@ -40,7 +40,7 @@ int main(int argc, char **argv)
{
int container, device, ret, huge_fd = -1, pgsize = getpagesize(), i;
int prot = PROT_READ | PROT_WRITE;
- int flags = MAP_SHARED | MAP_ANONYMOUS;
+ int flags = MAP_PRIVATE | MAP_ANONYMOUS;
char mempath[PATH_MAX] = "";
unsigned long size_gb, j, map_total, unmap_total, start, elapsed;
float secs;
@@ -131,7 +131,7 @@ int main(int argc, char **argv)
start = now_nsec();
for (j = 0, ptr = map; j < size_gb << 30; j += pgsize)
- (void)ptr[j];
+ ptr[j] = 1;
elapsed = now_nsec() - start;
secs = (float)elapsed / NSEC_PER_SEC;
fprintf(stderr, "%d: mmap populated in %.3fs\n", i, secs);
It seems that we need to use MAP_PRIVATE in this unit test to utilize
THP, rather than MAP_SHARED. My understanding is that for MAP_SHARED,
we call shmem_zero_setup() to back the anonymous mapping with
"/dev/zero". In the case of MAP_PRIVATE, we directly call
vma_set_anonymous() (as referenced in __mmap_new_vma()). Since the
vm_ops for "/dev/zero" does not implement the (*huge_fault)() callback,
this effectively precludes the use of THP.
In addition, the read in the expression (void)ptr[j] might be optimized
away by the compiler. It seems like a better strategy to simply assign
a value to each page.
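For reference, a minimal standalone sketch of just the anonymous-mapping
part of the test after these two changes (illustrative 1G size, error
handling trimmed; the hugetlbfs and VFIO parts are omitted) might look
like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 1UL << 30;	/* 1G anonymous region */
	long pgsize = sysconf(_SC_PAGESIZE);
	unsigned char *ptr;
	size_t j;

	/*
	 * MAP_PRIVATE anonymous memory goes through vma_set_anonymous(),
	 * so the fault path (and khugepaged) can back it with THP;
	 * MAP_SHARED would instead be backed by shmem via
	 * shmem_zero_setup().
	 */
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Explicitly request THP for this range. */
	if (madvise(ptr, size, MADV_HUGEPAGE))
		perror("madvise");

	/*
	 * Touch every page with a store; a plain read such as
	 * (void)ptr[j] may be optimized away, a write cannot.
	 */
	for (j = 0; j < size; j += pgsize)
		ptr[j] = 1;

	printf("populated %zu bytes\n", size);
	munmap(ptr, size);
	return 0;
}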
After making this modification to the unit test, there is almost no
difference in performance between the THP scenario and the hugetlbfs
scenario.
Base (v6.15):
# ./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (331.3 GB/s)
VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.281 s (57.0 GB/s)
VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.053 s (301.2 GB/s)
VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (581.7 GB/s)
VFIO UNMAP DMA in 0.138 s (115.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (496.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.4 GB/s)
> The unmap optimization shows a pretty substantial decline in the
> non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Yes, the performance in the MAP_POPULATE scenario will experience
a significant drop. I've recently come up with a better idea to
address this performance issue. I will send the v2 patch later.
Thanks,
Zhe