* [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-05 12:49 UTC
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
for large folios'[1].
When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs an individual
put_pfn() operation for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.
This patch optimizes this process by batching the put_pfn() operations.
The performance test results for completing an 8G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained through trace-cmd, are as follows.
In this test, the 8G virtual address space is backed in two separate
runs: once by small folios and once by hugetlbfs with pagesize=2M. For
large folios, we achieve an approximate 66% performance improvement.
However, for small folios, there is an approximate 11% performance
degradation.
Before this patch:
hugetlbfs with pagesize=2M:
funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
small folio:
funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
After this patch:
hugetlbfs with pagesize=2M:
funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
small folio:
funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 58 ++++++++++++++++++++++++++-------
1 file changed, 47 insertions(+), 11 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..9d3ee0f1b298 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -469,17 +469,24 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
return true;
}
-static int put_pfn(unsigned long pfn, int prot)
+/*
+ * The caller must ensure that these npages PFNs belong to the same folio.
+ */
+static int put_pfns(unsigned long pfn, int prot, int npages)
{
if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
-
- unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
- return 1;
+ unpin_user_page_range_dirty_lock(pfn_to_page(pfn),
+ npages, prot & IOMMU_WRITE);
+ return npages;
}
return 0;
}
+static int put_pfn(unsigned long pfn, int prot)
+{
+ return put_pfns(pfn, prot, 1);
+}
+
#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
static void __vfio_batch_init(struct vfio_batch *batch, bool single)
@@ -801,19 +808,48 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
return pinned;
}
+static long get_step(unsigned long pfn, unsigned long npage)
+{
+ struct folio *folio;
+ struct page *page;
+
+ if (is_invalid_reserved_pfn(pfn))
+ return 1;
+
+ page = pfn_to_page(pfn);
+ folio = page_folio(page);
+
+ if (!folio_test_large(folio))
+ return 1;
+
+ /*
+ * The precondition for doing this here is that pfn is contiguous
+ */
+ return min_t(long, npage,
+ folio_nr_pages(folio) - folio_page_idx(folio, page));
+}
+
static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
unsigned long pfn, unsigned long npage,
bool do_accounting)
{
long unlocked = 0, locked = 0;
- long i;
- for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
- if (put_pfn(pfn++, dma->prot)) {
- unlocked++;
- if (vfio_find_vpfn(dma, iova))
- locked++;
+ while (npage) {
+ long step = get_step(pfn, npage);
+
+ /*
+ * Although the third parameter of put_pfns() is of type int,
+ * the value of step here will not exceed the range that int
+ * can represent. Therefore, it is safe to pass step.
+ */
+ if (put_pfns(pfn, dma->prot, step)) {
+ unlocked += step;
+ locked += vpfn_pages(dma, iova, step);
}
+ pfn += step;
+ iova += PAGE_SIZE * step;
+ npage -= step;
}
if (do_accounting)
--
2.20.1
* Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: Alex Williamson @ 2025-06-05 16:09 UTC
To: lizhe.67; +Cc: kvm, linux-kernel
On Thu, 5 Jun 2025 20:49:23 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> put_pfn() operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimize this process by batching the put_pfn() operations.
>
> The performance test results, based on v6.15, for completing the 8G VFIO
> IOMMU DMA unmapping, obtained through trace-cmd, are as follows. In this
> case, the 8G virtual address space has been separately mapped to small
> folio and physical memory using hugetlbfs with pagesize=2M. For large
> folio, we achieve an approximate 66% performance improvement. However,
> for small folios, there is an approximate 11% performance degradation.
>
> Before this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
>
> After this patch:
>
> hugetlbfs with pagesize=2M:
> funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
>
> small folio:
> funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
I was just playing with a unit test[1] to validate your previous patch
and added this as well:
Test options:
vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
I'm running it once with device and size for the madvise and populate
tests, then again adding /dev/hugepages (1G) for the remaining test:
Base:
# vfio-pci-mem-dma-map 0000:0b:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.726 s (22.0 GB/s)
VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.071 s (224.0 GB/s)
VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.296 s (54.0 GB/s)
VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.741 s (21.6 GB/s)
VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.010 s (1542.9 GB/s)
VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
Map + Unmap patches:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.301 s (53.2 GB/s)
VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.735 s (21.8 GB/s)
VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.011 s (1434.7 GB/s)
VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
So overall the map optimization shows a nice improvement in hugetlbfs
mapping performance. I was hoping we'd see some improvement in THP,
but that doesn't appear to be the case. Will folio_nr_pages() ever be
more than 1 for THP? The degradation in the non-hugetlbfs case is small,
but notable.
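FWIW, one quick way to sanity check whether the test buffer is actually
THP-backed is to look at the AnonHugePages counter in the process smaps;
a rough sketch along these lines (assuming /proc/<pid>/smaps_rollup is
available) could be dropped into the test:

#include <stdio.h>
#include <string.h>

/* Print the AnonHugePages line from smaps_rollup; a non-zero value
 * means at least part of the anonymous memory is THP-backed. */
static void report_thp_usage(void)
{
	FILE *f = fopen("/proc/self/smaps_rollup", "r");
	char line[256];

	if (!f)
		return;

	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "AnonHugePages:", 14))
			fputs(line, stderr);

	fclose(f);
}

Calling that right after the populate loop would show whether the
MADV_HUGEPAGE case ever actually gets huge pages.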
The unmap optimization shows a pretty substantial decline in the
non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Alex
[1] https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
* Re: [RFC] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-10 3:10 UTC
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67
On Thu, 5 Jun 2025 10:09:58 -0600, alex.williamson@redhat.com wrote:
> On Thu, 5 Jun 2025 20:49:23 +0800
> lizhe.67@bytedance.com wrote:
>
> > From: Li Zhe <lizhe.67@bytedance.com>
> >
> > This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> > for large folios'[1].
> >
> > When vfio_unpin_pages_remote() is called with a range of addresses that
> > includes large folios, the function currently performs individual
> > put_pfn() operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> >
> > This patch optimize this process by batching the put_pfn() operations.
> >
> > The performance test results, based on v6.15, for completing the 8G VFIO
> > IOMMU DMA unmapping, obtained through trace-cmd, are as follows. In this
> > case, the 8G virtual address space has been separately mapped to small
> > folio and physical memory using hugetlbfs with pagesize=2M. For large
> > folio, we achieve an approximate 66% performance improvement. However,
> > for small folios, there is an approximate 11% performance degradation.
> >
> > Before this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 94413.092 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 118273.331 us | vfio_unmap_unpin();
> >
> > After this patch:
> >
> > hugetlbfs with pagesize=2M:
> > funcgraph_entry: # 31260.124 us | vfio_unmap_unpin();
> >
> > small folio:
> > funcgraph_entry: # 131945.796 us | vfio_unmap_unpin();
>
> I was just playing with a unit test[1] to validate your previous patch
> and added this as well:
>
> Test options:
>
> vfio-pci-mem-dma-map <ssss:bb:dd.f> <size GB> [hugetlb path]
>
> I'm running it once with device and size for the madvise and populate
> tests, then again adding /dev/hugepages (1G) for the remaining test:
>
> Base:
> # vfio-pci-mem-dma-map 0000:0b:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.294 s (54.4 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.3 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.726 s (22.0 GB/s)
> VFIO UNMAP DMA in 0.169 s (94.5 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.071 s (224.0 GB/s)
> VFIO UNMAP DMA in 0.103 s (156.0 GB/s)
>
> Map patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.296 s (54.0 GB/s)
> VFIO UNMAP DMA in 0.175 s (91.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.741 s (21.6 GB/s)
> VFIO UNMAP DMA in 0.184 s (86.7 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.010 s (1542.9 GB/s)
> VFIO UNMAP DMA in 0.109 s (146.1 GB/s)
>
> Map + Unmap patches:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.301 s (53.2 GB/s)
> VFIO UNMAP DMA in 0.236 s (67.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.735 s (21.8 GB/s)
> VFIO UNMAP DMA in 0.234 s (68.4 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.011 s (1434.7 GB/s)
> VFIO UNMAP DMA in 0.023 s (686.5 GB/s)
>
> So overall the map optimization shows a nice improvement in hugetlbfs
> mapping performance. I was hoping we'd see some improvement in THP,
> but that doesn't appear to be the case. Will folio_nr_pages() ever be
> more than 1 for THP? The degradation in non-hugetlbfs case is small,
> but notable.
After I made the following modifications to the unit test, the
performance test results met expectations.
diff --git a/vfio-pci-mem-dma-map.c b/vfio-pci-mem-dma-map.c
index 6fd3e83..58ea363 100644
--- a/vfio-pci-mem-dma-map.c
+++ b/vfio-pci-mem-dma-map.c
@@ -40,7 +40,7 @@ int main(int argc, char **argv)
{
int container, device, ret, huge_fd = -1, pgsize = getpagesize(), i;
int prot = PROT_READ | PROT_WRITE;
- int flags = MAP_SHARED | MAP_ANONYMOUS;
+ int flags = MAP_PRIVATE | MAP_ANONYMOUS;
char mempath[PATH_MAX] = "";
unsigned long size_gb, j, map_total, unmap_total, start, elapsed;
float secs;
@@ -131,7 +131,7 @@ int main(int argc, char **argv)
start = now_nsec();
for (j = 0, ptr = map; j < size_gb << 30; j += pgsize)
- (void)ptr[j];
+ ptr[j] = 1;
elapsed = now_nsec() - start;
secs = (float)elapsed / NSEC_PER_SEC;
fprintf(stderr, "%d: mmap populated in %.3fs\n", i, secs);
It seems that we need to use MAP_PRIVATE in this unit test to utilize
THP, rather than MAP_SHARED. My understanding is that for MAP_SHARED,
we call shmem_zero_setup() to back the anonymous mapping with
"/dev/zero". In the case of MAP_PRIVATE, we directly call
vma_set_anonymous() (as referenced in __mmap_new_vma()). Since the
vm_ops for "/dev/zero" does not implement the (*huge_fault)() callback,
this effectively precludes the use of THP.
In addition, the read in the expression (void)ptr[j] might be optimized
away by the compiler. It seems like a better strategy to simply assign
a value to each page.
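For reference, a minimal standalone sketch of just the anonymous-mapping
part of the test after these two changes (illustrative 1G size, error
handling trimmed; the hugetlbfs and VFIO parts are omitted) might look
like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t size = 1UL << 30;	/* 1G anonymous region */
	long pgsize = sysconf(_SC_PAGESIZE);
	unsigned char *ptr;
	size_t j;

	/*
	 * MAP_PRIVATE anonymous memory goes through vma_set_anonymous(),
	 * so the fault path (and khugepaged) can back it with THP;
	 * MAP_SHARED would instead be backed by shmem via
	 * shmem_zero_setup().
	 */
	ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Explicitly request THP for this range. */
	if (madvise(ptr, size, MADV_HUGEPAGE))
		perror("madvise");

	/*
	 * Touch every page with a store; a plain read such as
	 * (void)ptr[j] may be optimized away, a write cannot.
	 */
	for (j = 0; j < size; j += pgsize)
		ptr[j] = 1;

	printf("populated %zu bytes\n", size);
	munmap(ptr, size);
	return 0;
}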
After making this modification to the unit test, there is almost no
difference in performance between the THP scenario and the hugetlbfs
scenario.
Base (v6.15):
# ./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (331.3 GB/s)
VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.281 s (57.0 GB/s)
VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.053 s (301.2 GB/s)
VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
Map patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (581.7 GB/s)
VFIO UNMAP DMA in 0.138 s (115.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.288 s (55.5 GB/s)
VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (496.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.4 GB/s)
> The unmap optimization shows a pretty substantial decline in the
> non-hugetlbfs cases. I don't think that can be overlooked. Thanks,
Yes, the performance in the MAP_POPULATE scenario will experience
a significant drop. I've recently come up with a better idea to
address this performance issue. I will send the v2 patch later.
Thanks,
Zhe