* [PATCH v3 0/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
@ 2025-06-16 7:52 lizhe.67
2025-06-16 7:52 ` [PATCH v3 1/2] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
2025-06-16 7:52 ` [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
0 siblings, 2 replies; 9+ messages in thread
From: lizhe.67 @ 2025-06-16 7:52 UTC (permalink / raw)
To: alex.williamson; +Cc: kvm, linux-kernel, david, peterx, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
This patchset is based on patch 'vfio/type1: optimize
vfio_pin_pages_remote() for large folios'[1].
When vfio_unpin_pages_remote() is called with a range of addresses
that includes large folios, the function currently performs individual
put_pfn() operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of
pages. We can optimize this process by batching the put_pfn()
operations.
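The core idea, shown here only as a rough sketch (the actual code is in
the second patch), is to consume all remaining pages of the current
large folio in a single step rather than one page at a time:

	/* sketch only: batch the unpin over the rest of the folio */
	nr_pages = min_t(long, npage,
			 folio_nr_pages(folio) - folio_page_idx(folio, page));
	unpin_user_page_range_dirty_lock(page, nr_pages,
					 dma->prot & IOMMU_WRITE);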
The first patch batches the vfio_find_vpfn() calls in function
vfio_unpin_pages_remote(). However, performance testing indicates that
this patch does not seem to have a significant impact. The primary
reason is that the vpfn rb tree is generally empty. Nevertheless, we
believe it can still offer performance benefits in certain scenarios,
and it also lays the groundwork for the second patch. The second patch,
using the method described earlier, optimizes the performance of
vfio_unpin_pages_remote() for large folio scenarios.
The performance test results for completing a 16G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained with the unit test[2] (with
slight modifications[3]), are as follows.
Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (338.6 GB/s)
VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (308.3 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
Map[1] + first patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (596.1 GB/s)
VFIO UNMAP DMA in 0.138 s (115.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.292 s (54.8 GB/s)
VFIO UNMAP DMA in 0.310 s (51.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (506.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.1 GB/s)
Map[1] + first + second patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (598.2 GB/s)
VFIO UNMAP DMA in 0.049 s (328.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.289 s (55.3 GB/s)
VFIO UNMAP DMA in 0.303 s (52.9 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (506.8 GB/s)
VFIO UNMAP DMA in 0.049 s (326.7 GB/s)
The first patch appears to have a negligible impact on the performance
of VFIO UNMAP DMA.
With the second patch, we achieve an approximately 64% reduction in
VFIO UNMAP DMA time for large folios. For small folios, the
performance test results show no significant change.
[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[3]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
Changelogs:
v2->v3:
- Split the original patch into two separate patches.
- Add several comments specific to large folio scenarios.
- Rename two variables.
- The update to iova has been removed within the loop in
vfio_unpin_pages_remote().
- Update the performance test results.
v1->v2:
- Refactor the implementation of the optimized code
v2: https://lore.kernel.org/all/20250610045753.6405-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20250605124923.21896-1-lizhe.67@bytedance.com/
Li Zhe (2):
vfio/type1: batch vfio_find_vpfn() in function
vfio_unpin_pages_remote()
vfio/type1: optimize vfio_unpin_pages_remote() for large folio
drivers/vfio/vfio_iommu_type1.c | 57 ++++++++++++++++++++++++++-------
1 file changed, 45 insertions(+), 12 deletions(-)
--
2.20.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v3 1/2] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()
2025-06-16 7:52 [PATCH v3 0/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
@ 2025-06-16 7:52 ` lizhe.67
2025-06-16 7:52 ` [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
1 sibling, 0 replies; 9+ messages in thread
From: lizhe.67 @ 2025-06-16 7:52 UTC (permalink / raw)
To: alex.williamson; +Cc: kvm, linux-kernel, david, peterx, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
This patch is based on patch 'vfio/type1: optimize
vfio_pin_pages_remote() for large folios'[1].
The function vpfn_pages() returns the number of vpfn nodes in the
vpfn rb tree within a specified range. This allows us to avoid
looking up each vpfn individually in vfio_unpin_pages_remote().
This patch batches the vfio_find_vpfn() calls in
vfio_unpin_pages_remote() accordingly.
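For context, vpfn_pages() is introduced by patch [1] and does not
appear in this diff. A rough, hypothetical sketch of its semantics
(the real helper avoids doing one rb-tree lookup per page) would be:

static long vpfn_pages(struct vfio_dma *dma, dma_addr_t iova, long npage)
{
	long i, ret = 0;

	/* count vpfn nodes covering [iova, iova + npage * PAGE_SIZE) */
	for (i = 0; i < npage; i++)
		if (vfio_find_vpfn(dma, iova + i * PAGE_SIZE))
			ret++;
	return ret;
}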
[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..e952bf8bdfab 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
unsigned long pfn, unsigned long npage,
bool do_accounting)
{
- long unlocked = 0, locked = 0;
+ long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
long i;
- for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
- if (put_pfn(pfn++, dma->prot)) {
+ for (i = 0; i < npage; i++)
+ if (put_pfn(pfn++, dma->prot))
unlocked++;
- if (vfio_find_vpfn(dma, iova))
- locked++;
- }
- }
if (do_accounting)
vfio_lock_acct(dma, locked - unlocked, true);
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 7:52 [PATCH v3 0/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
2025-06-16 7:52 ` [PATCH v3 1/2] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
@ 2025-06-16 7:52 ` lizhe.67
2025-06-16 8:14 ` David Hildenbrand
1 sibling, 1 reply; 9+ messages in thread
From: lizhe.67 @ 2025-06-16 7:52 UTC (permalink / raw)
To: alex.williamson; +Cc: kvm, linux-kernel, david, peterx, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
put_pfn() operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of pages.
This patch optimizes this process by batching the put_pfn() operations.
The performance test results for completing a 16G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained with the unit test[1] (with
slight modifications[2]), are as follows.
Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (338.6 GB/s)
VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (308.3 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
Map[3] + This patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (598.2 GB/s)
VFIO UNMAP DMA in 0.049 s (328.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.289 s (55.3 GB/s)
VFIO UNMAP DMA in 0.303 s (52.9 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (506.8 GB/s)
VFIO UNMAP DMA in 0.049 s (326.7 GB/s)
For large folios, we achieve an approximately 64% reduction in VFIO
UNMAP DMA time. For small folios, the performance test results show
no significant change.
[1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
[3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
1 file changed, 46 insertions(+), 9 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e952bf8bdfab..09ecc546ece8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
return true;
}
-static int put_pfn(unsigned long pfn, int prot)
+static inline void _put_pfns(struct page *page, int npages, int prot)
{
- if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
+ unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
+}
- unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
- return 1;
+/*
+ * The caller must ensure that these npages PFNs belong to the same folio.
+ */
+static inline int put_pfns(unsigned long pfn, int npages, int prot)
+{
+ if (!is_invalid_reserved_pfn(pfn)) {
+ _put_pfns(pfn_to_page(pfn), npages, prot);
+ return npages;
}
return 0;
}
+static inline int put_pfn(unsigned long pfn, int prot)
+{
+ return put_pfns(pfn, 1, prot);
+}
+
#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
static void __vfio_batch_init(struct vfio_batch *batch, bool single)
@@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
bool do_accounting)
{
long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
- long i;
- for (i = 0; i < npage; i++)
- if (put_pfn(pfn++, dma->prot))
- unlocked++;
+ while (npage) {
+ long nr_pages = 1;
+
+ if (!is_invalid_reserved_pfn(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
+ long folio_pages_num = folio_nr_pages(folio);
+
+ /*
+ * For a folio, it represents a physically
+ * contiguous set of bytes, and all of its pages
+ * share the same invalid/reserved state.
+ *
+ * Here, our PFNs are contiguous. Therefore, if we
+ * detect that the current PFN belongs to a large
+ * folio, we can batch the operations for the next
+ * nr_pages PFNs.
+ */
+ if (folio_pages_num > 1)
+ nr_pages = min_t(long, npage,
+ folio_pages_num -
+ folio_page_idx(folio, page));
+
+ _put_pfns(page, nr_pages, dma->prot);
+ unlocked += nr_pages;
+ }
+
+ pfn += nr_pages;
+ npage -= nr_pages;
+ }
if (do_accounting)
vfio_lock_acct(dma, locked - unlocked, true);
--
2.20.1
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 7:52 ` [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
@ 2025-06-16 8:14 ` David Hildenbrand
2025-06-16 8:27 ` David Hildenbrand
2025-06-16 11:13 ` lizhe.67
0 siblings, 2 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-06-16 8:14 UTC (permalink / raw)
To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, peterx
On 16.06.25 09:52, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> put_pfn() operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimizes this process by batching the put_pfn() operations.
>
> The performance test results, based on v6.15, for completing the 16G VFIO
> IOMMU DMA unmapping, obtained through unit test[1] with slight
> modifications[2], are as follows.
>
> Base(v6.15):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.047 s (338.6 GB/s)
> VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.280 s (57.2 GB/s)
> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.052 s (308.3 GB/s)
> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
>
> Map[3] + This patchset:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.027 s (598.2 GB/s)
> VFIO UNMAP DMA in 0.049 s (328.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.289 s (55.3 GB/s)
> VFIO UNMAP DMA in 0.303 s (52.9 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.032 s (506.8 GB/s)
> VFIO UNMAP DMA in 0.049 s (326.7 GB/s)
>
> For large folios, we achieve an approximately 64% reduction in VFIO
> UNMAP DMA time. For small folios, the performance test results show
> no significant change.
>
> [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
> [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
> [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
> 1 file changed, 46 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e952bf8bdfab..09ecc546ece8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> return true;
> }
>
> -static int put_pfn(unsigned long pfn, int prot)
> +static inline void _put_pfns(struct page *page, int npages, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> +}
>
> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> - return 1;
> +/*
> + * The caller must ensure that these npages PFNs belong to the same folio.
> + */
> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + _put_pfns(pfn_to_page(pfn), npages, prot);
> + return npages;
> }
> return 0;
> }
>
> +static inline int put_pfn(unsigned long pfn, int prot)
> +{
> + return put_pfns(pfn, 1, prot);
> +}
> +
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> bool do_accounting)
> {
> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> - long i;
>
> - for (i = 0; i < npage; i++)
> - if (put_pfn(pfn++, dma->prot))
> - unlocked++;
> + while (npage) {
> + long nr_pages = 1;
> +
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = page_folio(page);
> + long folio_pages_num = folio_nr_pages(folio);
> +
> + /*
> + * For a folio, it represents a physically
> + * contiguous set of bytes, and all of its pages
> + * share the same invalid/reserved state.
> + *
> + * Here, our PFNs are contiguous. Therefore, if we
> + * detect that the current PFN belongs to a large
> + * folio, we can batch the operations for the next
> + * nr_pages PFNs.
> + */
> + if (folio_pages_num > 1)
> + nr_pages = min_t(long, npage,
> + folio_pages_num -
> + folio_page_idx(folio, page));
> +
> + _put_pfns(page, nr_pages, dma->prot);
This is sneaky. You interpret the page pointer as an actual page array,
assuming that it would give you the right values when advancing nr_pages
in that array.
This is mostly true, but with !CONFIG_SPARSEMEM_VMEMMAP it is not
universally true for very large folios (e.g., in a 1 GiB hugetlb folio
when we cross the 128 MiB mark on x86).
Not sure if that could already trigger here, but it is subtle.
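To illustrate the subtlety with a sketch (not code from the patch):
without CONFIG_SPARSEMEM_VMEMMAP the memmap is only guaranteed to be
virtually contiguous within a section, so for a folio spanning a
section boundary:

	/* plain pointer arithmetic: only safe when the memmap is
	 * virtually contiguous (CONFIG_SPARSEMEM_VMEMMAP) */
	struct page *next = page + i;

	/* pfn-based: safe in either configuration */
	struct page *next_safe = nth_page(page, i);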
> + unlocked += nr_pages;
We could do slightly better here, as we already have the folio. We would
add an unpin_user_folio_dirty_locked(), similar to unpin_user_folio().
Instead of _put_pfns, we would be calling
unpin_user_folio_dirty_locked(folio, nr_pages, dma->prot & IOMMU_WRITE);
Then, also what I mention above would not apply.
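For reference, the existing helper in mm/gup.c is essentially a thin
wrapper around gup_put_folio() (quoted from memory, so treat it as a
sketch); the new helper would add the dirty handling on top:

void unpin_user_folio(struct folio *folio, unsigned long npages)
{
	gup_put_folio(folio, npages, FOLL_PIN);
}
EXPORT_SYMBOL(unpin_user_folio);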
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 8:14 ` David Hildenbrand
@ 2025-06-16 8:27 ` David Hildenbrand
2025-06-16 8:28 ` David Hildenbrand
2025-06-16 11:13 ` lizhe.67
1 sibling, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-06-16 8:27 UTC (permalink / raw)
To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, peterx
On 16.06.25 10:14, David Hildenbrand wrote:
> On 16.06.25 09:52, lizhe.67@bytedance.com wrote:
>> From: Li Zhe <lizhe.67@bytedance.com>
>>
>> When vfio_unpin_pages_remote() is called with a range of addresses that
>> includes large folios, the function currently performs individual
>> put_pfn() operations for each page. This can lead to significant
>> performance overheads, especially when dealing with large ranges of pages.
>>
>> This patch optimizes this process by batching the put_pfn() operations.
>>
>> The performance test results, based on v6.15, for completing the 16G VFIO
>> IOMMU DMA unmapping, obtained through unit test[1] with slight
>> modifications[2], are as follows.
>>
>> Base(v6.15):
>> ./vfio-pci-mem-dma-map 0000:03:00.0 16
>> ------- AVERAGE (MADV_HUGEPAGE) --------
>> VFIO MAP DMA in 0.047 s (338.6 GB/s)
>> VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
>> ------- AVERAGE (MAP_POPULATE) --------
>> VFIO MAP DMA in 0.280 s (57.2 GB/s)
>> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
>> ------- AVERAGE (HUGETLBFS) --------
>> VFIO MAP DMA in 0.052 s (308.3 GB/s)
>> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
>>
>> Map[3] + This patchset:
>> ------- AVERAGE (MADV_HUGEPAGE) --------
>> VFIO MAP DMA in 0.027 s (598.2 GB/s)
>> VFIO UNMAP DMA in 0.049 s (328.7 GB/s)
>> ------- AVERAGE (MAP_POPULATE) --------
>> VFIO MAP DMA in 0.289 s (55.3 GB/s)
>> VFIO UNMAP DMA in 0.303 s (52.9 GB/s)
>> ------- AVERAGE (HUGETLBFS) --------
>> VFIO MAP DMA in 0.032 s (506.8 GB/s)
>> VFIO UNMAP DMA in 0.049 s (326.7 GB/s)
>>
>> For large folios, we achieve an approximately 64% reduction in VFIO
>> UNMAP DMA time. For small folios, the performance test results show
>> no significant change.
>>
>> [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
>> [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
>> [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
>>
>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>> ---
>> drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
>> 1 file changed, 46 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e952bf8bdfab..09ecc546ece8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>> return true;
>> }
>>
>> -static int put_pfn(unsigned long pfn, int prot)
>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>> {
>> - if (!is_invalid_reserved_pfn(pfn)) {
>> - struct page *page = pfn_to_page(pfn);
>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>> +}
>>
>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>> - return 1;
>> +/*
>> + * The caller must ensure that these npages PFNs belong to the same folio.
>> + */
>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>> +{
>> + if (!is_invalid_reserved_pfn(pfn)) {
>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>> + return npages;
>> }
>> return 0;
>> }
>>
>> +static inline int put_pfn(unsigned long pfn, int prot)
>> +{
>> + return put_pfns(pfn, 1, prot);
>> +}
>> +
>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>
>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>> @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>> bool do_accounting)
>> {
>> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>> - long i;
>>
>> - for (i = 0; i < npage; i++)
>> - if (put_pfn(pfn++, dma->prot))
>> - unlocked++;
>> + while (npage) {
>> + long nr_pages = 1;
>> +
>> + if (!is_invalid_reserved_pfn(pfn)) {
>> + struct page *page = pfn_to_page(pfn);
>> + struct folio *folio = page_folio(page);
>> + long folio_pages_num = folio_nr_pages(folio);
>> +
>> + /*
>> + * For a folio, it represents a physically
>> + * contiguous set of bytes, and all of its pages
>> + * share the same invalid/reserved state.
>> + *
>> + * Here, our PFNs are contiguous. Therefore, if we
>> + * detect that the current PFN belongs to a large
>> + * folio, we can batch the operations for the next
>> + * nr_pages PFNs.
>> + */
>> + if (folio_pages_num > 1)
>> + nr_pages = min_t(long, npage,
>> + folio_pages_num -
>> + folio_page_idx(folio, page));
>> +
>> + _put_pfns(page, nr_pages, dma->prot);
>
>
> This is sneaky. You interpret the page pointer as an actual page array,
> assuming that it would give you the right values when advancing nr_pages
> in that array.
Just to add to this: unpin_user_page_range_dirty_lock() is not
universally safe in the hugetlb scenario I described.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 8:27 ` David Hildenbrand
@ 2025-06-16 8:28 ` David Hildenbrand
0 siblings, 0 replies; 9+ messages in thread
From: David Hildenbrand @ 2025-06-16 8:28 UTC (permalink / raw)
To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, peterx
On 16.06.25 10:27, David Hildenbrand wrote:
> On 16.06.25 10:14, David Hildenbrand wrote:
>> On 16.06.25 09:52, lizhe.67@bytedance.com wrote:
>>> From: Li Zhe <lizhe.67@bytedance.com>
>>>
>>> When vfio_unpin_pages_remote() is called with a range of addresses that
>>> includes large folios, the function currently performs individual
>>> put_pfn() operations for each page. This can lead to significant
>>> performance overheads, especially when dealing with large ranges of pages.
>>>
>>> This patch optimizes this process by batching the put_pfn() operations.
>>>
>>> The performance test results, based on v6.15, for completing the 16G VFIO
>>> IOMMU DMA unmapping, obtained through unit test[1] with slight
>>> modifications[2], are as follows.
>>>
>>> Base(v6.15):
>>> ./vfio-pci-mem-dma-map 0000:03:00.0 16
>>> ------- AVERAGE (MADV_HUGEPAGE) --------
>>> VFIO MAP DMA in 0.047 s (338.6 GB/s)
>>> VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
>>> ------- AVERAGE (MAP_POPULATE) --------
>>> VFIO MAP DMA in 0.280 s (57.2 GB/s)
>>> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
>>> ------- AVERAGE (HUGETLBFS) --------
>>> VFIO MAP DMA in 0.052 s (308.3 GB/s)
>>> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
>>>
>>> Map[3] + This patchset:
>>> ------- AVERAGE (MADV_HUGEPAGE) --------
>>> VFIO MAP DMA in 0.027 s (598.2 GB/s)
>>> VFIO UNMAP DMA in 0.049 s (328.7 GB/s)
>>> ------- AVERAGE (MAP_POPULATE) --------
>>> VFIO MAP DMA in 0.289 s (55.3 GB/s)
>>> VFIO UNMAP DMA in 0.303 s (52.9 GB/s)
>>> ------- AVERAGE (HUGETLBFS) --------
>>> VFIO MAP DMA in 0.032 s (506.8 GB/s)
>>> VFIO UNMAP DMA in 0.049 s (326.7 GB/s)
>>>
>>> For large folios, we achieve an approximately 64% reduction in VFIO
>>> UNMAP DMA time. For small folios, the performance test results show
>>> no significant change.
>>>
>>> [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
>>> [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
>>> [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
>>>
>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>> ---
>>> drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
>>> 1 file changed, 46 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index e952bf8bdfab..09ecc546ece8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>> return true;
>>> }
>>>
>>> -static int put_pfn(unsigned long pfn, int prot)
>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>> {
>>> - if (!is_invalid_reserved_pfn(pfn)) {
>>> - struct page *page = pfn_to_page(pfn);
>>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>> +}
>>>
>>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>> - return 1;
>>> +/*
>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>> + */
>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>> +{
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>>> + return npages;
>>> }
>>> return 0;
>>> }
>>>
>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>> +{
>>> + return put_pfns(pfn, 1, prot);
>>> +}
>>> +
>>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>
>>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>> @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>> bool do_accounting)
>>> {
>>> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> - long i;
>>>
>>> - for (i = 0; i < npage; i++)
>>> - if (put_pfn(pfn++, dma->prot))
>>> - unlocked++;
>>> + while (npage) {
>>> + long nr_pages = 1;
>>> +
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + struct page *page = pfn_to_page(pfn);
>>> + struct folio *folio = page_folio(page);
>>> + long folio_pages_num = folio_nr_pages(folio);
>>> +
>>> + /*
>>> + * For a folio, it represents a physically
>>> + * contiguous set of bytes, and all of its pages
>>> + * share the same invalid/reserved state.
>>> + *
>>> + * Here, our PFNs are contiguous. Therefore, if we
>>> + * detect that the current PFN belongs to a large
>>> + * folio, we can batch the operations for the next
>>> + * nr_pages PFNs.
>>> + */
>>> + if (folio_pages_num > 1)
>>> + nr_pages = min_t(long, npage,
>>> + folio_pages_num -
>>> + folio_page_idx(folio, page));
>>> +
>>> + _put_pfns(page, nr_pages, dma->prot);
>>
>>
>> This is sneaky. You interpret the page pointer as an actual page array,
>> assuming that it would give you the right values when advancing nr_pages
>> in that array.
>
> Just to add to this: unpin_user_page_range_dirty_lock() is not
> universally safe in the hugetlb scenario I described.
Correction: it is, because gup_folio_range_next() uses nth_page().
So this should work.
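For completeness, paraphrasing the mm/gup.c helper from memory (so
treat it as a sketch rather than the exact source): it advances
through the range with nth_page() instead of plain pointer
arithmetic, which is why the hugetlb case above is handled:

static inline struct folio *gup_folio_range_next(struct page *start,
		unsigned long npages, unsigned long i, unsigned int *ntails)
{
	struct page *next = nth_page(start, i);	/* not start + i */
	struct folio *folio = page_folio(next);
	unsigned int nr = 1;

	if (folio_test_large(folio))
		nr = min_t(unsigned int, npages - i,
			   folio_nr_pages(folio) - folio_page_idx(folio, next));

	*ntails = nr;
	return folio;
}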
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 8:14 ` David Hildenbrand
2025-06-16 8:27 ` David Hildenbrand
@ 2025-06-16 11:13 ` lizhe.67
2025-06-16 11:18 ` David Hildenbrand
1 sibling, 1 reply; 9+ messages in thread
From: lizhe.67 @ 2025-06-16 11:13 UTC (permalink / raw)
To: david; +Cc: alex.williamson, kvm, linux-kernel, lizhe.67, peterx
On Mon, 16 Jun 2025 10:14:23 +0200, david@redhat.com wrote:
> > drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
> > 1 file changed, 46 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e952bf8bdfab..09ecc546ece8 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> > return true;
> > }
> >
> > -static int put_pfn(unsigned long pfn, int prot)
> > +static inline void _put_pfns(struct page *page, int npages, int prot)
> > {
> > - if (!is_invalid_reserved_pfn(pfn)) {
> > - struct page *page = pfn_to_page(pfn);
> > + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> > +}
> >
> > - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> > - return 1;
> > +/*
> > + * The caller must ensure that these npages PFNs belong to the same folio.
> > + */
> > +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> > +{
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + _put_pfns(pfn_to_page(pfn), npages, prot);
> > + return npages;
> > }
> > return 0;
> > }
> >
> > +static inline int put_pfn(unsigned long pfn, int prot)
> > +{
> > + return put_pfns(pfn, 1, prot);
> > +}
> > +
> > #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> >
> > static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> > @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > bool do_accounting)
> > {
> > long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> > - long i;
> >
> > - for (i = 0; i < npage; i++)
> > - if (put_pfn(pfn++, dma->prot))
> > - unlocked++;
> > + while (npage) {
> > + long nr_pages = 1;
> > +
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + struct page *page = pfn_to_page(pfn);
> > + struct folio *folio = page_folio(page);
> > + long folio_pages_num = folio_nr_pages(folio);
> > +
> > + /*
> > + * For a folio, it represents a physically
> > + * contiguous set of bytes, and all of its pages
> > + * share the same invalid/reserved state.
> > + *
> > + * Here, our PFNs are contiguous. Therefore, if we
> > + * detect that the current PFN belongs to a large
> > + * folio, we can batch the operations for the next
> > + * nr_pages PFNs.
> > + */
> > + if (folio_pages_num > 1)
> > + nr_pages = min_t(long, npage,
> > + folio_pages_num -
> > + folio_page_idx(folio, page));
> > +
> > + _put_pfns(page, nr_pages, dma->prot);
>
>
> This is sneaky. You interpret the page pointer as an actual page array,
> assuming that it would give you the right values when advancing nr_pages
> in that array.
>
> This is mostly true, but with !CONFIG_SPARSEMEM_VMEMMAP it is not
> universally true for very large folios (e.g., in a 1 GiB hugetlb folio
> when we cross the 128 MiB mark on x86).
>
> Not sure if that could already trigger here, but it is subtle.
As previously mentioned in the email, the code here functions
correctly.
> > + unlocked += nr_pages;
>
> We could do slightly better here, as we already have the folio. We would
> add a unpin_user_folio_dirty_locked() similar to unpin_user_folio().
>
> Instead of _put_pfns, we would be calling
>
> unpin_user_folio_dirty_locked(folio, nr_pages, dma->prot & IOMMU_WRITE);
Thank you so much for your suggestion. Does this implementation of
unpin_user_folio_dirty_locked() look viable to you?
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fdda6b16263b..567c9dae9088 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
bool make_dirty);
void unpin_user_pages(struct page **pages, unsigned long npages);
void unpin_user_folio(struct folio *folio, unsigned long npages);
+void unpin_user_folio_dirty_locked(struct folio *folio,
+ unsigned long npages, bool make_dirty);
void unpin_folios(struct folio **folios, unsigned long nfolios);
static inline bool is_cow_mapping(vm_flags_t flags)
diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..2f1e14a79463 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -360,11 +360,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
for (i = 0; i < npages; i += nr) {
folio = gup_folio_range_next(page, npages, i, &nr);
- if (make_dirty && !folio_test_dirty(folio)) {
- folio_lock(folio);
- folio_mark_dirty(folio);
- folio_unlock(folio);
- }
+ if (make_dirty && !folio_test_dirty(folio))
+ folio_mark_dirty_lock(folio);
gup_put_folio(folio, nr, FOLL_PIN);
}
}
@@ -435,6 +432,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
}
EXPORT_SYMBOL(unpin_user_folio);
+/**
+ * unpin_user_folio_dirty_locked() - release pages of a folio and
+ * optionally dirty
+ *
+ * @folio: pointer to folio to be released
+ * @npages: number of pages of same folio
+ * @make_dirty: whether to mark the folio dirty
+ *
+ * Mark the folio as being modified if @make_dirty is true. Then
+ * release npages of the folio.
+ */
+void unpin_user_folio_dirty_locked(struct folio *folio,
+ unsigned long npages, bool make_dirty)
+{
+ if (make_dirty && !folio_test_dirty(folio))
+ folio_mark_dirty_lock(folio);
+ gup_put_folio(folio, npages, FOLL_PIN);
+}
+EXPORT_SYMBOL(unpin_user_folio_dirty_locked);
+
/**
* unpin_folios() - release an array of gup-pinned folios.
* @folios: array of folios to be marked dirty and released.
--
Thanks,
Zhe
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 11:13 ` lizhe.67
@ 2025-06-16 11:18 ` David Hildenbrand
2025-06-16 12:07 ` lizhe.67
0 siblings, 1 reply; 9+ messages in thread
From: David Hildenbrand @ 2025-06-16 11:18 UTC (permalink / raw)
To: lizhe.67; +Cc: alex.williamson, kvm, linux-kernel, peterx
On 16.06.25 13:13, lizhe.67@bytedance.com wrote:
> On Mon, 16 Jun 2025 10:14:23 +0200, david@redhat.com wrote:
>
>>> drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
>>> 1 file changed, 46 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index e952bf8bdfab..09ecc546ece8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>> return true;
>>> }
>>>
>>> -static int put_pfn(unsigned long pfn, int prot)
>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>> {
>>> - if (!is_invalid_reserved_pfn(pfn)) {
>>> - struct page *page = pfn_to_page(pfn);
>>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>> +}
>>>
>>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>> - return 1;
>>> +/*
>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>> + */
>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>> +{
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>>> + return npages;
>>> }
>>> return 0;
>>> }
>>>
>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>> +{
>>> + return put_pfns(pfn, 1, prot);
>>> +}
>>> +
>>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>
>>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>> @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>> bool do_accounting)
>>> {
>>> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> - long i;
>>>
>>> - for (i = 0; i < npage; i++)
>>> - if (put_pfn(pfn++, dma->prot))
>>> - unlocked++;
>>> + while (npage) {
>>> + long nr_pages = 1;
>>> +
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + struct page *page = pfn_to_page(pfn);
>>> + struct folio *folio = page_folio(page);
>>> + long folio_pages_num = folio_nr_pages(folio);
>>> +
>>> + /*
>>> + * For a folio, it represents a physically
>>> + * contiguous set of bytes, and all of its pages
>>> + * share the same invalid/reserved state.
>>> + *
>>> + * Here, our PFNs are contiguous. Therefore, if we
>>> + * detect that the current PFN belongs to a large
>>> + * folio, we can batch the operations for the next
>>> + * nr_pages PFNs.
>>> + */
>>> + if (folio_pages_num > 1)
>>> + nr_pages = min_t(long, npage,
>>> + folio_pages_num -
>>> + folio_page_idx(folio, page));
>>> +
>>> + _put_pfns(page, nr_pages, dma->prot);
>>
>>
>> This is sneaky. You interpret the page pointer as an actual page array,
>> assuming that it would give you the right values when advancing nr_pages
>> in that array.
>>
>> This is mostly true, but with !CONFIG_SPARSEMEM_VMEMMAP it is not
>> universally true for very large folios (e.g., in a 1 GiB hugetlb folio
>> when we cross the 128 MiB mark on x86).
>>
>> Not sure if that could already trigger here, but it is subtle.
>
> As previously mentioned in the email, the code here functions
> correctly.
>
>>> + unlocked += nr_pages;
>>
>> We could do slightly better here, as we already have the folio. We would
>> add a unpin_user_folio_dirty_locked() similar to unpin_user_folio().
>>
>> Instead of _put_pfns, we would be calling
>>
>> unpin_user_folio_dirty_locked(folio, nr_pages, dma->prot & IOMMU_WRITE);
>
> Thank you so much for your suggestion. Does this implementation of
> unpin_user_folio_dirty_locked() look viable to you?
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fdda6b16263b..567c9dae9088 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> bool make_dirty);
> void unpin_user_pages(struct page **pages, unsigned long npages);
> void unpin_user_folio(struct folio *folio, unsigned long npages);
> +void unpin_user_folio_dirty_locked(struct folio *folio,
> + unsigned long npages, bool make_dirty);
> void unpin_folios(struct folio **folios, unsigned long nfolios);
>
> static inline bool is_cow_mapping(vm_flags_t flags)
> diff --git a/mm/gup.c b/mm/gup.c
> index 84461d384ae2..2f1e14a79463 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -360,11 +360,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>
> for (i = 0; i < npages; i += nr) {
> folio = gup_folio_range_next(page, npages, i, &nr);
> - if (make_dirty && !folio_test_dirty(folio)) {
> - folio_lock(folio);
> - folio_mark_dirty(folio);
> - folio_unlock(folio);
> - }
> + if (make_dirty && !folio_test_dirty(folio))
> + folio_mark_dirty_lock(folio);
> gup_put_folio(folio, nr, FOLL_PIN);
We can call unpin_user_folio_dirty_locked(). :)
> }
> }
> @@ -435,6 +432,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
> }
> EXPORT_SYMBOL(unpin_user_folio);
>
> +/**
> + * unpin_user_folio_dirty_locked() - release pages of a folio and
> + * optionally dirty
"conditionally mark a folio dirty and unpin it"
Because that's the sequence in which it is done.
> + *
> + * @folio: pointer to folio to be released
> + * @npages: number of pages of same folio
Can we change that to "nrefs" or rather "npins"?
> + * @make_dirty: whether to mark the folio dirty
> + *
> + * Mark the folio as being modified if @make_dirty is true. Then
> + * release npages of the folio.
Similarly, adjust the doc here.
> + */
> +void unpin_user_folio_dirty_locked(struct folio *folio,
> + unsigned long npages, bool make_dirty)
> +{
> + if (make_dirty && !folio_test_dirty(folio))
> + folio_mark_dirty_lock(folio);
> + gup_put_folio(folio, npages, FOLL_PIN);
> +}
> +EXPORT_SYMBOL(unpin_user_folio_dirty_locked);
Yes, should probably go into a separate cleanup patch.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v3 2/2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
2025-06-16 11:18 ` David Hildenbrand
@ 2025-06-16 12:07 ` lizhe.67
0 siblings, 0 replies; 9+ messages in thread
From: lizhe.67 @ 2025-06-16 12:07 UTC (permalink / raw)
To: david, alex.williamson; +Cc: kvm, linux-kernel, lizhe.67, peterx
> On Mon, 16 Jun 2025 13:18:58 +0200, david@redhat.com wrote:
>
> On 16.06.25 13:13, lizhe.67@bytedance.com wrote:
> > On Mon, 16 Jun 2025 10:14:23 +0200, david@redhat.com wrote:
> >
> >>> drivers/vfio/vfio_iommu_type1.c | 55 +++++++++++++++++++++++++++------
> >>> 1 file changed, 46 insertions(+), 9 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index e952bf8bdfab..09ecc546ece8 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>> return true;
> >>> }
> >>>
> >>> -static int put_pfn(unsigned long pfn, int prot)
> >>> +static inline void _put_pfns(struct page *page, int npages, int prot)
> >>> {
> >>> - if (!is_invalid_reserved_pfn(pfn)) {
> >>> - struct page *page = pfn_to_page(pfn);
> >>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> >>> +}
> >>>
> >>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> >>> - return 1;
> >>> +/*
> >>> + * The caller must ensure that these npages PFNs belong to the same folio.
> >>> + */
> >>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> >>> +{
> >>> + if (!is_invalid_reserved_pfn(pfn)) {
> >>> + _put_pfns(pfn_to_page(pfn), npages, prot);
> >>> + return npages;
> >>> }
> >>> return 0;
> >>> }
> >>>
> >>> +static inline int put_pfn(unsigned long pfn, int prot)
> >>> +{
> >>> + return put_pfns(pfn, 1, prot);
> >>> +}
> >>> +
> >>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> >>>
> >>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> >>> @@ -806,11 +817,37 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >>> bool do_accounting)
> >>> {
> >>> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> >>> - long i;
> >>>
> >>> - for (i = 0; i < npage; i++)
> >>> - if (put_pfn(pfn++, dma->prot))
> >>> - unlocked++;
> >>> + while (npage) {
> >>> + long nr_pages = 1;
> >>> +
> >>> + if (!is_invalid_reserved_pfn(pfn)) {
> >>> + struct page *page = pfn_to_page(pfn);
> >>> + struct folio *folio = page_folio(page);
> >>> + long folio_pages_num = folio_nr_pages(folio);
> >>> +
> >>> + /*
> >>> + * For a folio, it represents a physically
> >>> + * contiguous set of bytes, and all of its pages
> >>> + * share the same invalid/reserved state.
> >>> + *
> >>> + * Here, our PFNs are contiguous. Therefore, if we
> >>> + * detect that the current PFN belongs to a large
> >>> + * folio, we can batch the operations for the next
> >>> + * nr_pages PFNs.
> >>> + */
> >>> + if (folio_pages_num > 1)
> >>> + nr_pages = min_t(long, npage,
> >>> + folio_pages_num -
> >>> + folio_page_idx(folio, page));
> >>> +
> >>> + _put_pfns(page, nr_pages, dma->prot);
> >>
> >>
> >> This is sneaky. You interpret the page pointer as an actual page array,
> >> assuming that it would give you the right values when advancing nr_pages
> >> in that array.
> >>
> >> This is mostly true, but with !CONFIG_SPARSEMEM_VMEMMAP it is not
> >> universally true for very large folios (e.g., in a 1 GiB hugetlb folio
> >> when we cross the 128 MiB mark on x86).
> >>
> >> Not sure if that could already trigger here, but it is subtle.
> >
> > As previously mentioned in the email, the code here functions
> > correctly.
> >
> >>> + unlocked += nr_pages;
> >>
> >> We could do slightly better here, as we already have the folio. We would
> >> add a unpin_user_folio_dirty_locked() similar to unpin_user_folio().
> >>
> >> Instead of _put_pfns, we would be calling
> >>
> >> unpin_user_folio_dirty_locked(folio, nr_pages, dma->prot & IOMMU_WRITE);
> >
> > Thank you so much for your suggestion. Does this implementation of
> > unpin_user_folio_dirty_locked() look viable to you?
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index fdda6b16263b..567c9dae9088 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> > bool make_dirty);
> > void unpin_user_pages(struct page **pages, unsigned long npages);
> > void unpin_user_folio(struct folio *folio, unsigned long npages);
> > +void unpin_user_folio_dirty_locked(struct folio *folio,
> > + unsigned long npages, bool make_dirty);
> > void unpin_folios(struct folio **folios, unsigned long nfolios);
> >
> > static inline bool is_cow_mapping(vm_flags_t flags)
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 84461d384ae2..2f1e14a79463 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -360,11 +360,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> >
> > for (i = 0; i < npages; i += nr) {
> > folio = gup_folio_range_next(page, npages, i, &nr);
> > - if (make_dirty && !folio_test_dirty(folio)) {
> > - folio_lock(folio);
> > - folio_mark_dirty(folio);
> > - folio_unlock(folio);
> > - }
> > + if (make_dirty && !folio_test_dirty(folio))
> > + folio_mark_dirty_lock(folio);
> > gup_put_folio(folio, nr, FOLL_PIN);
>
> We can call unpin_user_folio_dirty_locked(). :)
>
> > }
> > }
> > @@ -435,6 +432,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
> > }
> > EXPORT_SYMBOL(unpin_user_folio);
> >
> > +/**
> > + * unpin_user_folio_dirty_locked() - release pages of a folio and
> > + * optionally dirty
>
> "conditionally mark a folio dirty and unpin it"
>
> Because that's the sequence in which it is done.
>
> > + *
> > + * @folio: pointer to folio to be released
> > + * @npages: number of pages of same folio
>
> Can we change that to "nrefs" or rather "npins"?
>
> > + * @make_dirty: whether to mark the folio dirty
> > + *
> > + * Mark the folio as being modified if @make_dirty is true. Then
> > + * release npages of the folio.
>
> Similarly, adjust the doc here.
>
> > + */
> > +void unpin_user_folio_dirty_locked(struct folio *folio,
> > + unsigned long npages, bool make_dirty)
> > +{
> > + if (make_dirty && !folio_test_dirty(folio))
> > + folio_mark_dirty_lock(folio);
> > + gup_put_folio(folio, npages, FOLL_PIN);
> > +}
> > +EXPORT_SYMBOL(unpin_user_folio_dirty_locked);
>
> Yes, should probably go into a separate cleanup patch.
Thank you very much for your suggestions. The revised implementation
is as follows.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fdda6b16263b..567c9dae9088 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
bool make_dirty);
void unpin_user_pages(struct page **pages, unsigned long npages);
void unpin_user_folio(struct folio *folio, unsigned long npages);
+void unpin_user_folio_dirty_locked(struct folio *folio,
+ unsigned long npages, bool make_dirty);
void unpin_folios(struct folio **folios, unsigned long nfolios);
static inline bool is_cow_mapping(vm_flags_t flags)
diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..3d25b2dcbe85 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
for (i = 0; i < npages; i += nr) {
folio = gup_folio_range_next(page, npages, i, &nr);
- if (make_dirty && !folio_test_dirty(folio)) {
- folio_lock(folio);
- folio_mark_dirty(folio);
- folio_unlock(folio);
- }
- gup_put_folio(folio, nr, FOLL_PIN);
+ unpin_user_folio_dirty_locked(folio, nr, make_dirty);
}
}
EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
@@ -435,6 +430,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
}
EXPORT_SYMBOL(unpin_user_folio);
+/**
+ * unpin_user_folio_dirty_locked() - conditionally mark a folio
+ * dirty and unpin it
+ *
+ * @folio: pointer to folio to be released
+ * @nrefs: number of pages of same folio
+ * @make_dirty: whether to mark the folio dirty
+ *
+ * Mark the folio as being modified if @make_dirty is true. Then
+ * release nrefs of the folio.
+ */
+void unpin_user_folio_dirty_locked(struct folio *folio,
+ unsigned long nrefs, bool make_dirty)
+{
+ if (make_dirty && !folio_test_dirty(folio))
+ folio_mark_dirty_lock(folio);
+ gup_put_folio(folio, nrefs, FOLL_PIN);
+}
+EXPORT_SYMBOL(unpin_user_folio_dirty_locked);
+
/**
* unpin_folios() - release an array of gup-pinned folios.
* @folios: array of folios to be marked dirty and released.
--
Hi David and Alex, if there are no further issues, I will include this
patch as a separate second patch in v4, inserted between the existing
two patches. Thank you for your review!
Thanks,
Zhe
^ permalink raw reply related [flat|nested] 9+ messages in thread