linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/3] optimize vfio_unpin_pages_remote() for large folio
@ 2025-06-17  4:18 lizhe.67
  2025-06-17  4:18 ` [PATCH v4 1/3] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  4:18 UTC (permalink / raw)
  To: alex.williamson, akpm, david, peterx
  Cc: kvm, linux-kernel, linux-mm, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

This patchset is based on the patch 'vfio/type1: optimize
vfio_pin_pages_remote() for large folios'[1].

When vfio_unpin_pages_remote() is called with a range of addresses
that includes large folios, the function currently performs an
individual put_pfn() operation for each page. This can lead to
significant performance overhead, especially when dealing with large
ranges of pages. We can optimize this process by batching the
put_pfn() operations.
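
As a rough illustration of the direction (the actual change is in
patch 3 of this series), the unpin path goes from one put_pfn() per
4KB page to one batched unpin per folio-sized run of pfns:

	/* Current: one put_pfn() per page, even inside a large folio. */
	for (i = 0; i < npage; i++)
		if (put_pfn(pfn++, dma->prot))
			unlocked++;

	/* Batched (sketch of patch 3): drop a folio-sized run of pins at once. */
	while (npage) {
		long nr_pages = 1;

		if (!is_invalid_reserved_pfn(pfn)) {
			struct page *page = pfn_to_page(pfn);
			struct folio *folio = page_folio(page);

			nr_pages = min_t(long, npage, folio_nr_pages(folio) -
					 folio_page_idx(folio, page));
			unpin_user_folio_dirty_locked(folio, nr_pages,
						      dma->prot & IOMMU_WRITE);
			unlocked += nr_pages;
		}
		pfn += nr_pages;
		npage -= nr_pages;
	}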

The first patch batches the vfio_find_vpfn() calls in
vfio_unpin_pages_remote(). Performance testing indicates that this
patch has little impact on its own, primarily because the vpfn rb
tree is generally empty. Nevertheless, it can still offer benefits in
certain scenarios and lays the groundwork for the third patch. The
second patch introduces a new interface,
unpin_user_folio_dirty_locked(), which conditionally marks a folio as
dirty and unpins it; it is used by the third patch. The third patch
uses this interface to optimize vfio_unpin_pages_remote() for large
folio scenarios.

The following performance results, based on v6.15, measure the time
to complete a 16G VFIO IOMMU DMA unmap using the unit test[2] with
slight modifications[3].

Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (338.6 GB/s)
VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (308.3 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)

Map[1] + First patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (596.1 GB/s)
VFIO UNMAP DMA in 0.138 s (115.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.292 s (54.8 GB/s)
VFIO UNMAP DMA in 0.310 s (51.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (506.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.1 GB/s)

Map[1] + This patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (563.9 GB/s)
VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.033 s (485.1 GB/s)
VFIO UNMAP DMA in 0.049 s (324.4 GB/s)

The first patch appears to have negligible impact on the performance
of VFIO UNMAP DMA.

With the second and third patches, we achieve an approximately 64%
performance improvement in the VFIO UNMAP DMA item for large folios.
For small folios, the results show no significant change.

[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[3]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/

Changelogs:

v3->v4:
- Introduce a new interface unpin_user_folio_dirty_locked(). Its
  purpose is to conditionally mark a folio as dirty and unpin it.
  This interface will be called in the VFIO DMA unmap process.
- Revert the related changes to put_pfn().
- Update the performance test results.

v2->v3:
- Split the original patch into two separate patches.
- Add several comments specific to large folio scenarios.
- Rename two variables.
- Remove the iova update within the loop in
  vfio_unpin_pages_remote().
- Update the performance test results.

v1->v2:
- Refactor the implementation of the optimized code.

v3: https://lore.kernel.org/all/20250616075251.89067-1-lizhe.67@bytedance.com/
v2: https://lore.kernel.org/all/20250610045753.6405-1-lizhe.67@bytedance.com/
v1: https://lore.kernel.org/all/20250605124923.21896-1-lizhe.67@bytedance.com/

Li Zhe (3):
  vfio/type1: batch vfio_find_vpfn() in function
    vfio_unpin_pages_remote()
  gup: introduce unpin_user_folio_dirty_locked()
  vfio/type1: optimize vfio_unpin_pages_remote() for large folio

 drivers/vfio/vfio_iommu_type1.c | 37 ++++++++++++++++++++++++++-------
 include/linux/mm.h              |  2 ++
 mm/gup.c                        | 27 ++++++++++++++++++------
 3 files changed, 53 insertions(+), 13 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/3] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote()
  2025-06-17  4:18 [PATCH v4 0/3] optimize vfio_unpin_pages_remote() for large folio lizhe.67
@ 2025-06-17  4:18 ` lizhe.67
  2025-06-17  4:18 ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
  2025-06-17  4:18 ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
  2 siblings, 0 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  4:18 UTC (permalink / raw)
  To: alex.williamson, akpm, david, peterx
  Cc: kvm, linux-kernel, linux-mm, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

This patch is based on the patch 'vfio/type1: optimize
vfio_pin_pages_remote() for large folios'[1].

The function vpfn_pages() determines the number of vpfn nodes in the
vpfn rb tree within a specified range. This allows us to avoid
looking up each vpfn individually in vfio_unpin_pages_remote(). This
patch therefore batches the vfio_find_vpfn() calls in
vfio_unpin_pages_remote().

[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
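
Note: vpfn_pages() itself comes from the prerequisite series [1] and
is not shown in this thread. As a rough editorial sketch of what such
a helper can look like (the real implementation in [1] may differ),
it counts the vfio_pfn nodes whose iova falls inside the range with a
single walk of dma->pfn_list instead of one rb-tree lookup per page:

	/* Editorial sketch only; see [1] for the actual helper. */
	static long vpfn_pages(struct vfio_dma *dma, dma_addr_t iova, long npage)
	{
		dma_addr_t end = iova + ((dma_addr_t)npage << PAGE_SHIFT);
		struct rb_node *node;
		long count = 0;

		/* A real version would seek directly to the first node at or above iova. */
		for (node = rb_first(&dma->pfn_list); node; node = rb_next(node)) {
			struct vfio_pfn *vpfn = rb_entry(node, struct vfio_pfn, node);

			if (vpfn->iova >= end)
				break;
			if (vpfn->iova >= iova)
				count++;
		}
		return count;
	}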

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 drivers/vfio/vfio_iommu_type1.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..e952bf8bdfab 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    unsigned long pfn, unsigned long npage,
 				    bool do_accounting)
 {
-	long unlocked = 0, locked = 0;
+	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
 	long i;
 
-	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
-		if (put_pfn(pfn++, dma->prot)) {
+	for (i = 0; i < npage; i++)
+		if (put_pfn(pfn++, dma->prot))
 			unlocked++;
-			if (vfio_find_vpfn(dma, iova))
-				locked++;
-		}
-	}
 
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17  4:18 [PATCH v4 0/3] optimize vfio_unpin_pages_remote() for large folio lizhe.67
  2025-06-17  4:18 ` [PATCH v4 1/3] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
@ 2025-06-17  4:18 ` lizhe.67
  2025-06-17  7:35   ` David Hildenbrand
  2025-06-17 13:42   ` Jason Gunthorpe
  2025-06-17  4:18 ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
  2 siblings, 2 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  4:18 UTC (permalink / raw)
  To: alex.williamson, akpm, david, peterx
  Cc: kvm, linux-kernel, linux-mm, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

Introduce a new interface, unpin_user_folio_dirty_locked(). This
interface is similar to unpin_user_folio(), but it adds the
capability to conditionally mark a folio as dirty. VFIO will utilize
this interface to accelerate the VFIO DMA unmap process.
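
For context, patch 3 of this series calls it from the VFIO unpin path
roughly like this (folio and nr_pages are computed there per run of
contiguous pfns):

	/* Dirty the folio only if the DMA mapping was writable, then drop the pins. */
	unpin_user_folio_dirty_locked(folio, nr_pages, dma->prot & IOMMU_WRITE);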

Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 include/linux/mm.h |  2 ++
 mm/gup.c           | 27 +++++++++++++++++++++------
 2 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fdda6b16263b..242b05671502 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 				      bool make_dirty);
 void unpin_user_pages(struct page **pages, unsigned long npages);
 void unpin_user_folio(struct folio *folio, unsigned long npages);
+void unpin_user_folio_dirty_locked(struct folio *folio,
+		unsigned long npins, bool make_dirty);
 void unpin_folios(struct folio **folios, unsigned long nfolios);
 
 static inline bool is_cow_mapping(vm_flags_t flags)
diff --git a/mm/gup.c b/mm/gup.c
index 84461d384ae2..15debead5f5b 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
 
 	for (i = 0; i < npages; i += nr) {
 		folio = gup_folio_range_next(page, npages, i, &nr);
-		if (make_dirty && !folio_test_dirty(folio)) {
-			folio_lock(folio);
-			folio_mark_dirty(folio);
-			folio_unlock(folio);
-		}
-		gup_put_folio(folio, nr, FOLL_PIN);
+		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
 	}
 }
 EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
@@ -435,6 +430,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
 }
 EXPORT_SYMBOL(unpin_user_folio);
 
+/**
+ * unpin_user_folio_dirty_locked() - conditionally mark a folio
+ * dirty and unpin it
+ *
+ * @folio:  pointer to folio to be released
+ * @npins:  number of pins
+ * @make_dirty: whether to mark the folio dirty
+ *
+ * Mark the folio as being modified if @make_dirty is true. Then
+ * release npins of the folio.
+ */
+void unpin_user_folio_dirty_locked(struct folio *folio,
+		unsigned long npins, bool make_dirty)
+{
+	if (make_dirty && !folio_test_dirty(folio))
+		folio_mark_dirty_lock(folio);
+	gup_put_folio(folio, npins, FOLL_PIN);
+}
+EXPORT_SYMBOL_GPL(unpin_user_folio_dirty_locked);
+
 /**
  * unpin_folios() - release an array of gup-pinned folios.
  * @folios:  array of folios to be marked dirty and released.
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17  4:18 [PATCH v4 0/3] optimize vfio_unpin_pages_remote() for large folio lizhe.67
  2025-06-17  4:18 ` [PATCH v4 1/3] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
  2025-06-17  4:18 ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
@ 2025-06-17  4:18 ` lizhe.67
  2025-06-17  7:43   ` David Hildenbrand
  2 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  4:18 UTC (permalink / raw)
  To: alex.williamson, akpm, david, peterx
  Cc: kvm, linux-kernel, linux-mm, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs an individual
put_pfn() operation for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

This patch optimizes this process by batching the put_pfn() operations.

The following performance results, based on v6.15, measure the time to
complete a 16G VFIO IOMMU DMA unmap using the unit test[1] with slight
modifications[2].

Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.047 s (338.6 GB/s)
VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.280 s (57.2 GB/s)
VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (308.3 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)

Map[3] + This patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (563.9 GB/s)
VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.294 s (54.4 GB/s)
VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.033 s (485.1 GB/s)
VFIO UNMAP DMA in 0.049 s (324.4 GB/s)

For large folios, we achieve an approximately 64% performance
improvement in the VFIO UNMAP DMA item. For small folios, the results
show no significant change.

[1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
[3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
 drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e952bf8bdfab..159ba80082a8 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 				    bool do_accounting)
 {
 	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
-	long i;
 
-	for (i = 0; i < npage; i++)
-		if (put_pfn(pfn++, dma->prot))
-			unlocked++;
+	while (npage) {
+		long nr_pages = 1;
+
+		if (!is_invalid_reserved_pfn(pfn)) {
+			struct page *page = pfn_to_page(pfn);
+			struct folio *folio = page_folio(page);
+			long folio_pages_num = folio_nr_pages(folio);
+
+			/*
+			 * For a folio, it represents a physically
+			 * contiguous set of bytes, and all of its pages
+			 * share the same invalid/reserved state.
+			 *
+			 * Here, our PFNs are contiguous. Therefore, if we
+			 * detect that the current PFN belongs to a large
+			 * folio, we can batch the operations for the next
+			 * nr_pages PFNs.
+			 */
+			if (folio_pages_num > 1)
+				nr_pages = min_t(long, npage,
+					folio_pages_num -
+					folio_page_idx(folio, page));
+
+			unpin_user_folio_dirty_locked(folio, nr_pages,
+					dma->prot & IOMMU_WRITE);
+			unlocked += nr_pages;
+		}
+
+		pfn += nr_pages;
+		npage -= nr_pages;
+	}
 
 	if (do_accounting)
 		vfio_lock_acct(dma, locked - unlocked, true);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17  4:18 ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
@ 2025-06-17  7:35   ` David Hildenbrand
  2025-06-17 13:42   ` Jason Gunthorpe
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17  7:35 UTC (permalink / raw)
  To: lizhe.67, alex.williamson, akpm, peterx; +Cc: kvm, linux-kernel, linux-mm

On 17.06.25 06:18, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> Introduce a new interface, unpin_user_folio_dirty_locked(). This
> interface is similar to unpin_user_folio(), but it adds the
> capability to conditionally mark a folio as dirty. VFIO will utilize
> this interface to accelerate the VFIO DMA unmap process.
> 
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>   include/linux/mm.h |  2 ++
>   mm/gup.c           | 27 +++++++++++++++++++++------
>   2 files changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fdda6b16263b..242b05671502 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1689,6 +1689,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>   				      bool make_dirty);
>   void unpin_user_pages(struct page **pages, unsigned long npages);
>   void unpin_user_folio(struct folio *folio, unsigned long npages);
> +void unpin_user_folio_dirty_locked(struct folio *folio,
> +		unsigned long npins, bool make_dirty);
>   void unpin_folios(struct folio **folios, unsigned long nfolios);
>   
>   static inline bool is_cow_mapping(vm_flags_t flags)
> diff --git a/mm/gup.c b/mm/gup.c
> index 84461d384ae2..15debead5f5b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>   
>   	for (i = 0; i < npages; i += nr) {
>   		folio = gup_folio_range_next(page, npages, i, &nr);
> -		if (make_dirty && !folio_test_dirty(folio)) {
> -			folio_lock(folio);
> -			folio_mark_dirty(folio);
> -			folio_unlock(folio);
> -		}
> -		gup_put_folio(folio, nr, FOLL_PIN);
> +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
>   	}
>   }
>   EXPORT_SYMBOL(unpin_user_page_range_dirty_lock);
> @@ -435,6 +430,26 @@ void unpin_user_folio(struct folio *folio, unsigned long npages)
>   }
>   EXPORT_SYMBOL(unpin_user_folio);
>   
> +/**
> + * unpin_user_folio_dirty_locked() - conditionally mark a folio
> + * dirty and unpin it
> + *
> + * @folio:  pointer to folio to be released
> + * @npins:  number of pins
> + * @make_dirty: whether to mark the folio dirty
> + *
> + * Mark the folio as being modified if @make_dirty is true. Then
> + * release npins of the folio.
> + */
> +void unpin_user_folio_dirty_locked(struct folio *folio,
> +		unsigned long npins, bool make_dirty)
> +{
> +	if (make_dirty && !folio_test_dirty(folio))
> +		folio_mark_dirty_lock(folio);
> +	gup_put_folio(folio, npins, FOLL_PIN);
> +}
> +EXPORT_SYMBOL_GPL(unpin_user_folio_dirty_locked);
> +
>   /**
>    * unpin_folios() - release an array of gup-pinned folios.
>    * @folios:  array of folios to be marked dirty and released.

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17  4:18 ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
@ 2025-06-17  7:43   ` David Hildenbrand
  2025-06-17  9:21     ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17  7:43 UTC (permalink / raw)
  To: lizhe.67, alex.williamson, akpm, peterx; +Cc: kvm, linux-kernel, linux-mm

On 17.06.25 06:18, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> put_pfn() operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
> 
> This patch optimize this process by batching the put_pfn() operations.
> 
> The performance test results, based on v6.15, for completing the 16G VFIO
> IOMMU DMA unmapping, obtained through unit test[1] with slight
> modifications[2], are as follows.
> 
> Base(v6.15):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.047 s (338.6 GB/s)
> VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.280 s (57.2 GB/s)
> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.052 s (308.3 GB/s)
> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
> 
> Map[3] + This patchset:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.028 s (563.9 GB/s)
> VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.294 s (54.4 GB/s)
> VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.033 s (485.1 GB/s)
> VFIO UNMAP DMA in 0.049 s (324.4 GB/s)
> 
> For large folio, we achieve an approximate 64% performance improvement
> in the VFIO UNMAP DMA item. For small folios, the performance test
> results appear to show no significant changes.
> 
> [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
> [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
> [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>   drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
>   1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e952bf8bdfab..159ba80082a8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>   				    bool do_accounting)
>   {
>   	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> -	long i;
>   
> -	for (i = 0; i < npage; i++)
> -		if (put_pfn(pfn++, dma->prot))
> -			unlocked++;
> +	while (npage) {
> +		long nr_pages = 1;
> +
> +		if (!is_invalid_reserved_pfn(pfn)) {
> +			struct page *page = pfn_to_page(pfn);
> +			struct folio *folio = page_folio(page);
> +			long folio_pages_num = folio_nr_pages(folio);
> +
> +			/*
> +			 * For a folio, it represents a physically
> +			 * contiguous set of bytes, and all of its pages
> +			 * share the same invalid/reserved state.
> +			 *
> +			 * Here, our PFNs are contiguous. Therefore, if we
> +			 * detect that the current PFN belongs to a large
> +			 * folio, we can batch the operations for the next
> +			 * nr_pages PFNs.
> +			 */
> +			if (folio_pages_num > 1)
> +				nr_pages = min_t(long, npage,
> +					folio_pages_num -
> +					folio_page_idx(folio, page));
> +

(I know I can be a pain :) )

But the long comment indicates that this is confusing.


That is essentially the logic in gup_folio_range_next().

What about factoring that out into a helper like

/*
  * TODO, returned number includes the provided current page.
  */
unsigned long folio_remaining_pages(struct folio *folio,
	struct page *page, unsigned long max_pages)
{
	if (!folio_test_large(folio))
		return 1;
	return min_t(unsigned long, max_pages,
		     folio_nr_pages(folio) - folio_page_idx(folio, page));
}


Then here you would do

if (!is_invalid_reserved_pfn(pfn)) {
	struct page *page = pfn_to_page(pfn);
	struct folio *folio = page_folio(page);

	/* We can batch-process pages belonging to the same folio. */
	nr_pages = folio_remaining_pages(folio, page, npage);

	unpin_user_folio_dirty_locked(folio, nr_pages,
				      dma->prot & IOMMU_WRITE);
	unlocked += nr_pages;
}

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17  7:43   ` David Hildenbrand
@ 2025-06-17  9:21     ` lizhe.67
  2025-06-17  9:27       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  9:21 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
 
> On 17.06.25 06:18, lizhe.67@bytedance.com wrote:
> > From: Li Zhe <lizhe.67@bytedance.com>
> > 
> > When vfio_unpin_pages_remote() is called with a range of addresses that
> > includes large folios, the function currently performs individual
> > put_pfn() operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> > 
> > This patch optimize this process by batching the put_pfn() operations.
> > 
> > The performance test results, based on v6.15, for completing the 16G VFIO
> > IOMMU DMA unmapping, obtained through unit test[1] with slight
> > modifications[2], are as follows.
> > 
> > Base(v6.15):
> > ./vfio-pci-mem-dma-map 0000:03:00.0 16
> > ------- AVERAGE (MADV_HUGEPAGE) --------
> > VFIO MAP DMA in 0.047 s (338.6 GB/s)
> > VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
> > ------- AVERAGE (MAP_POPULATE) --------
> > VFIO MAP DMA in 0.280 s (57.2 GB/s)
> > VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
> > ------- AVERAGE (HUGETLBFS) --------
> > VFIO MAP DMA in 0.052 s (308.3 GB/s)
> > VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
> > 
> > Map[3] + This patchset:
> > ------- AVERAGE (MADV_HUGEPAGE) --------
> > VFIO MAP DMA in 0.028 s (563.9 GB/s)
> > VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
> > ------- AVERAGE (MAP_POPULATE) --------
> > VFIO MAP DMA in 0.294 s (54.4 GB/s)
> > VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
> > ------- AVERAGE (HUGETLBFS) --------
> > VFIO MAP DMA in 0.033 s (485.1 GB/s)
> > VFIO UNMAP DMA in 0.049 s (324.4 GB/s)
> > 
> > For large folio, we achieve an approximate 64% performance improvement
> > in the VFIO UNMAP DMA item. For small folios, the performance test
> > results appear to show no significant changes.
> > 
> > [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
> > [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
> > [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
> > 
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > ---
> >   drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
> >   1 file changed, 31 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e952bf8bdfab..159ba80082a8 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >   				    bool do_accounting)
> >   {
> >   	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> > -	long i;
> >   
> > -	for (i = 0; i < npage; i++)
> > -		if (put_pfn(pfn++, dma->prot))
> > -			unlocked++;
> > +	while (npage) {
> > +		long nr_pages = 1;
> > +
> > +		if (!is_invalid_reserved_pfn(pfn)) {
> > +			struct page *page = pfn_to_page(pfn);
> > +			struct folio *folio = page_folio(page);
> > +			long folio_pages_num = folio_nr_pages(folio);
> > +
> > +			/*
> > +			 * For a folio, it represents a physically
> > +			 * contiguous set of bytes, and all of its pages
> > +			 * share the same invalid/reserved state.
> > +			 *
> > +			 * Here, our PFNs are contiguous. Therefore, if we
> > +			 * detect that the current PFN belongs to a large
> > +			 * folio, we can batch the operations for the next
> > +			 * nr_pages PFNs.
> > +			 */
> > +			if (folio_pages_num > 1)
> > +				nr_pages = min_t(long, npage,
> > +					folio_pages_num -
> > +					folio_page_idx(folio, page));
> > +
> 
> (I know I can be a pain :) )

No, not at all! I really appreciate you taking the time to review my
patch.

> But the long comment indicates that this is confusing.
> 
> 
> That is essentially the logic in gup_folio_range_next().
> 
> What about factoring that out into a helper like
> 
> /*
>   * TODO, returned number includes the provided current page.
>   */
> unsigned long folio_remaining_pages(struct folio *folio,
> 	struct pages *pages, unsigned long max_pages)
> {
> 	if (!folio_test_large(folio))
> 		return 1;
> 	return min_t(unsigned long, max_pages,
> 		     folio_nr_pages(folio) - folio_page_idx(folio, page));
> }
> 
> 
> Then here you would do
> 
> if (!is_invalid_reserved_pfn(pfn)) {
> 	struct page *page = pfn_to_page(pfn);
> 	struct folio *folio = page_folio(page);
> 
> 	/* We can batch-process pages belonging to the same folio. */
> 	nr_pages = folio_remaining_pages(folio, page, npage);
> 
> 	unpin_user_folio_dirty_locked(folio, nr_pages,
> 				      dma->prot & IOMMU_WRITE);
> 	unlocked += nr_pages;
> }

Yes, this indeed makes the code much more comprehensible. Does the
following patch implementation look viable to you? I have added a
brief comment on top of your work to explain why we can batch-process
pages belonging to the same folio, as suggested by Alex[1].

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e952bf8bdfab..d7653f4c10d5 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
        return pinned;
 }
 
+/* Returned number includes the provided current page. */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+               struct page *page, unsigned long max_pages)
+{
+       if (!folio_test_large(folio))
+               return 1;
+       return min_t(unsigned long, max_pages,
+                    folio_nr_pages(folio) - folio_page_idx(folio, page));
+}
+
 static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
                                    unsigned long pfn, unsigned long npage,
                                    bool do_accounting)
 {
        long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
-       long i;
 
-       for (i = 0; i < npage; i++)
-               if (put_pfn(pfn++, dma->prot))
-                       unlocked++;
+       while (npage) {
+               unsigned long nr_pages = 1;
+
+               if (!is_invalid_reserved_pfn(pfn)) {
+                       struct page *page = pfn_to_page(pfn);
+                       struct folio *folio = page_folio(page);
+
+                       /*
+                        * We can batch-process pages belonging to the same
+                        * folio because all pages within a folio share the
+                        * same invalid/reserved state.
+                        */
+                       nr_pages = folio_remaining_pages(folio, page, npage);
+                       unpin_user_folio_dirty_locked(folio, nr_pages,
+                                       dma->prot & IOMMU_WRITE);
+                       unlocked += nr_pages;
+               }
+
+               pfn += nr_pages;
+               npage -= nr_pages;
+       }
 
        if (do_accounting)
		vfio_lock_acct(dma, locked - unlocked, true);
---

Thanks,
Zhe

[1]: https://lore.kernel.org/all/20250613113818.584bec0a.alex.williamson@redhat.com/

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17  9:21     ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
@ 2025-06-17  9:27       ` David Hildenbrand
  2025-06-17  9:47         ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17  9:27 UTC (permalink / raw)
  To: lizhe.67; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 17.06.25 11:21, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
>   
>> On 17.06.25 06:18, lizhe.67@bytedance.com wrote:
>>> From: Li Zhe <lizhe.67@bytedance.com>
>>>
>>> When vfio_unpin_pages_remote() is called with a range of addresses that
>>> includes large folios, the function currently performs individual
>>> put_pfn() operations for each page. This can lead to significant
>>> performance overheads, especially when dealing with large ranges of pages.
>>>
>>> This patch optimize this process by batching the put_pfn() operations.
>>>
>>> The performance test results, based on v6.15, for completing the 16G VFIO
>>> IOMMU DMA unmapping, obtained through unit test[1] with slight
>>> modifications[2], are as follows.
>>>
>>> Base(v6.15):
>>> ./vfio-pci-mem-dma-map 0000:03:00.0 16
>>> ------- AVERAGE (MADV_HUGEPAGE) --------
>>> VFIO MAP DMA in 0.047 s (338.6 GB/s)
>>> VFIO UNMAP DMA in 0.138 s (116.2 GB/s)
>>> ------- AVERAGE (MAP_POPULATE) --------
>>> VFIO MAP DMA in 0.280 s (57.2 GB/s)
>>> VFIO UNMAP DMA in 0.312 s (51.3 GB/s)
>>> ------- AVERAGE (HUGETLBFS) --------
>>> VFIO MAP DMA in 0.052 s (308.3 GB/s)
>>> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
>>>
>>> Map[3] + This patchset:
>>> ------- AVERAGE (MADV_HUGEPAGE) --------
>>> VFIO MAP DMA in 0.028 s (563.9 GB/s)
>>> VFIO UNMAP DMA in 0.049 s (325.1 GB/s)
>>> ------- AVERAGE (MAP_POPULATE) --------
>>> VFIO MAP DMA in 0.294 s (54.4 GB/s)
>>> VFIO UNMAP DMA in 0.296 s (54.1 GB/s)
>>> ------- AVERAGE (HUGETLBFS) --------
>>> VFIO MAP DMA in 0.033 s (485.1 GB/s)
>>> VFIO UNMAP DMA in 0.049 s (324.4 GB/s)
>>>
>>> For large folio, we achieve an approximate 64% performance improvement
>>> in the VFIO UNMAP DMA item. For small folios, the performance test
>>> results appear to show no significant changes.
>>>
>>> [1]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
>>> [2]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
>>> [3]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
>>>
>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>> ---
>>>    drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
>>>    1 file changed, 31 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index e952bf8bdfab..159ba80082a8 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>>    				    bool do_accounting)
>>>    {
>>>    	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> -	long i;
>>>    
>>> -	for (i = 0; i < npage; i++)
>>> -		if (put_pfn(pfn++, dma->prot))
>>> -			unlocked++;
>>> +	while (npage) {
>>> +		long nr_pages = 1;
>>> +
>>> +		if (!is_invalid_reserved_pfn(pfn)) {
>>> +			struct page *page = pfn_to_page(pfn);
>>> +			struct folio *folio = page_folio(page);
>>> +			long folio_pages_num = folio_nr_pages(folio);
>>> +
>>> +			/*
>>> +			 * For a folio, it represents a physically
>>> +			 * contiguous set of bytes, and all of its pages
>>> +			 * share the same invalid/reserved state.
>>> +			 *
>>> +			 * Here, our PFNs are contiguous. Therefore, if we
>>> +			 * detect that the current PFN belongs to a large
>>> +			 * folio, we can batch the operations for the next
>>> +			 * nr_pages PFNs.
>>> +			 */
>>> +			if (folio_pages_num > 1)
>>> +				nr_pages = min_t(long, npage,
>>> +					folio_pages_num -
>>> +					folio_page_idx(folio, page));
>>> +
>>
>> (I know I can be a pain :) )
> 
> No, not at all! I really appreciate you taking the time to review my
> patch.
> 
>> But the long comment indicates that this is confusing.
>>
>>
>> That is essentially the logic in gup_folio_range_next().
>>
>> What about factoring that out into a helper like
>>
>> /*
>>    * TODO, returned number includes the provided current page.
>>    */
>> unsigned long folio_remaining_pages(struct folio *folio,
>> 	struct pages *pages, unsigned long max_pages)
>> {
>> 	if (!folio_test_large(folio))
>> 		return 1;
>> 	return min_t(unsigned long, max_pages,
>> 		     folio_nr_pages(folio) - folio_page_idx(folio, page));
>> }
>>
>>
>> Then here you would do
>>
>> if (!is_invalid_reserved_pfn(pfn)) {
>> 	struct page *page = pfn_to_page(pfn);
>> 	struct folio *folio = page_folio(page);
>>
>> 	/* We can batch-process pages belonging to the same folio. */
>> 	nr_pages = folio_remaining_pages(folio, page, npage);
>>
>> 	unpin_user_folio_dirty_locked(folio, nr_pages,
>> 				      dma->prot & IOMMU_WRITE);
>> 	unlocked += nr_pages;
>> }
> 
> Yes, this indeed makes the code much more comprehensible. Do you think
> the implementation of the patch as follows look viable to you? I have
> added some brief comments on top of your work to explain why we can
> batch-process pages belonging to the same folio. This was suggested by
> Alex[1].
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e952bf8bdfab..d7653f4c10d5 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>          return pinned;
>   }
>   
> +/* Returned number includes the provided current page. */
> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> +               struct page *page, unsigned long max_pages)
> +{
> +       if (!folio_test_large(folio))
> +               return 1;
> +       return min_t(unsigned long, max_pages,
> +                    folio_nr_pages(folio) - folio_page_idx(folio, page));
> +}

Note that I think that should go somewhere into mm.h, and also get used
by GUP. So factor it out from GUP and then use it here.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17  9:27       ` David Hildenbrand
@ 2025-06-17  9:47         ` lizhe.67
  2025-06-17  9:49           ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-17  9:47 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index e952bf8bdfab..d7653f4c10d5 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >          return pinned;
> >   }
> >   
> > +/* Returned number includes the provided current page. */
> > +static inline unsigned long folio_remaining_pages(struct folio *folio,
> > +               struct page *page, unsigned long max_pages)
> > +{
> > +       if (!folio_test_large(folio))
> > +               return 1;
> > +       return min_t(unsigned long, max_pages,
> > +                    folio_nr_pages(folio) - folio_page_idx(folio, page));
> > +}
> 
> Note that I think that should go somewhere into mm.h, and also get used 
> by GUP. So factoring it out from GUP and then using it here.

I think I need to separate this out into a distinct patch within the
patchset. Is that correct?

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17  9:47         ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
@ 2025-06-17  9:49           ` David Hildenbrand
  2025-06-17 12:42             ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17  9:49 UTC (permalink / raw)
  To: lizhe.67; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 17.06.25 11:47, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
>   
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index e952bf8bdfab..d7653f4c10d5 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>>           return pinned;
>>>    }
>>>    
>>> +/* Returned number includes the provided current page. */
>>> +static inline unsigned long folio_remaining_pages(struct folio *folio,
>>> +               struct page *page, unsigned long max_pages)
>>> +{
>>> +       if (!folio_test_large(folio))
>>> +               return 1;
>>> +       return min_t(unsigned long, max_pages,
>>> +                    folio_nr_pages(folio) - folio_page_idx(folio, page));
>>> +}
>>
>> Note that I think that should go somewhere into mm.h, and also get used
>> by GUP. So factoring it out from GUP and then using it here.
> 
> I think I need to separate this out into a distinct patch within the
> patchset. Is that correct?

Yes, that's what I would do.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17  9:49           ` David Hildenbrand
@ 2025-06-17 12:42             ` lizhe.67
  2025-06-17 13:47               ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-17 12:42 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Tue, 17 Jun 2025 11:49:43 +0200, david@redhat.com wrote:
 
> On 17.06.25 11:47, lizhe.67@bytedance.com wrote:
> > On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
> >   
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index e952bf8bdfab..d7653f4c10d5 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >>>           return pinned;
> >>>    }
> >>>    
> >>> +/* Returned number includes the provided current page. */
> >>> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> >>> +               struct page *page, unsigned long max_pages)
> >>> +{
> >>> +       if (!folio_test_large(folio))
> >>> +               return 1;
> >>> +       return min_t(unsigned long, max_pages,
> >>> +                    folio_nr_pages(folio) - folio_page_idx(folio, page));
> >>> +}
> >>
> >> Note that I think that should go somewhere into mm.h, and also get used
> >> by GUP. So factoring it out from GUP and then using it here.
> > 
> > I think I need to separate this out into a distinct patch within the
> > patchset. Is that correct?
> 
> Yes, that's what I would do.

What do you think of this implementation?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 242b05671502..eb91f99ea973 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
        return folio_large_nr_pages(folio);
 }
 
+/*
+ * folio_remaining_pages - Counts the number of pages from a given
+ * start page to the end of the folio.
+ *
+ * @folio: Pointer to folio
+ * @start_page: The starting page from which to begin counting.
+ *
+ * Returned number includes the provided start page.
+ *
+ * The caller must ensure that @start_page belongs to @folio.
+ */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+               struct page *start_page)
+{
+       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
+}
+
 /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
diff --git a/mm/gup.c b/mm/gup.c
index 15debead5f5b..14ae2e3088b4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
 
        if (folio_test_large(folio))
                nr = min_t(unsigned int, npages - i,
-                          folio_nr_pages(folio) - folio_page_idx(folio, next));
+                          folio_remaining_pages(folio, next));
 
        *ntails = nr;
        return folio;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d2..34e85258060c 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
                                return page;
                        }
 
-                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
+                       skip_pages = folio_remaining_pages(folio, page);
                        pfn += skip_pages - 1;
                        continue;
                }
---

Thanks,
Zhe

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17  4:18 ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
  2025-06-17  7:35   ` David Hildenbrand
@ 2025-06-17 13:42   ` Jason Gunthorpe
  2025-06-17 13:45     ` David Hildenbrand
  1 sibling, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 13:42 UTC (permalink / raw)
  To: lizhe.67
  Cc: alex.williamson, akpm, david, peterx, kvm, linux-kernel, linux-mm

On Tue, Jun 17, 2025 at 12:18:20PM +0800, lizhe.67@bytedance.com wrote:

> @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>  
>  	for (i = 0; i < npages; i += nr) {
>  		folio = gup_folio_range_next(page, npages, i, &nr);
> -		if (make_dirty && !folio_test_dirty(folio)) {
> -			folio_lock(folio);
> -			folio_mark_dirty(folio);
> -			folio_unlock(folio);
> -		}
> -		gup_put_folio(folio, nr, FOLL_PIN);
> +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
>  	}

I don't think we should call an exported function here - this is a
fast path for rdma and iommfd, I don't want to see it degrade to save
three duplicated lines :\

Make the new function an inline?
Duplicate the lines?

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17 13:42   ` Jason Gunthorpe
@ 2025-06-17 13:45     ` David Hildenbrand
  2025-06-17 13:58       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17 13:45 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhe.67
  Cc: alex.williamson, akpm, peterx, kvm, linux-kernel, linux-mm

On 17.06.25 15:42, Jason Gunthorpe wrote:
> On Tue, Jun 17, 2025 at 12:18:20PM +0800, lizhe.67@bytedance.com wrote:
> 
>> @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>>   
>>   	for (i = 0; i < npages; i += nr) {
>>   		folio = gup_folio_range_next(page, npages, i, &nr);
>> -		if (make_dirty && !folio_test_dirty(folio)) {
>> -			folio_lock(folio);
>> -			folio_mark_dirty(folio);
>> -			folio_unlock(folio);
>> -		}
>> -		gup_put_folio(folio, nr, FOLL_PIN);
>> +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
>>   	}
> 
> I don't think we should call an exported function here - this is a
> fast path for rdma and iommfd, I don't want to see it degrade to save
> three duplicated lines :\

Any way to quantify? In theory, the compiler could still optimize this 
within the same file, no?

> 
> Make the new function an inline?

That of course also works.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17 12:42             ` lizhe.67
@ 2025-06-17 13:47               ` David Hildenbrand
  2025-06-18  6:11                 ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17 13:47 UTC (permalink / raw)
  To: lizhe.67; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 17.06.25 14:42, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 11:49:43 +0200, david@redhat.com wrote:
>   
>> On 17.06.25 11:47, lizhe.67@bytedance.com wrote:
>>> On Tue, 17 Jun 2025 09:43:56 +0200, david@redhat.com wrote:
>>>    
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index e952bf8bdfab..d7653f4c10d5 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -801,16 +801,43 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>>>>            return pinned;
>>>>>     }
>>>>>     
>>>>> +/* Returned number includes the provided current page. */
>>>>> +static inline unsigned long folio_remaining_pages(struct folio *folio,
>>>>> +               struct page *page, unsigned long max_pages)
>>>>> +{
>>>>> +       if (!folio_test_large(folio))
>>>>> +               return 1;
>>>>> +       return min_t(unsigned long, max_pages,
>>>>> +                    folio_nr_pages(folio) - folio_page_idx(folio, page));
>>>>> +}
>>>>
>>>> Note that I think that should go somewhere into mm.h, and also get used
>>>> by GUP. So factoring it out from GUP and then using it here.
>>>
>>> I think I need to separate this out into a distinct patch within the
>>> patchset. Is that correct?
>>
>> Yes, that's what I would do.
> 
> How do you think of this implementation?
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 242b05671502..eb91f99ea973 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
>          return folio_large_nr_pages(folio);
>   }
>   
> +/*
> + * folio_remaining_pages - Counts the number of pages from a given
> + * start page to the end of the folio.
> + *
> + * @folio: Pointer to folio
> + * @start_page: The starting page from which to begin counting.
> + *
> + * Returned number includes the provided start page.
> + *
> + * The caller must ensure that @start_page belongs to @folio.
> + */
> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> +               struct page *start_page)
> +{
> +       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
> +}
> +
>   /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>   #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>   #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
> diff --git a/mm/gup.c b/mm/gup.c
> index 15debead5f5b..14ae2e3088b4 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
>   
>          if (folio_test_large(folio))
>                  nr = min_t(unsigned int, npages - i,
> -                          folio_nr_pages(folio) - folio_page_idx(folio, next));
> +                          folio_remaining_pages(folio, next));
>   
>          *ntails = nr;
>          return folio;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b2fc5266e3d2..34e85258060c 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>                                  return page;
>                          }
>   
> -                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> +                       skip_pages = folio_remaining_pages(folio, page);
>                          pfn += skip_pages - 1;
>                          continue;
>                  }
> ---

Guess I would have pulled the "min" in there, but passing something like 
ULONG_MAX for the page_isolation case also looks rather ugly.
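
For reference, "pulling the min in" would mean something like this
(editorial sketch), which is exactly what makes the page_isolation
call site awkward:

	static inline unsigned long folio_remaining_pages(struct folio *folio,
			struct page *page, unsigned long max_pages)
	{
		return min_t(unsigned long, max_pages,
			     folio_nr_pages(folio) - folio_page_idx(folio, page));
	}

	/* page_isolation would then have to pass a dummy bound: */
	skip_pages = folio_remaining_pages(folio, page, ULONG_MAX);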

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17 13:45     ` David Hildenbrand
@ 2025-06-17 13:58       ` David Hildenbrand
  2025-06-17 14:04         ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17 13:58 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhe.67
  Cc: alex.williamson, akpm, peterx, kvm, linux-kernel, linux-mm

On 17.06.25 15:45, David Hildenbrand wrote:
> On 17.06.25 15:42, Jason Gunthorpe wrote:
>> On Tue, Jun 17, 2025 at 12:18:20PM +0800, lizhe.67@bytedance.com wrote:
>>
>>> @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>>>    
>>>    	for (i = 0; i < npages; i += nr) {
>>>    		folio = gup_folio_range_next(page, npages, i, &nr);
>>> -		if (make_dirty && !folio_test_dirty(folio)) {
>>> -			folio_lock(folio);
>>> -			folio_mark_dirty(folio);
>>> -			folio_unlock(folio);
>>> -		}
>>> -		gup_put_folio(folio, nr, FOLL_PIN);
>>> +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
>>>    	}
>>
>> I don't think we should call an exported function here - this is a
>> fast path for rdma and iommfd, I don't want to see it degrade to save
>> three duplicated lines :\
> 
> Any way to quantify? In theory, the compiler could still optimize this
> within the same file, no?

Looking at the compiler output, I think the compiler is doing exactly that.

Unless my objdump -D -S analysis skills are seriously degraded :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17 13:58       ` David Hildenbrand
@ 2025-06-17 14:04         ` David Hildenbrand
  2025-06-17 15:22           ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-17 14:04 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhe.67
  Cc: alex.williamson, akpm, peterx, kvm, linux-kernel, linux-mm

On 17.06.25 15:58, David Hildenbrand wrote:
> On 17.06.25 15:45, David Hildenbrand wrote:
>> On 17.06.25 15:42, Jason Gunthorpe wrote:
>>> On Tue, Jun 17, 2025 at 12:18:20PM +0800, lizhe.67@bytedance.com wrote:
>>>
>>>> @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>>>>     
>>>>     	for (i = 0; i < npages; i += nr) {
>>>>     		folio = gup_folio_range_next(page, npages, i, &nr);
>>>> -		if (make_dirty && !folio_test_dirty(folio)) {
>>>> -			folio_lock(folio);
>>>> -			folio_mark_dirty(folio);
>>>> -			folio_unlock(folio);
>>>> -		}
>>>> -		gup_put_folio(folio, nr, FOLL_PIN);
>>>> +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
>>>>     	}
>>>
>>> I don't think we should call an exported function here - this is a
>>> fast path for rdma and iommfd, I don't want to see it degrade to save
>>> three duplicated lines :\
>>
>> Any way to quantify? In theory, the compiler could still optimize this
>> within the same file, no?
> 
> Looking at the compiler output, I think the compile is doing exactly that.
> 
> Unless my obdjump -D -S analysis skills are seriously degraded :)

FWIW, while already looking at this, even before this change, the 
compiler does not inline gup_put_folio() into this function, which is a 
bit unexpected.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17 14:04         ` David Hildenbrand
@ 2025-06-17 15:22           ` Jason Gunthorpe
  2025-06-18  6:28             ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-17 15:22 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: lizhe.67, alex.williamson, akpm, peterx, kvm, linux-kernel,
	linux-mm

On Tue, Jun 17, 2025 at 04:04:26PM +0200, David Hildenbrand wrote:
> On 17.06.25 15:58, David Hildenbrand wrote:
> > On 17.06.25 15:45, David Hildenbrand wrote:
> > > On 17.06.25 15:42, Jason Gunthorpe wrote:
> > > > On Tue, Jun 17, 2025 at 12:18:20PM +0800, lizhe.67@bytedance.com wrote:
> > > > 
> > > > > @@ -360,12 +360,7 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
> > > > >     	for (i = 0; i < npages; i += nr) {
> > > > >     		folio = gup_folio_range_next(page, npages, i, &nr);
> > > > > -		if (make_dirty && !folio_test_dirty(folio)) {
> > > > > -			folio_lock(folio);
> > > > > -			folio_mark_dirty(folio);
> > > > > -			folio_unlock(folio);
> > > > > -		}
> > > > > -		gup_put_folio(folio, nr, FOLL_PIN);
> > > > > +		unpin_user_folio_dirty_locked(folio, nr, make_dirty);
> > > > >     	}
> > > > 
> > > > I don't think we should call an exported function here - this is a
> > > > fast path for rdma and iommfd, I don't want to see it degrade to save
> > > > three duplicated lines :\
> > > 
> > > Any way to quantify? In theory, the compiler could still optimize this
> > > within the same file, no?
> > 
> > Looking at the compiler output, I think the compile is doing exactly that.
> > 
> > Unless my obdjump -D -S analysis skills are seriously degraded :)
> 
> FWIW, while already looking at this, even before this change, the compiler
> does not inline gup_put_folio() into this function, which is a bit
> unexpected.

Weird, but I would not expect this as a general rule, not sure we
should rely on it.

I would say exported functions should not get automatically
inlined. That throws all the kprobes into chaos :\

BTW, why can't the other patches in this series just use
unpin_user_page_range_dirty_lock()? The way this stuff is supposed to
work is to combine adjacent physical addresses and then invoke
unpin_user_page_range_dirty_lock() on the start page of the physical
range. This is why we have gup_folio_range_next(), which does the
segmentation in an efficient way.

Combining adjacent physical addresses is basically free math.

Segmenting into folios on the VFIO side doesn't make a lot of sense,
IMHO.
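
A minimal sketch of that range-based approach (editorial illustration,
not code from this series), assuming the pfns handed to
vfio_unpin_pages_remote() are physically contiguous:

	while (npage) {
		unsigned long nr = 1;

		if (!is_invalid_reserved_pfn(pfn)) {
			/* Extend the run over contiguous, non-reserved pfns. */
			while (nr < npage && !is_invalid_reserved_pfn(pfn + nr))
				nr++;
			unpin_user_page_range_dirty_lock(pfn_to_page(pfn), nr,
							 dma->prot & IOMMU_WRITE);
			unlocked += nr;
		}
		pfn += nr;
		npage -= nr;
	}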

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-17 13:47               ` David Hildenbrand
@ 2025-06-18  6:11                 ` lizhe.67
  2025-06-18  7:22                   ` lizhe.67
  2025-06-18  8:54                   ` David Hildenbrand
  0 siblings, 2 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-18  6:11 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Tue, 17 Jun 2025 15:47:09 +0200, david@redhat.com wrote:
 
> > What do you think of this implementation?
> > 
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 242b05671502..eb91f99ea973 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
> >          return folio_large_nr_pages(folio);
> >   }
> >   
> > +/*
> > + * folio_remaining_pages - Counts the number of pages from a given
> > + * start page to the end of the folio.
> > + *
> > + * @folio: Pointer to folio
> > + * @start_page: The starting page from which to begin counting.
> > + *
> > + * Returned number includes the provided start page.
> > + *
> > + * The caller must ensure that @start_page belongs to @folio.
> > + */
> > +static inline unsigned long folio_remaining_pages(struct folio *folio,
> > +               struct page *start_page)
> > +{
> > +       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
> > +}
> > +
> >   /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> >   #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> >   #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 15debead5f5b..14ae2e3088b4 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
> >   
> >          if (folio_test_large(folio))
> >                  nr = min_t(unsigned int, npages - i,
> > -                          folio_nr_pages(folio) - folio_page_idx(folio, next));
> > +                          folio_remaining_pages(folio, next));
> >   
> >          *ntails = nr;
> >          return folio;
> > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > index b2fc5266e3d2..34e85258060c 100644
> > --- a/mm/page_isolation.c
> > +++ b/mm/page_isolation.c
> > @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> >                                  return page;
> >                          }
> >   
> > -                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> > +                       skip_pages = folio_remaining_pages(folio, page);
> >                          pfn += skip_pages - 1;
> >                          continue;
> >                  }
> > ---
> 
> Guess I would have pulled the "min" in there, but passing something like 
> ULONG_MAX for the page_isolation case also looks rather ugly.

Yes, the page_isolation case does not require the 'min' logic. Since
there are already places in the current kernel code where
folio_remaining_pages() is used without needing min, we could simply
create a custom wrapper function based on folio_remaining_pages() only
in those specific scenarios where min is necessary.

Following this line of thinking, the wrapper function in vfio would
look something like this.

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -801,16 +801,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
        return pinned;
 }
 
+static inline unsigned long vfio_folio_remaining_pages(
+               struct folio *folio, struct page *start_page,
+               unsigned long max_pages)
+{
+       if (!folio_test_large(folio))
+               return 1;
+       return min(max_pages,
+                  folio_remaining_pages(folio, start_page));
+}
+

---

Does this approach seem acceptable to you?

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-17 15:22           ` Jason Gunthorpe
@ 2025-06-18  6:28             ` lizhe.67
  2025-06-18  8:20               ` David Hildenbrand
  2025-06-18 11:36               ` Jason Gunthorpe
  0 siblings, 2 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-18  6:28 UTC (permalink / raw)
  To: jgg, david
  Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
 
> Weird, but I would not expect this as a general rule, not sure we
> should rely on it.
> 
> I would say exported function should not get automatically
> inlined. That throws all the kprobes into chaos :\
> 
> BTW, why can't the other patches in this series just use
> unpin_user_page_range_dirty_lock? The way this stuff is supposed to
> work is to combine adjacent physical addresses and then invoke
> unpin_user_page_range_dirty_lock() on the start page of the physical
> range. This is why we have the gup_folio_range_next() which does the
> segmentation in an efficient way.
> 
> Combining adjacent physical addresses is basically free math.
> 
> Segmenting to folios in the vfio side doesn't make a lot of sense,
> IMHO.
> 
>  drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
>  1 file changed, 31 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index e952bf8bdfab..159ba80082a8 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>  				    bool do_accounting)
>  {
>  	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> -	long i;
>  
> -	for (i = 0; i < npage; i++)
> -		if (put_pfn(pfn++, dma->prot))
> -			unlocked++;
> +	while (npage) {
> +		long nr_pages = 1;
> +
> +		if (!is_invalid_reserved_pfn(pfn)) {
> +			struct page *page = pfn_to_page(pfn);
> +			struct folio *folio = page_folio(page);
> +			long folio_pages_num = folio_nr_pages(folio);
> +
> +			/*
> +			 * For a folio, it represents a physically
> +			 * contiguous set of bytes, and all of its pages
> +			 * share the same invalid/reserved state.
> +			 *
> +			 * Here, our PFNs are contiguous. Therefore, if we
> +			 * detect that the current PFN belongs to a large
> +			 * folio, we can batch the operations for the next
> +			 * nr_pages PFNs.
> +			 */
> +			if (folio_pages_num > 1)
> +				nr_pages = min_t(long, npage,
> +					folio_pages_num -
> +					folio_page_idx(folio, page));
> +
> +			unpin_user_folio_dirty_locked(folio, nr_pages,
> +					dma->prot & IOMMU_WRITE);

Are you suggesting that we should directly call
unpin_user_page_range_dirty_lock() here (patch 3/3) instead?

BTW, it appears that implementing unpin_user_folio_dirty_locked()
as an inline function may not be viable for vfio, given that
gup_put_folio() is not exported.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-18  6:11                 ` lizhe.67
@ 2025-06-18  7:22                   ` lizhe.67
  2025-06-18  8:54                   ` David Hildenbrand
  1 sibling, 0 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-18  7:22 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx,
	lizhe.67

On Wed, 18 Jun 2025 14:11:43 +0800, lizhe.67@bytedance.com wrote:
 
> > > What do you think of this implementation?
> > > 
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 242b05671502..eb91f99ea973 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
> > >          return folio_large_nr_pages(folio);
> > >   }
> > >   
> > > +/*
> > > + * folio_remaining_pages - Counts the number of pages from a given
> > > + * start page to the end of the folio.
> > > + *
> > > + * @folio: Pointer to folio
> > > + * @start_page: The starting page from which to begin counting.
> > > + *
> > > + * Returned number includes the provided start page.
> > > + *
> > > + * The caller must ensure that @start_page belongs to @folio.
> > > + */
> > > +static inline unsigned long folio_remaining_pages(struct folio *folio,
> > > +               struct page *start_page)
> > > +{
> > > +       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
> > > +}
> > > +
> > >   /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> > >   #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> > >   #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index 15debead5f5b..14ae2e3088b4 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
> > >   
> > >          if (folio_test_large(folio))
> > >                  nr = min_t(unsigned int, npages - i,
> > > -                          folio_nr_pages(folio) - folio_page_idx(folio, next));
> > > +                          folio_remaining_pages(folio, next));
> > >   
> > >          *ntails = nr;
> > >          return folio;
> > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > index b2fc5266e3d2..34e85258060c 100644
> > > --- a/mm/page_isolation.c
> > > +++ b/mm/page_isolation.c
> > > @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> > >                                  return page;
> > >                          }
> > >   
> > > -                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> > > +                       skip_pages = folio_remaining_pages(folio, page);
> > >                          pfn += skip_pages - 1;
> > >                          continue;
> > >                  }
> > > ---
> > 
> > Guess I would have pulled the "min" in there, but passing something like 
> > ULONG_MAX for the page_isolation case also looks rather ugly.
> 
> Yes, the page_isolation case does not require the 'min' logic. Since
> there are already places in the current kernel code where
> folio_remaining_pages() is used without needing min, we could simply
> create a custom wrapper function based on folio_remaining_pages() only
> in those specific scenarios where min is necessary.
> 
> Following this line of thinking, the wrapper function in vfio would
> look something like this.
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -801,16 +801,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>         return pinned;
>  }
>  
> +static inline unsigned long vfio_folio_remaining_pages(
> +               struct folio *folio, struct page *start_page,
> +               unsigned long max_pages)
> +{
> +       if (!folio_test_large(folio))
> +               return 1;

The above two lines may no longer be necessary.

> +       return min(max_pages,
> +                  folio_remaining_pages(folio, start_page));
> +}

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18  6:28             ` lizhe.67
@ 2025-06-18  8:20               ` David Hildenbrand
  2025-06-18 11:36               ` Jason Gunthorpe
  1 sibling, 0 replies; 34+ messages in thread
From: David Hildenbrand @ 2025-06-18  8:20 UTC (permalink / raw)
  To: lizhe.67, jgg; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 18.06.25 08:28, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
>   
>> Weird, but I would not expect this as a general rule, not sure we
>> should rely on it.
>>
>> I would say exported function should not get automatically
>> inlined. That throws all the kprobes into chaos :\
>>
>> BTW, why can't the other patches in this series just use
>> unpin_user_page_range_dirty_lock? The way this stuff is supposed to
>> work is to combine adjacent physical addresses and then invoke
>> unpin_user_page_range_dirty_lock() on the start page of the physical
>> range. This is why we have the gup_folio_range_next() which does the
>> segmentation in an efficient way.
>>
>> Combining adjacent physical addresses is basically free math.
>>
>> Segmenting to folios in the vfio side doesn't make a lot of sense,
>> IMHO.
>>
>>   drivers/vfio/vfio_iommu_type1.c | 35 +++++++++++++++++++++++++++++----
>>   1 file changed, 31 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index e952bf8bdfab..159ba80082a8 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -806,11 +806,38 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>   				    bool do_accounting)
>>   {
>>   	long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>> -	long i;
>>   
>> -	for (i = 0; i < npage; i++)
>> -		if (put_pfn(pfn++, dma->prot))
>> -			unlocked++;
>> +	while (npage) {
>> +		long nr_pages = 1;
>> +
>> +		if (!is_invalid_reserved_pfn(pfn)) {
>> +			struct page *page = pfn_to_page(pfn);
>> +			struct folio *folio = page_folio(page);
>> +			long folio_pages_num = folio_nr_pages(folio);
>> +
>> +			/*
>> +			 * For a folio, it represents a physically
>> +			 * contiguous set of bytes, and all of its pages
>> +			 * share the same invalid/reserved state.
>> +			 *
>> +			 * Here, our PFNs are contiguous. Therefore, if we
>> +			 * detect that the current PFN belongs to a large
>> +			 * folio, we can batch the operations for the next
>> +			 * nr_pages PFNs.
>> +			 */
>> +			if (folio_pages_num > 1)
>> +				nr_pages = min_t(long, npage,
>> +					folio_pages_num -
>> +					folio_page_idx(folio, page));
>> +
>> +			unpin_user_folio_dirty_locked(folio, nr_pages,
>> +					dma->prot & IOMMU_WRITE);
> 
> Are you suggesting that we should directly call
> unpin_user_page_range_dirty_lock() here (patch 3/3) instead?
> 
> BTW, it appears that implementing unpin_user_folio_dirty_locked()
> as an inline function may not be viable for vfio, given that
> gup_put_folio() is not exported.

The compiler seems to properly inline like before, so I think we can 
keep that. @Jason correct me if I am wrong.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-18  6:11                 ` lizhe.67
  2025-06-18  7:22                   ` lizhe.67
@ 2025-06-18  8:54                   ` David Hildenbrand
  2025-06-18  9:39                     ` lizhe.67
  1 sibling, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-18  8:54 UTC (permalink / raw)
  To: lizhe.67; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 18.06.25 08:11, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 15:47:09 +0200, david@redhat.com wrote:
>   
>>> What do you think of this implementation?
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 242b05671502..eb91f99ea973 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
>>>           return folio_large_nr_pages(folio);
>>>    }
>>>    
>>> +/*
>>> + * folio_remaining_pages - Counts the number of pages from a given
>>> + * start page to the end of the folio.
>>> + *
>>> + * @folio: Pointer to folio
>>> + * @start_page: The starting page from which to begin counting.
>>> + *
>>> + * Returned number includes the provided start page.
>>> + *
>>> + * The caller must ensure that @start_page belongs to @folio.
>>> + */
>>> +static inline unsigned long folio_remaining_pages(struct folio *folio,
>>> +               struct page *start_page)
>>> +{
>>> +       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
>>> +}
>>> +
>>>    /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>>>    #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>    #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
>>> diff --git a/mm/gup.c b/mm/gup.c
>>> index 15debead5f5b..14ae2e3088b4 100644
>>> --- a/mm/gup.c
>>> +++ b/mm/gup.c
>>> @@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
>>>    
>>>           if (folio_test_large(folio))
>>>                   nr = min_t(unsigned int, npages - i,
>>> -                          folio_nr_pages(folio) - folio_page_idx(folio, next));
>>> +                          folio_remaining_pages(folio, next));
>>>    
>>>           *ntails = nr;
>>>           return folio;
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index b2fc5266e3d2..34e85258060c 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>>                                   return page;
>>>                           }
>>>    
>>> -                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
>>> +                       skip_pages = folio_remaining_pages(folio, page);
>>>                           pfn += skip_pages - 1;
>>>                           continue;
>>>                   }
>>> ---
>>
>> Guess I would have pulled the "min" in there, but passing something like
>> ULONG_MAX for the page_isolation case also looks rather ugly.
> 
> Yes, the page_isolation case does not require the 'min' logic. Since
> there are already places in the current kernel code where
> folio_remaining_pages() is used without needing min, we could simply
> create a custom wrapper function based on folio_remaining_pages() only
> in those specific scenarios where min is necessary.

I would just do something like that, removing gup_folio_range_next() completely.

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 98a606908307b..6c224b1c6c169 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1995,6 +1995,32 @@ static inline long folio_nr_pages(const struct folio *folio)
  	return folio_large_nr_pages(folio);
  }
  
+/**
+ * folio_remaining_pages - The remaining number of pages in the folio.
+ * @folio: The folio.
+ * @page: The folio page to start the counting.
+ * @max_npages: Limit how far to count.
+ *
+ * The returned number will include the provided page.
+ *
+ * The caller must ensure that @page belongs to @folio. When setting
+ * @max_npages to ULONG_MAX, the parameter is effectively ignored.
+ *
+ * Return: The remaining number of pages, limited by @max_npages.
+ */
+static inline unsigned long folio_remaining_pages(struct folio *folio,
+		struct page *page, unsigned long max_npages)
+{
+	unsigned long nr;
+
+	if (!folio_test_large(folio))
+		return 1;
+	nr = folio_large_nr_pages(folio) - folio_page_idx(folio, page);
+	if (__builtin_constant_p(max_npages) && max_npages == ULONG_MAX)
+		return nr;
+	return min_t(unsigned long, nr, max_npages);
+}
+
  /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
  #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
  #define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
diff --git a/mm/gup.c b/mm/gup.c
index 6888e871a74a9..3385428d028f6 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -234,21 +234,6 @@ void folio_add_pin(struct folio *folio)
  	}
  }
  
-static inline struct folio *gup_folio_range_next(struct page *start,
-		unsigned long npages, unsigned long i, unsigned int *ntails)
-{
-	struct page *next = nth_page(start, i);
-	struct folio *folio = page_folio(next);
-	unsigned int nr = 1;
-
-	if (folio_test_large(folio))
-		nr = min_t(unsigned int, npages - i,
-			   folio_nr_pages(folio) - folio_page_idx(folio, next));
-
-	*ntails = nr;
-	return folio;
-}
-
  static inline struct folio *gup_folio_next(struct page **list,
  		unsigned long npages, unsigned long i, unsigned int *ntails)
  {
@@ -356,11 +341,13 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
  				      bool make_dirty)
  {
  	unsigned long i;
-	struct folio *folio;
  	unsigned int nr;
  
  	for (i = 0; i < npages; i += nr) {
-		folio = gup_folio_range_next(page, npages, i, &nr);
+		struct page *next = nth_page(page, i);
+		struct folio *folio = page_folio(next);
+
+		nr = folio_remaining_pages(folio, next, npages - i);
  		if (make_dirty && !folio_test_dirty(folio)) {
  			folio_lock(folio);
  			folio_mark_dirty(folio);
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index ece3bfc56bcd5..6f75defb38d73 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
  				return page;
  			}
  
-			skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
+			skip_pages = folio_remaining_pages(folio, page, ULONG_MAX);
  			pfn += skip_pages - 1;
  			continue;
  		}
-- 


But up to you, anything with such a helper function makes the code easier to digest.

-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
  2025-06-18  8:54                   ` David Hildenbrand
@ 2025-06-18  9:39                     ` lizhe.67
  0 siblings, 0 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-18  9:39 UTC (permalink / raw)
  To: david; +Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Wed, 18 Jun 2025 10:54:38 +0200, david@redhat.com wrote:
 
> On 18.06.25 08:11, lizhe.67@bytedance.com wrote:
> > On Tue, 17 Jun 2025 15:47:09 +0200, david@redhat.com wrote:
> >   
> >>> What do you think of this implementation?
> >>>
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index 242b05671502..eb91f99ea973 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -2165,6 +2165,23 @@ static inline long folio_nr_pages(const struct folio *folio)
> >>>           return folio_large_nr_pages(folio);
> >>>    }
> >>>    
> >>> +/*
> >>> + * folio_remaining_pages - Counts the number of pages from a given
> >>> + * start page to the end of the folio.
> >>> + *
> >>> + * @folio: Pointer to folio
> >>> + * @start_page: The starting page from which to begin counting.
> >>> + *
> >>> + * Returned number includes the provided start page.
> >>> + *
> >>> + * The caller must ensure that @start_page belongs to @folio.
> >>> + */
> >>> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> >>> +               struct page *start_page)
> >>> +{
> >>> +       return folio_nr_pages(folio) - folio_page_idx(folio, start_page);
> >>> +}
> >>> +
> >>>    /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
> >>>    #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> >>>    #define MAX_FOLIO_NR_PAGES     (1UL << PUD_ORDER)
> >>> diff --git a/mm/gup.c b/mm/gup.c
> >>> index 15debead5f5b..14ae2e3088b4 100644
> >>> --- a/mm/gup.c
> >>> +++ b/mm/gup.c
> >>> @@ -242,7 +242,7 @@ static inline struct folio *gup_folio_range_next(struct page *start,
> >>>    
> >>>           if (folio_test_large(folio))
> >>>                   nr = min_t(unsigned int, npages - i,
> >>> -                          folio_nr_pages(folio) - folio_page_idx(folio, next));
> >>> +                          folio_remaining_pages(folio, next));
> >>>    
> >>>           *ntails = nr;
> >>>           return folio;
> >>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> >>> index b2fc5266e3d2..34e85258060c 100644
> >>> --- a/mm/page_isolation.c
> >>> +++ b/mm/page_isolation.c
> >>> @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> >>>                                   return page;
> >>>                           }
> >>>    
> >>> -                       skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> >>> +                       skip_pages = folio_remaining_pages(folio, page);
> >>>                           pfn += skip_pages - 1;
> >>>                           continue;
> >>>                   }
> >>> ---
> >>
> >> Guess I would have pulled the "min" in there, but passing something like
> >> ULONG_MAX for the page_isolation case also looks rather ugly.
> > 
> > Yes, the page_isolation case does not require the 'min' logic. Since
> > there are already places in the current kernel code where
> > folio_remaining_pages() is used without needing min, we could simply
> > create a custom wrapper function based on folio_remaining_pages() only
> > in those specific scenarios where min is necessary.
> 
> I would just do something like that, removing gup_folio_range_next() completely.
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 98a606908307b..6c224b1c6c169 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1995,6 +1995,32 @@ static inline long folio_nr_pages(const struct folio *folio)
>   	return folio_large_nr_pages(folio);
>   }
>   
> +/**
> + * folio_remaining_pages - The remaining number of pages in the folio.
> + * @folio: The folio.
> + * @page: The folio page to start the counting.
> + * @max_npages: Limit how far to count.
> + *
> + * The returned number will include the provided page.
> + *
> + * The caller must ensure that @page belongs to @folio. When setting
> + * @max_npages to ULONG_MAX, the parameter is effectively ignored.
> + *
> + * Return: The remaining number of pages, limited by @max_npages.
> + */
> +static inline unsigned long folio_remaining_pages(struct folio *folio,
> +		struct page *page, unsigned long max_npages)
> +{
> +	unsigned long nr;
> +
> +	if (!folio_test_large(folio))
> +		return 1;
> +	nr = folio_large_nr_pages(folio) - folio_page_idx(folio, page);
> +	if (__builtin_constant_p(max_npages) && max_npages == ULONG_MAX)
> +		return nr;
> +	return min_t(unsigned long, nr, max_npages);
> +}
> +
>   /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>   #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>   #define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
> diff --git a/mm/gup.c b/mm/gup.c
> index 6888e871a74a9..3385428d028f6 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -234,21 +234,6 @@ void folio_add_pin(struct folio *folio)
>   	}
>   }
>   
> -static inline struct folio *gup_folio_range_next(struct page *start,
> -		unsigned long npages, unsigned long i, unsigned int *ntails)
> -{
> -	struct page *next = nth_page(start, i);
> -	struct folio *folio = page_folio(next);
> -	unsigned int nr = 1;
> -
> -	if (folio_test_large(folio))
> -		nr = min_t(unsigned int, npages - i,
> -			   folio_nr_pages(folio) - folio_page_idx(folio, next));
> -
> -	*ntails = nr;
> -	return folio;
> -}
> -
>   static inline struct folio *gup_folio_next(struct page **list,
>   		unsigned long npages, unsigned long i, unsigned int *ntails)
>   {
> @@ -356,11 +341,13 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages,
>   				      bool make_dirty)
>   {
>   	unsigned long i;
> -	struct folio *folio;
>   	unsigned int nr;
>   
>   	for (i = 0; i < npages; i += nr) {
> -		folio = gup_folio_range_next(page, npages, i, &nr);
> +		struct page *next = nth_page(page, i);
> +		struct folio *folio = page_folio(next);
> +
> +		nr = folio_remaining_pages(folio, next, npages - i);
>   		if (make_dirty && !folio_test_dirty(folio)) {
>   			folio_lock(folio);
>   			folio_mark_dirty(folio);
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index ece3bfc56bcd5..6f75defb38d73 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -96,7 +96,7 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>   				return page;
>   			}
>   
> -			skip_pages = folio_nr_pages(folio) - folio_page_idx(folio, page);
> +			skip_pages = folio_remaining_pages(folio, page, ULONG_MAX);
>   			pfn += skip_pages - 1;
>   			continue;
>   		}
> -- 
> 
> 
> But up to you, anything with such a helper function makes the code easier to digest.

Thank you. This implementation looks better. Please allow me to
integrate this patch into my patchset.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18  6:28             ` lizhe.67
  2025-06-18  8:20               ` David Hildenbrand
@ 2025-06-18 11:36               ` Jason Gunthorpe
  2025-06-18 11:40                 ` David Hildenbrand
  1 sibling, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 11:36 UTC (permalink / raw)
  To: lizhe.67
  Cc: david, akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On Wed, Jun 18, 2025 at 02:28:20PM +0800, lizhe.67@bytedance.com wrote:
> On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
> > +	while (npage) {
> > +		long nr_pages = 1;
> > +
> > +		if (!is_invalid_reserved_pfn(pfn)) {
> > +			struct page *page = pfn_to_page(pfn);
> > +			struct folio *folio = page_folio(page);
> > +			long folio_pages_num = folio_nr_pages(folio);
> > +
> > +			/*
> > +			 * For a folio, it represents a physically
> > +			 * contiguous set of bytes, and all of its pages
> > +			 * share the same invalid/reserved state.
> > +			 *
> > +			 * Here, our PFNs are contiguous. Therefore, if we
> > +			 * detect that the current PFN belongs to a large
> > +			 * folio, we can batch the operations for the next
> > +			 * nr_pages PFNs.
> > +			 */
> > +			if (folio_pages_num > 1)
> > +				nr_pages = min_t(long, npage,
> > +					folio_pages_num -
> > +					folio_page_idx(folio, page));
> > +
> > +			unpin_user_folio_dirty_locked(folio, nr_pages,
> > +					dma->prot & IOMMU_WRITE);
> 
> Are you suggesting that we should directly call
> unpin_user_page_range_dirty_lock() here (patch 3/3) instead?

I'm saying you should not have the word 'folio' inside the VFIO. You
accumulate a contiguous range of pfns, by only checking the pfn, and
then call 

unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn)...);

No need for any of this. vfio should never look at the struct page
except as the last moment to pass the range.
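With the run length filled in, that is roughly (illustration only;
"first_pfn" and "nr_pages" stand for whatever the accumulation loop ends
up producing):

	unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn), nr_pages,
					 dma->prot & IOMMU_WRITE);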

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:36               ` Jason Gunthorpe
@ 2025-06-18 11:40                 ` David Hildenbrand
  2025-06-18 11:42                   ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-18 11:40 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhe.67
  Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 18.06.25 13:36, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 02:28:20PM +0800, lizhe.67@bytedance.com wrote:
>> On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
>>> +	while (npage) {
>>> +		long nr_pages = 1;
>>> +
>>> +		if (!is_invalid_reserved_pfn(pfn)) {
>>> +			struct page *page = pfn_to_page(pfn);
>>> +			struct folio *folio = page_folio(page);
>>> +			long folio_pages_num = folio_nr_pages(folio);
>>> +
>>> +			/*
>>> +			 * For a folio, it represents a physically
>>> +			 * contiguous set of bytes, and all of its pages
>>> +			 * share the same invalid/reserved state.
>>> +			 *
>>> +			 * Here, our PFNs are contiguous. Therefore, if we
>>> +			 * detect that the current PFN belongs to a large
>>> +			 * folio, we can batch the operations for the next
>>> +			 * nr_pages PFNs.
>>> +			 */
>>> +			if (folio_pages_num > 1)
>>> +				nr_pages = min_t(long, npage,
>>> +					folio_pages_num -
>>> +					folio_page_idx(folio, page));
>>> +
>>> +			unpin_user_folio_dirty_locked(folio, nr_pages,
>>> +					dma->prot & IOMMU_WRITE);
>>
>> Are you suggesting that we should directly call
>> unpin_user_page_range_dirty_lock() here (patch 3/3) instead?
> 
> I'm saying you should not have the word 'folio' inside the VFIO. You
> accumulate a contiguous range of pfns, by only checking the pfn, and
> then call
> 
> unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn)...);
> 
> No need for any of this. vfio should never look at the struct page
> except as the last moment to pass the range.

Hah, agreed, that's actually simpler and there is no need to factor 
anything out.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:40                 ` David Hildenbrand
@ 2025-06-18 11:42                   ` David Hildenbrand
  2025-06-18 11:46                     ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-18 11:42 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhe.67
  Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On 18.06.25 13:40, David Hildenbrand wrote:
> On 18.06.25 13:36, Jason Gunthorpe wrote:
>> On Wed, Jun 18, 2025 at 02:28:20PM +0800, lizhe.67@bytedance.com wrote:
>>> On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
>>>> +	while (npage) {
>>>> +		long nr_pages = 1;
>>>> +
>>>> +		if (!is_invalid_reserved_pfn(pfn)) {
>>>> +			struct page *page = pfn_to_page(pfn);
>>>> +			struct folio *folio = page_folio(page);
>>>> +			long folio_pages_num = folio_nr_pages(folio);
>>>> +
>>>> +			/*
>>>> +			 * For a folio, it represents a physically
>>>> +			 * contiguous set of bytes, and all of its pages
>>>> +			 * share the same invalid/reserved state.
>>>> +			 *
>>>> +			 * Here, our PFNs are contiguous. Therefore, if we
>>>> +			 * detect that the current PFN belongs to a large
>>>> +			 * folio, we can batch the operations for the next
>>>> +			 * nr_pages PFNs.
>>>> +			 */
>>>> +			if (folio_pages_num > 1)
>>>> +				nr_pages = min_t(long, npage,
>>>> +					folio_pages_num -
>>>> +					folio_page_idx(folio, page));
>>>> +
>>>> +			unpin_user_folio_dirty_locked(folio, nr_pages,
>>>> +					dma->prot & IOMMU_WRITE);
>>>
>>> Are you suggesting that we should directly call
>>> unpin_user_page_range_dirty_lock() here (patch 3/3) instead?
>>
>> I'm saying you should not have the word 'folio' inside the VFIO. You
>> accumulate a contiguous range of pfns, by only checking the pfn, and
>> then call
>>
>> unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn)...);
>>
>> No need for any of this. vfio should never look at the struct page
>> except as the last moment to pass the range.
> 
> Hah, agreed, that's actually simpler and there is no need to factor
> anything out.

Ah, no, wait, the problem is that we don't know how many pages we can 
supply, because there might be is_invalid_reserved_pfn() in the range ...

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:42                   ` David Hildenbrand
@ 2025-06-18 11:46                     ` Jason Gunthorpe
  2025-06-18 11:52                       ` David Hildenbrand
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 11:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: lizhe.67, akpm, alex.williamson, kvm, linux-kernel, linux-mm,
	peterx

On Wed, Jun 18, 2025 at 01:42:09PM +0200, David Hildenbrand wrote:
> On 18.06.25 13:40, David Hildenbrand wrote:
> > On 18.06.25 13:36, Jason Gunthorpe wrote:
> > > On Wed, Jun 18, 2025 at 02:28:20PM +0800, lizhe.67@bytedance.com wrote:
> > > > On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
> > > > > +	while (npage) {
> > > > > +		long nr_pages = 1;
> > > > > +
> > > > > +		if (!is_invalid_reserved_pfn(pfn)) {
> > > > > +			struct page *page = pfn_to_page(pfn);
> > > > > +			struct folio *folio = page_folio(page);
> > > > > +			long folio_pages_num = folio_nr_pages(folio);
> > > > > +
> > > > > +			/*
> > > > > +			 * For a folio, it represents a physically
> > > > > +			 * contiguous set of bytes, and all of its pages
> > > > > +			 * share the same invalid/reserved state.
> > > > > +			 *
> > > > > +			 * Here, our PFNs are contiguous. Therefore, if we
> > > > > +			 * detect that the current PFN belongs to a large
> > > > > +			 * folio, we can batch the operations for the next
> > > > > +			 * nr_pages PFNs.
> > > > > +			 */
> > > > > +			if (folio_pages_num > 1)
> > > > > +				nr_pages = min_t(long, npage,
> > > > > +					folio_pages_num -
> > > > > +					folio_page_idx(folio, page));
> > > > > +
> > > > > +			unpin_user_folio_dirty_locked(folio, nr_pages,
> > > > > +					dma->prot & IOMMU_WRITE);
> > > > 
> > > > Are you suggesting that we should directly call
> > > > unpin_user_page_range_dirty_lock() here (patch 3/3) instead?
> > > 
> > > I'm saying you should not have the word 'folio' inside the VFIO. You
> > > accumulate a contiguous range of pfns, by only checking the pfn, and
> > > then call
> > > 
> > > unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn)...);
> > > 
> > > No need for any of this. vfio should never look at the struct page
> > > except as the last moment to pass the range.
> > 
> > Hah, agreed, that's actually simpler and there is no need to factor
> > anything out.
> 
> Ah, no, wait, the problem is that we don't know how many pages we can
> supply, because there might be is_invalid_reserved_pfn() in the range ...

You stop batching when you hit any invalid_reserved_pfn and flush it.

It still has to read back and check every PFN to make sure it is
contiguous; checking reserved too is not a problem.
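
i.e. very roughly (sketch only, accounting and the final structure left
out; is_invalid_reserved_pfn() is the existing type1 helper):

	while (npage) {
		long nr = 1;

		if (is_invalid_reserved_pfn(pfn)) {
			/* nothing was pinned for reserved pfns, just skip them */
			while (nr < npage && is_invalid_reserved_pfn(pfn + nr))
				nr++;
		} else {
			/* extend the run while the pfns stay non-reserved */
			while (nr < npage && !is_invalid_reserved_pfn(pfn + nr))
				nr++;
			/* flush the whole non-reserved run with one call */
			unpin_user_page_range_dirty_lock(pfn_to_page(pfn), nr,
							 dma->prot & IOMMU_WRITE);
		}
		pfn += nr;
		npage -= nr;
	}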

Jason 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:46                     ` Jason Gunthorpe
@ 2025-06-18 11:52                       ` David Hildenbrand
  2025-06-18 11:56                         ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: David Hildenbrand @ 2025-06-18 11:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: lizhe.67, akpm, alex.williamson, kvm, linux-kernel, linux-mm,
	peterx

On 18.06.25 13:46, Jason Gunthorpe wrote:
> On Wed, Jun 18, 2025 at 01:42:09PM +0200, David Hildenbrand wrote:
>> On 18.06.25 13:40, David Hildenbrand wrote:
>>> On 18.06.25 13:36, Jason Gunthorpe wrote:
>>>> On Wed, Jun 18, 2025 at 02:28:20PM +0800, lizhe.67@bytedance.com wrote:
>>>>> On Tue, 17 Jun 2025 12:22:10 -0300, jgg@ziepe.ca wrote:
>>>>>> +	while (npage) {
>>>>>> +		long nr_pages = 1;
>>>>>> +
>>>>>> +		if (!is_invalid_reserved_pfn(pfn)) {
>>>>>> +			struct page *page = pfn_to_page(pfn);
>>>>>> +			struct folio *folio = page_folio(page);
>>>>>> +			long folio_pages_num = folio_nr_pages(folio);
>>>>>> +
>>>>>> +			/*
>>>>>> +			 * For a folio, it represents a physically
>>>>>> +			 * contiguous set of bytes, and all of its pages
>>>>>> +			 * share the same invalid/reserved state.
>>>>>> +			 *
>>>>>> +			 * Here, our PFNs are contiguous. Therefore, if we
>>>>>> +			 * detect that the current PFN belongs to a large
>>>>>> +			 * folio, we can batch the operations for the next
>>>>>> +			 * nr_pages PFNs.
>>>>>> +			 */
>>>>>> +			if (folio_pages_num > 1)
>>>>>> +				nr_pages = min_t(long, npage,
>>>>>> +					folio_pages_num -
>>>>>> +					folio_page_idx(folio, page));
>>>>>> +
>>>>>> +			unpin_user_folio_dirty_locked(folio, nr_pages,
>>>>>> +					dma->prot & IOMMU_WRITE);
>>>>>
>>>>> Are you suggesting that we should directly call
>>>>> unpin_user_page_range_dirty_lock() here (patch 3/3) instead?
>>>>
>>>> I'm saying you should not have the word 'folio' inside the VFIO. You
>>>> accumulate a contiguous range of pfns, by only checking the pfn, and
>>>> then call
>>>>
>>>> unpin_user_page_range_dirty_lock(pfn_to_page(first_pfn)...);
>>>>
>>>> No need for any of this. vfio should never look at the struct page
>>>> except as the last moment to pass the range.
>>>
>>> Hah, agreed, that's actually simpler and there is no need to factor
>>> anything out.
>>
>> Ah, no, wait, the problem is that we don't know how many pages we can
>> supply, because there might be is_invalid_reserved_pfn() in the range ...
> 
> You stop batching when you hit any invalid_reserved_pfn and flush it.
> 
> It still has to read back and check every PFN to make sure it is
> contiguous; checking reserved too is not a problem.

I thought we also wanted to optimize out the is_invalid_reserved_pfn() 
check for each subpage of a folio.

pfn_valid() + pfn_to_page() are not super cheap in some relevant configs 
IIRC.


-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:52                       ` David Hildenbrand
@ 2025-06-18 11:56                         ` Jason Gunthorpe
  2025-06-18 12:19                           ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 11:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: lizhe.67, akpm, alex.williamson, kvm, linux-kernel, linux-mm,
	peterx

On Wed, Jun 18, 2025 at 01:52:37PM +0200, David Hildenbrand wrote:

> I thought we also wanted to optimize out the
> is_invalid_reserved_pfn() check for each subpage of a folio.

VFIO keeps a tracking structure for the ranges, you can record there
if a reserved PFN was ever placed into this range and skip the check
entirely.

It would be very rare for reserved and non-reserved PFNs to be
mixed within the same range; userspace could cause this, but nothing
should.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 11:56                         ` Jason Gunthorpe
@ 2025-06-18 12:19                           ` lizhe.67
  2025-06-18 13:23                             ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-18 12:19 UTC (permalink / raw)
  To: jgg, david
  Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Wed, 18 Jun 2025 08:56:22 -0300, jgg@ziepe.ca wrote:
 
> On Wed, Jun 18, 2025 at 01:52:37PM +0200, David Hildenbrand wrote:
> 
> > I thought we also wanted to optimize out the
> > is_invalid_reserved_pfn() check for each subpage of a folio.

Yes, that is an important aspect of our optimization.

> VFIO keeps a tracking structure for the ranges, you can record there
> if a reserved PFN was ever placed into this range and skip the check
> entirely.
> 
> It would be very rare for reserved and non-reserved PFNs to be
> mixed within the same range; userspace could cause this, but nothing
> should.

Yes, but it seems we don't have a very straightforward interface to
obtain the reserved attribute of this large range of pfns. Moreover,
this implies that we need to move the logic of the
is_invalid_reserved_pfn() check somewhere else. I'm not sure if
this is necessary.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 12:19                           ` lizhe.67
@ 2025-06-18 13:23                             ` Jason Gunthorpe
  2025-06-19  9:05                               ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-18 13:23 UTC (permalink / raw)
  To: lizhe.67
  Cc: david, akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On Wed, Jun 18, 2025 at 08:19:28PM +0800, lizhe.67@bytedance.com wrote:
> On Wed, 18 Jun 2025 08:56:22 -0300, jgg@ziepe.ca wrote:
>  
> > On Wed, Jun 18, 2025 at 01:52:37PM +0200, David Hildenbrand wrote:
> > 
> > > I thought we also wanted to optimize out the
> > > is_invalid_reserved_pfn() check for each subpage of a folio.
> 
> Yes, that is an important aspect of our optimization.
> 
> > VFIO keeps a tracking structure for the ranges, you can record there
> > if a reserved PFN was ever placed into this range and skip the check
> > entirely.
> > 
> > It would be very rare for reserved and non-reserved PFNs to be
> > mixed within the same range; userspace could cause this, but nothing
> > should.
> 
> Yes, but it seems we don't have a very straightforward interface to
> obtain the reserved attribute of this large range of pfns.

vfio_unmap_unpin()  has the struct vfio_dma, you'd store the
indication there and pass it down.

It already builds the longest run of physical contiguity here:

		for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
			next = iommu_iova_to_phys(domain->domain, iova + len);
			if (next != phys + len)
				break;
		}

And we pass down a physically contiguous range to
unmap_unpin_fast()/unmap_unpin_slow().

The only thing you need to do is to detect reserved pfns in
vfio_unmap_unpin(), using the optimized flag in the dma, and break up
the above loop if it crosses a reserved boundary.

If you have a reserved range then just directly call iommu_unmap and
forget about any page pinning.

Then in the page pinning side you use the range version.

Something very approximately like the below. But again, I would
implore you to just use iommufd that is already much better here.

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 1136d7ac6b597e..097b97c67e3f0d 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -738,12 +738,13 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
 	long unlocked = 0, locked = 0;
 	long i;
 
+	/* The caller has already ensured the pfn range is not reserved */
+	unpin_user_page_range_dirty_lock(pfn_to_page(pfn), npage,
+					 dma->prot & IOMMU_WRITE);
 	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
-		if (put_pfn(pfn++, dma->prot)) {
 			unlocked++;
 			if (vfio_find_vpfn(dma, iova))
 				locked++;
-		}
 	}
 
 	if (do_accounting)
@@ -1082,6 +1083,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 	while (iova < end) {
 		size_t unmapped, len;
 		phys_addr_t phys, next;
+		bool reserved = false;
 
 		phys = iommu_iova_to_phys(domain->domain, iova);
 		if (WARN_ON(!phys)) {
@@ -1089,6 +1091,9 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			continue;
 		}
 
+		if (dma->has_reserved)
+			reserved = is_invalid_reserved_pfn(phys >> PAGE_SHIFT);
+
 		/*
 		 * To optimize for fewer iommu_unmap() calls, each of which
 		 * may require hardware cache flushing, try to find the
@@ -1098,21 +1103,31 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
 			next = iommu_iova_to_phys(domain->domain, iova + len);
 			if (next != phys + len)
 				break;
+			if (dma->has_reserved &&
+			    reserved != is_invalid_reserved_pfn(next >> PAGE_SHIFT))
+				break;
 		}
 
 		/*
 		 * First, try to use fast unmap/unpin. In case of failure,
 		 * switch to slow unmap/unpin path.
 		 */
-		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
-					    &unlocked, &unmapped_region_list,
-					    &unmapped_region_cnt,
-					    &iotlb_gather);
-		if (!unmapped) {
-			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
-						    phys, &unlocked);
-			if (WARN_ON(!unmapped))
-				break;
+		if (reserved) {
+			unmapped = iommu_unmap(domain->domain, iova, len);
+			*iova += unmapped;
+		} else {
+			unmapped = unmap_unpin_fast(domain, dma, &iova, len,
+						    phys, &unlocked,
+						    &unmapped_region_list,
+						    &unmapped_region_cnt,
+						    &iotlb_gather);
+			if (!unmapped) {
+				unmapped = unmap_unpin_slow(domain, dma, &iova,
+							    len, phys,
+							    &unlocked);
+				if (WARN_ON(!unmapped))
+					break;
+			}
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-18 13:23                             ` Jason Gunthorpe
@ 2025-06-19  9:05                               ` lizhe.67
  2025-06-19 12:35                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 34+ messages in thread
From: lizhe.67 @ 2025-06-19  9:05 UTC (permalink / raw)
  To: jgg, david
  Cc: akpm, alex.williamson, kvm, linux-kernel, linux-mm, lizhe.67,
	peterx

On Wed, 18 Jun 2025 10:23:50 -0300, jgg@ziepe.ca wrote:
 
> On Wed, Jun 18, 2025 at 08:19:28PM +0800, lizhe.67@bytedance.com wrote:
> > On Wed, 18 Jun 2025 08:56:22 -0300, jgg@ziepe.ca wrote:
> >  
> > > On Wed, Jun 18, 2025 at 01:52:37PM +0200, David Hildenbrand wrote:
> > > 
> > > > I thought we also wanted to optimize out the
> > > > is_invalid_reserved_pfn() check for each subpage of a folio.
> > 
> > Yes, that is an important aspect of our optimization.
> > 
> > > VFIO keeps a tracking structure for the ranges, you can record there
> > > if a reserved PFN was ever placed into this range and skip the check
> > > entirely.
> > > 
> > > It would be very rare for reserved and non-reserved PFNs to be
> > > mixed within the same range; userspace could cause this, but nothing
> > > should.
> > 
> > Yes, but it seems we don't have a very straightforward interface to
> > obtain the reserved attribute of this large range of pfns.
> 
> vfio_unmap_unpin()  has the struct vfio_dma, you'd store the
> indication there and pass it down.
> 
> It already builds the longest run of physical contiguity here:
> 
> 		for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
> 			next = iommu_iova_to_phys(domain->domain, iova + len);
> 			if (next != phys + len)
> 				break;
> 		}
> 
> And we pass down a physically contiguous range to
> unmap_unpin_fast()/unmap_unpin_slow().
> 
> The only thing you need to do is to detect reserved pfns in
> vfio_unmap_unpin(), using the optimized flag in the dma, and break up
> the above loop if it crosses a reserved boundary.
> 
> If you have a reserved range then just directly call iommu_unmap and
> forget about any page pinning.
> 
> Then in the page pinning side you use the range version.
> 
> Something very approximately like the below. But again, I would
> implore you to just use iommufd that is already much better here.

Thank you for your suggestion. We are also working on this, but
it is not something that can be completed in a short time. In
the near term, we will still need to use the type1 method.

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 1136d7ac6b597e..097b97c67e3f0d 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -738,12 +738,13 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>  	long unlocked = 0, locked = 0;
>  	long i;
>  
> +	/* The caller has already ensured the pfn range is not reserved */
> +	unpin_user_page_range_dirty_lock(pfn_to_page(pfn), npage,
> +					 dma->prot & IOMMU_WRITE);
>  	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> -		if (put_pfn(pfn++, dma->prot)) {
>  			unlocked++;
>  			if (vfio_find_vpfn(dma, iova))
>  				locked++;
> -		}
>  	}
>  
>  	if (do_accounting)
> @@ -1082,6 +1083,7 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  	while (iova < end) {
>  		size_t unmapped, len;
>  		phys_addr_t phys, next;
> +		bool reserved = false;
>  
>  		phys = iommu_iova_to_phys(domain->domain, iova);
>  		if (WARN_ON(!phys)) {
> @@ -1089,6 +1091,9 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			continue;
>  		}
>  
> +		if (dma->has_reserved)
> +			reserved = is_invalid_reserved_pfn(phys >> PAGE_SHIFT);
> +
>  		/*
>  		 * To optimize for fewer iommu_unmap() calls, each of which
>  		 * may require hardware cache flushing, try to find the
> @@ -1098,21 +1103,31 @@ static long vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
>  			next = iommu_iova_to_phys(domain->domain, iova + len);
>  			if (next != phys + len)
>  				break;
> +			if (dma->has_reserved &&
> +			    reserved != is_invalid_reserved_pfn(next >> PAGE_SHIFT))
> +				break;
>  		}
>  
>  		/*
>  		 * First, try to use fast unmap/unpin. In case of failure,
>  		 * switch to slow unmap/unpin path.
>  		 */
> -		unmapped = unmap_unpin_fast(domain, dma, &iova, len, phys,
> -					    &unlocked, &unmapped_region_list,
> -					    &unmapped_region_cnt,
> -					    &iotlb_gather);
> -		if (!unmapped) {
> -			unmapped = unmap_unpin_slow(domain, dma, &iova, len,
> -						    phys, &unlocked);
> -			if (WARN_ON(!unmapped))
> -				break;
> +		if (reserved) {
> +			unmapped = iommu_unmap(domain->domain, iova, len);
> +			*iova += unmapped;
> +		} else {
> +			unmapped = unmap_unpin_fast(domain, dma, &iova, len,
> +						    phys, &unlocked,
> +						    &unmapped_region_list,
> +						    &unmapped_region_cnt,
> +						    &iotlb_gather);
> +			if (!unmapped) {
> +				unmapped = unmap_unpin_slow(domain, dma, &iova,
> +							    len, phys,
> +							    &unlocked);
> +				if (WARN_ON(!unmapped))
> +					break;
> +			}
>  		}
>  	}

As I understand it, there seem to be some issues with this
implementation. How can we obtain the value of dma->has_reserved
(acquiring it within vfio_pin_pages_remote() might be a good option)
and ensure that this value remains unchanged from the time of
assignment until we perform the unpin operation? I've searched
through the code and it appears that there are instances where
SetPageReserved() is called outside of the initialization phase.
Please correct me if I am wrong.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-19  9:05                               ` lizhe.67
@ 2025-06-19 12:35                                 ` Jason Gunthorpe
  2025-06-19 12:49                                   ` lizhe.67
  0 siblings, 1 reply; 34+ messages in thread
From: Jason Gunthorpe @ 2025-06-19 12:35 UTC (permalink / raw)
  To: lizhe.67
  Cc: david, akpm, alex.williamson, kvm, linux-kernel, linux-mm, peterx

On Thu, Jun 19, 2025 at 05:05:42PM +0800, lizhe.67@bytedance.com wrote:

> As I understand it, there seem to be some issues with this
> implementation. How can we obtain the value of dma->has_reserved
> (acquiring it within vfio_pin_pages_remote() might be a good option)

Yes, you record it during vfio_pin_pages operations. If VFIO calls
iommu_map on something that went down the non-GUP path then it sets
the flag.
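
Roughly (sketch only -- has_reserved is the field from my earlier diff,
it does not exist today):

	/*
	 * In vfio_pin_pages_remote(), wherever a pfn headed for iommu_map
	 * is found to be invalid/reserved, latch the flag on the dma:
	 */
	if (is_invalid_reserved_pfn(pfn))
		dma->has_reserved = true;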

> and ensure that this value remains unchanged from the time of
> assignment until we perform the unpin operation? 

Map/unmap are paired and not allowed to race so that isn't an issue.

> I've searched through the code and it appears that there are
> instances where SetPageReserved() is called outside of the
> initialization phase.  Please correct me if I am wrong.

It should not be relevant here. Pages in use by VFIO are not
permitted to change in that way; it would break things.

Jason

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked()
  2025-06-19 12:35                                 ` Jason Gunthorpe
@ 2025-06-19 12:49                                   ` lizhe.67
  0 siblings, 0 replies; 34+ messages in thread
From: lizhe.67 @ 2025-06-19 12:49 UTC (permalink / raw)
  To: jgg
  Cc: akpm, alex.williamson, david, kvm, linux-kernel, linux-mm,
	lizhe.67, peterx

On Thu, 19 Jun 2025 09:35:04 -0300, jgg@ziepe.ca wrote:
 
> On Thu, Jun 19, 2025 at 05:05:42PM +0800, lizhe.67@bytedance.com wrote:
> 
> > As I understand it, there seem to be some issues with this
> > implementation. How can we obtain the value of dma->has_reserved
> > (acquiring it within vfio_pin_pages_remote() might be a good option)
> 
> Yes, you record it during vfio_pin_pages operations. If VFIO calls
> iommu_map on something that went down the non-GUP path then it sets
> the flag.
> 
> > and ensure that this value remains unchanged from the time of
> > assignment until we perform the unpin operation? 
> 
> Map/unmap are paired and not allowed to race so that isn't an issue.
> 
> > I've searched through the code and it appears that there are
> > instances where SetPageReserved() is called outside of the
> > initialization phase.  Please correct me if I am wrong.
> 
> It should not be relevant here. Pages in use by VFIO are not
> permitted to change in that way; it would break things.

Then this approach appears to be fine, and there’s no need to
introduce any new interfaces. All modifications remain internal to
vfio. I’ll send out a v5 patch based on this approach.

Thanks,
Zhe

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2025-06-19 12:49 UTC | newest]

Thread overview: 34+ messages
2025-06-17  4:18 [PATCH v4 0/3] optimize vfio_unpin_pages_remote() for large folio lizhe.67
2025-06-17  4:18 ` [PATCH v4 1/3] vfio/type1: batch vfio_find_vpfn() in function vfio_unpin_pages_remote() lizhe.67
2025-06-17  4:18 ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
2025-06-17  7:35   ` David Hildenbrand
2025-06-17 13:42   ` Jason Gunthorpe
2025-06-17 13:45     ` David Hildenbrand
2025-06-17 13:58       ` David Hildenbrand
2025-06-17 14:04         ` David Hildenbrand
2025-06-17 15:22           ` Jason Gunthorpe
2025-06-18  6:28             ` lizhe.67
2025-06-18  8:20               ` David Hildenbrand
2025-06-18 11:36               ` Jason Gunthorpe
2025-06-18 11:40                 ` David Hildenbrand
2025-06-18 11:42                   ` David Hildenbrand
2025-06-18 11:46                     ` Jason Gunthorpe
2025-06-18 11:52                       ` David Hildenbrand
2025-06-18 11:56                         ` Jason Gunthorpe
2025-06-18 12:19                           ` lizhe.67
2025-06-18 13:23                             ` Jason Gunthorpe
2025-06-19  9:05                               ` lizhe.67
2025-06-19 12:35                                 ` Jason Gunthorpe
2025-06-19 12:49                                   ` lizhe.67
2025-06-17  4:18 ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
2025-06-17  7:43   ` David Hildenbrand
2025-06-17  9:21     ` [PATCH v4 2/3] gup: introduce unpin_user_folio_dirty_locked() lizhe.67
2025-06-17  9:27       ` David Hildenbrand
2025-06-17  9:47         ` [PATCH v4 3/3] vfio/type1: optimize vfio_unpin_pages_remote() for large folio lizhe.67
2025-06-17  9:49           ` David Hildenbrand
2025-06-17 12:42             ` lizhe.67
2025-06-17 13:47               ` David Hildenbrand
2025-06-18  6:11                 ` lizhe.67
2025-06-18  7:22                   ` lizhe.67
2025-06-18  8:54                   ` David Hildenbrand
2025-06-18  9:39                     ` lizhe.67
