linux-kernel.vger.kernel.org archive mirror
* [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios
@ 2025-05-21  4:25 lizhe.67
  2025-05-21  6:36 ` David Hildenbrand
  2025-05-21 19:17 ` Alex Williamson
  0 siblings, 2 replies; 13+ messages in thread
From: lizhe.67 @ 2025-05-21  4:25 UTC (permalink / raw)
  To: alex.williamson; +Cc: david, kvm, linux-kernel, muchun.song, peterx, lizhe.67

From: Li Zhe <lizhe.67@bytedance.com>

When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of pages.

This patch optimizes this process by batching the statistics counting
operations.

The performance test results for completing the 8G VFIO IOMMU DMA mapping,
obtained through trace-cmd, are as follows. In this case, the 8G virtual
address space has been mapped to physical memory using hugetlbfs with
pagesize=2M.

Before this patch:
funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();

After this patch:
funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
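
For reference, a minimal userspace sketch of the kind of hugetlbfs-backed
buffer described above (the mount point, file name and page reservation are
assumptions; the actual test harness is not part of this patch):

#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SZ_8G	(8ULL << 30)

int main(void)
{
	/* Assumes a 2M hugetlbfs mount at /dev/hugepages with enough pages reserved. */
	int fd = open("/dev/hugepages/vfio-test", O_CREAT | O_RDWR, 0600);
	void *buf;

	if (fd < 0 || ftruncate(fd, SZ_8G)) {
		perror("hugetlbfs setup");
		return 1;
	}
	buf = mmap(NULL, SZ_8G, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0, SZ_8G);	/* fault in the 2M pages before VFIO_IOMMU_MAP_DMA */
	munmap(buf, SZ_8G);
	close(fd);
	return 0;
}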

Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
Changelogs:

v3->v4:
- Use min_t() to obtain the step size, rather than min().
- Fix some issues in commit message and title.

v2->v3:
- Code simplification.
- Fix some issues in comments.

v1->v2:
- Fix some issues in comments and formatting.
- Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
- Move the processing logic for hugetlbfs folios into the while(true) loop
  and use a variable with a default value of 1 to indicate the number of
  consecutive pages.

v3 patch: https://lore.kernel.org/all/20250520070020.6181-1-lizhe.67@bytedance.com/
v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/

 drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 11 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0ac56072af9f..bd46ed9361fe 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
 /*
  * Helper Functions for host iova-pfn list
  */
-static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+
+/*
+ * Find the first vfio_pfn that overlaps the range
+ * [iova, iova + PAGE_SIZE * npage) in the rb tree.
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+		dma_addr_t iova, unsigned long npage)
 {
 	struct vfio_pfn *vpfn;
 	struct rb_node *node = dma->pfn_list.rb_node;
+	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
 
 	while (node) {
 		vpfn = rb_entry(node, struct vfio_pfn, node);
 
-		if (iova < vpfn->iova)
+		if (end_iova <= vpfn->iova)
 			node = node->rb_left;
 		else if (iova > vpfn->iova)
 			node = node->rb_right;
@@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	return NULL;
 }
 
+static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+{
+	return vfio_find_vpfn_range(dma, iova, 1);
+}
+
 static void vfio_link_pfn(struct vfio_dma *dma,
 			  struct vfio_pfn *new)
 {
@@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 		 * and rsvd here, and therefore continues to use the batch.
 		 */
 		while (true) {
+			struct folio *folio = page_folio(batch->pages[batch->offset]);
+			long nr_pages;
+
 			if (pfn != *pfn_base + pinned ||
 			    rsvd != is_invalid_reserved_pfn(pfn))
 				goto out;
 
+			/*
+			 * Note: The current nr_pages does not achieve the optimal
+			 * performance in scenarios where folio_nr_pages() exceeds
+			 * batch->capacity. It is anticipated that future enhancements
+			 * will address this limitation.
+			 */
+			nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
+						folio_page_idx(folio, batch->pages[batch->offset]));
+			if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
+				nr_pages = 1;
+
 			/*
 			 * Reserved pages aren't counted against the user,
 			 * externally pinned pages are already counted against
 			 * the user.
 			 */
-			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
+			if (!rsvd && (nr_pages > 1 || !vfio_find_vpfn(dma, iova))) {
 				if (!dma->lock_cap &&
-				    mm->locked_vm + lock_acct + 1 > limit) {
+				    mm->locked_vm + lock_acct + nr_pages > limit) {
 					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 						__func__, limit << PAGE_SHIFT);
 					ret = -ENOMEM;
 					goto unpin_out;
 				}
-				lock_acct++;
+				lock_acct += nr_pages;
 			}
 
-			pinned++;
-			npage--;
-			vaddr += PAGE_SIZE;
-			iova += PAGE_SIZE;
-			batch->offset++;
-			batch->size--;
+			pinned += nr_pages;
+			npage -= nr_pages;
+			vaddr += PAGE_SIZE * nr_pages;
+			iova += PAGE_SIZE * nr_pages;
+			batch->offset += nr_pages;
+			batch->size -= nr_pages;
 
 			if (!batch->size)
 				break;
-- 
2.20.1
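
An aside on the rb-tree change above: each vfio_pfn covers exactly one page,
so the new comparisons treat the node as the interval
[vpfn->iova, vpfn->iova + PAGE_SIZE) and the query as
[iova, iova + PAGE_SIZE * npage). A small self-contained sketch of the overlap
test those comparisons implement (illustrative userspace C with made-up
values; it assumes page-aligned iovas, as the caller guarantees):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Overlap between the query [start, end) and the single page at node_iova. */
static bool overlaps(uint64_t start, uint64_t end, uint64_t node_iova)
{
	/* end <= node_iova  : query lies entirely before the node -> go left    */
	/* start > node_iova : query starts past the node's one page -> go right */
	return !(end <= node_iova) && !(start > node_iova);
}

int main(void)
{
	uint64_t node = 0x100000;	/* a vpfn tracked at iova 1M */

	assert(!overlaps(node - 2 * PAGE_SIZE, node, node));		/* ends at the node */
	assert(overlaps(node, node + PAGE_SIZE, node));			/* exactly that page */
	assert(overlaps(node - PAGE_SIZE, node + 2 * PAGE_SIZE, node));	/* straddles it */
	assert(!overlaps(node + PAGE_SIZE, node + 4 * PAGE_SIZE, node));	/* starts after it */
	return 0;
}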


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios
  2025-05-21  4:25 [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios lizhe.67
@ 2025-05-21  6:36 ` David Hildenbrand
  2025-05-21 19:17 ` Alex Williamson
  1 sibling, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2025-05-21  6:36 UTC (permalink / raw)
  To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, muchun.song, peterx

On 21.05.25 06:25, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
> 
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
> 
> This patch optimize this process by batching the statistics counting
> operations.
> 
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
> 
> Before this patch:
> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> 
> After this patch:
> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> Changelogs:
> 
> v3->v4:
> - Use min_t() to obtain the step size, rather than min().
> - Fix some issues in commit message and title.

It's usually a good idea to wait with resubmissions until the 
discussions on the previous version have ended.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios
  2025-05-21  4:25 [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios lizhe.67
  2025-05-21  6:36 ` David Hildenbrand
@ 2025-05-21 19:17 ` Alex Williamson
  2025-05-22  3:49   ` [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio lizhe.67
  1 sibling, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2025-05-21 19:17 UTC (permalink / raw)
  To: lizhe.67; +Cc: david, kvm, linux-kernel, muchun.song, peterx

On Wed, 21 May 2025 12:25:07 +0800
lizhe.67@bytedance.com wrote:

> From: Li Zhe <lizhe.67@bytedance.com>
> 
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
> 
> This patch optimize this process by batching the statistics counting
> operations.
> 
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
> 
> Before this patch:
> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> 
> After this patch:
> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---

Given the discussion on v3, this is currently a Nak.  Follow-up in that
thread if there are further ideas how to salvage this.  Thanks,

Alex

> Changelogs:
> 
> v3->v4:
> - Use min_t() to obtain the step size, rather than min().
> - Fix some issues in commit message and title.
> 
> v2->v3:
> - Code simplification.
> - Fix some issues in comments.
> 
> v1->v2:
> - Fix some issues in comments and formatting.
> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> - Move the processing logic for hugetlbfs folio into the while(true) loop
>   and use a variable with a default value of 1 to indicate the number of
>   consecutive pages.
> 
> v3 patch: https://lore.kernel.org/all/20250520070020.6181-1-lizhe.67@bytedance.com/
> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
> 
>  drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
>  1 file changed, 37 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..bd46ed9361fe 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +
> +/*
> + * Find the first vfio_pfn that overlapping the range
> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> +		dma_addr_t iova, unsigned long npage)
>  {
>  	struct vfio_pfn *vpfn;
>  	struct rb_node *node = dma->pfn_list.rb_node;
> +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>  
>  	while (node) {
>  		vpfn = rb_entry(node, struct vfio_pfn, node);
>  
> -		if (iova < vpfn->iova)
> +		if (end_iova <= vpfn->iova)
>  			node = node->rb_left;
>  		else if (iova > vpfn->iova)
>  			node = node->rb_right;
> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	return NULL;
>  }
>  
> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +{
> +	return vfio_find_vpfn_range(dma, iova, 1);
> +}
> +
>  static void vfio_link_pfn(struct vfio_dma *dma,
>  			  struct vfio_pfn *new)
>  {
> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  		 * and rsvd here, and therefore continues to use the batch.
>  		 */
>  		while (true) {
> +			struct folio *folio = page_folio(batch->pages[batch->offset]);
> +			long nr_pages;
> +
>  			if (pfn != *pfn_base + pinned ||
>  			    rsvd != is_invalid_reserved_pfn(pfn))
>  				goto out;
>  
> +			/*
> +			 * Note: The current nr_pages does not achieve the optimal
> +			 * performance in scenarios where folio_nr_pages() exceeds
> +			 * batch->capacity. It is anticipated that future enhancements
> +			 * will address this limitation.
> +			 */
> +			nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
> +						folio_page_idx(folio, batch->pages[batch->offset]));
> +			if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> +				nr_pages = 1;
> +
>  			/*
>  			 * Reserved pages aren't counted against the user,
>  			 * externally pinned pages are already counted against
>  			 * the user.
>  			 */
> -			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> +			if (!rsvd && (nr_pages > 1 || !vfio_find_vpfn(dma, iova))) {
>  				if (!dma->lock_cap &&
> -				    mm->locked_vm + lock_acct + 1 > limit) {
> +				    mm->locked_vm + lock_acct + nr_pages > limit) {
>  					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>  						__func__, limit << PAGE_SHIFT);
>  					ret = -ENOMEM;
>  					goto unpin_out;
>  				}
> -				lock_acct++;
> +				lock_acct += nr_pages;
>  			}
>  
> -			pinned++;
> -			npage--;
> -			vaddr += PAGE_SIZE;
> -			iova += PAGE_SIZE;
> -			batch->offset++;
> -			batch->size--;
> +			pinned += nr_pages;
> +			npage -= nr_pages;
> +			vaddr += PAGE_SIZE * nr_pages;
> +			iova += PAGE_SIZE * nr_pages;
> +			batch->offset += nr_pages;
> +			batch->size -= nr_pages;
>  
>  			if (!batch->size)
>  				break;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-21 19:17 ` Alex Williamson
@ 2025-05-22  3:49   ` lizhe.67
  2025-05-22  7:22     ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: lizhe.67 @ 2025-05-22  3:49 UTC (permalink / raw)
  To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song, peterx

On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:

>> From: Li Zhe <lizhe.67@bytedance.com>
>> 
>> When vfio_pin_pages_remote() is called with a range of addresses that
>> includes large folios, the function currently performs individual
>> statistics counting operations for each page. This can lead to significant
>> performance overheads, especially when dealing with large ranges of pages.
>> 
>> This patch optimize this process by batching the statistics counting
>> operations.
>> 
>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>> address space has been mapped to physical memory using hugetlbfs with
>> pagesize=2M.
>> 
>> Before this patch:
>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
>> 
>> After this patch:
>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
>> 
>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>
>Given the discussion on v3, this is currently a Nak.  Follow-up in that
>thread if there are further ideas how to salvage this.  Thanks,

How about considering the solution David mentioned to check whether the
pages or PFNs are actually consecutive?

I have made a preliminary attempt, and performance testing shows that
the time consumption is approximately 18,000 microseconds. Compared to
the previous 33,000 microseconds, this also represents a significant
improvement.

The modification is quite straightforward. The code below reflects the
changes I have made based on this patch.

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bd46ed9361fe..1cc1f76d4020 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
        return ret;
 }
 
+static inline long continuous_page_num(struct vfio_batch *batch, long npage)
+{
+       long i;
+       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
+
+       for (i = 1; i < npage; ++i) {
+               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
+                       break;
+               next_pfn++;
+       }
+       return i;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -708,8 +721,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
                         */
                        nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
                                                folio_page_idx(folio, batch->pages[batch->offset]));
-                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
-                               nr_pages = 1;
+                       if (nr_pages > 1) {
+                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
+                                       nr_pages = 1;
+                               else
+                                       nr_pages = continuous_page_num(batch, nr_pages);
+                       }
 
                        /*
                         * Reserved pages aren't counted against the user,

Thanks,
Zhe

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-22  3:49   ` [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio lizhe.67
@ 2025-05-22  7:22     ` David Hildenbrand
  2025-05-22  8:25       ` lizhe.67
  0 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2025-05-22  7:22 UTC (permalink / raw)
  To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, muchun.song, peterx

On 22.05.25 05:49, lizhe.67@bytedance.com wrote:
> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
> 
>>> From: Li Zhe <lizhe.67@bytedance.com>
>>>
>>> When vfio_pin_pages_remote() is called with a range of addresses that
>>> includes large folios, the function currently performs individual
>>> statistics counting operations for each page. This can lead to significant
>>> performance overheads, especially when dealing with large ranges of pages.
>>>
>>> This patch optimize this process by batching the statistics counting
>>> operations.
>>>
>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>>> address space has been mapped to physical memory using hugetlbfs with
>>> pagesize=2M.
>>>
>>> Before this patch:
>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
>>>
>>> After this patch:
>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
>>>
>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>
>> Given the discussion on v3, this is currently a Nak.  Follow-up in that
>> thread if there are further ideas how to salvage this.  Thanks,
> 
> How about considering the solution David mentioned to check whether the
> pages or PFNs are actually consecutive?
> 
> I have conducted a preliminary attempt, and the performance testing
> revealed that the time consumption is approximately 18,000 microseconds.
> Compared to the previous 33,000 microseconds, this also represents a
> significant improvement.
> 
> The modification is quite straightforward. The code below reflects the
> changes I have made based on this patch.
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index bd46ed9361fe..1cc1f76d4020 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>          return ret;
>   }
>   
> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> +{
> +       long i;
> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> +
> +       for (i = 1; i < npage; ++i) {
> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> +                       break;
> +               next_pfn++;
> +       }
> +       return i;
> +}


What might be faster is obtaining the folio, then calculating the 
next expected page pointer and comparing whether the page pointers match.

Essentially, using folio_page() to calculate the expected next page.

nth_page() is simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP, 
so that might be rather fast.


So we'd obtain

start_idx = folio_idx(folio, batch->pages[batch->offset]);

and then check for

batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-22  7:22     ` David Hildenbrand
@ 2025-05-22  8:25       ` lizhe.67
  2025-05-22 20:52         ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: lizhe.67 @ 2025-05-22  8:25 UTC (permalink / raw)
  To: david; +Cc: alex.williamson, kvm, linux-kernel, lizhe.67, muchun.song, peterx

On Thu, 22 May 2025 09:22:50 +0200, david@redhat.com wrote:

>On 22.05.25 05:49, lizhe.67@bytedance.com wrote:
>> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
>> 
>>>> From: Li Zhe <lizhe.67@bytedance.com>
>>>>
>>>> When vfio_pin_pages_remote() is called with a range of addresses that
>>>> includes large folios, the function currently performs individual
>>>> statistics counting operations for each page. This can lead to significant
>>>> performance overheads, especially when dealing with large ranges of pages.
>>>>
>>>> This patch optimize this process by batching the statistics counting
>>>> operations.
>>>>
>>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>>>> address space has been mapped to physical memory using hugetlbfs with
>>>> pagesize=2M.
>>>>
>>>> Before this patch:
>>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
>>>>
>>>> After this patch:
>>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
>>>>
>>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>> ---
>>>
>>> Given the discussion on v3, this is currently a Nak.  Follow-up in that
>>> thread if there are further ideas how to salvage this.  Thanks,
>> 
>> How about considering the solution David mentioned to check whether the
>> pages or PFNs are actually consecutive?
>> 
>> I have conducted a preliminary attempt, and the performance testing
>> revealed that the time consumption is approximately 18,000 microseconds.
>> Compared to the previous 33,000 microseconds, this also represents a
>> significant improvement.
>> 
>> The modification is quite straightforward. The code below reflects the
>> changes I have made based on this patch.
>> 
>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>> index bd46ed9361fe..1cc1f76d4020 100644
>> --- a/drivers/vfio/vfio_iommu_type1.c
>> +++ b/drivers/vfio/vfio_iommu_type1.c
>> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>>          return ret;
>>   }
>>   
>> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
>> +{
>> +       long i;
>> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
>> +
>> +       for (i = 1; i < npage; ++i) {
>> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
>> +                       break;
>> +               next_pfn++;
>> +       }
>> +       return i;
>> +}
>
>
>What might be faster is obtaining the folio, and then calculating the 
>next expected page pointer, comparing whether the page pointers match.
>
>Essentially, using folio_page() to calculate the expected next page.
>
>nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP, 
>so that might be rather fast.
>
>
>So we'd obtain
>
>start_idx = folio_idx(folio, batch->pages[batch->offset]);

Do you mean using folio_page_idx()?

>and then check for
>
>batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)

Thank you for your reminder. This is indeed a better solution.
The updated code might look like this:

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index bd46ed9361fe..f9a11b1d8433 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
        return ret;
 }
 
+static inline long continuous_pages_num(struct folio *folio,
+               struct vfio_batch *batch, long npage)
+{
+       long i;
+       unsigned long start_idx =
+                       folio_page_idx(folio, batch->pages[batch->offset]);
+
+       for (i = 1; i < npage; ++i)
+               if (batch->pages[batch->offset + i] !=
+                               folio_page(folio, start_idx + i))
+                       break;
+       return i;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
                         */
                        nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
                                                folio_page_idx(folio, batch->pages[batch->offset]));
-                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
-                               nr_pages = 1;
+                       if (nr_pages > 1) {
+                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
+                                       nr_pages = 1;
+                               else
+                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
+                       }
 
                        /*
                         * Reserved pages aren't counted against the user,

Thanks,
Zhe

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-22  8:25       ` lizhe.67
@ 2025-05-22 20:52         ` Alex Williamson
  2025-05-23  3:42           ` lizhe.67
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2025-05-22 20:52 UTC (permalink / raw)
  To: lizhe.67; +Cc: david, kvm, linux-kernel, muchun.song, peterx

On Thu, 22 May 2025 16:25:24 +0800
lizhe.67@bytedance.com wrote:

> On Thu, 22 May 2025 09:22:50 +0200, david@redhat.com wrote:
> 
> >On 22.05.25 05:49, lizhe.67@bytedance.com wrote:  
> >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
> >>   
> >>>> From: Li Zhe <lizhe.67@bytedance.com>
> >>>>
> >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> >>>> includes large folios, the function currently performs individual
> >>>> statistics counting operations for each page. This can lead to significant
> >>>> performance overheads, especially when dealing with large ranges of pages.
> >>>>
> >>>> This patch optimize this process by batching the statistics counting
> >>>> operations.
> >>>>
> >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> >>>> address space has been mapped to physical memory using hugetlbfs with
> >>>> pagesize=2M.
> >>>>
> >>>> Before this patch:
> >>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> >>>>
> >>>> After this patch:
> >>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> >>>>
> >>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> >>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>>> ---  
> >>>
> >>> Given the discussion on v3, this is currently a Nak.  Follow-up in that
> >>> thread if there are further ideas how to salvage this.  Thanks,  
> >> 
> >> How about considering the solution David mentioned to check whether the
> >> pages or PFNs are actually consecutive?
> >> 
> >> I have conducted a preliminary attempt, and the performance testing
> >> revealed that the time consumption is approximately 18,000 microseconds.
> >> Compared to the previous 33,000 microseconds, this also represents a
> >> significant improvement.
> >> 
> >> The modification is quite straightforward. The code below reflects the
> >> changes I have made based on this patch.
> >> 
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index bd46ed9361fe..1cc1f76d4020 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >>          return ret;
> >>   }
> >>   
> >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> >> +{
> >> +       long i;
> >> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> >> +
> >> +       for (i = 1; i < npage; ++i) {
> >> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> >> +                       break;
> >> +               next_pfn++;
> >> +       }
> >> +       return i;
> >> +}  
> >
> >
> >What might be faster is obtaining the folio, and then calculating the 
> >next expected page pointer, comparing whether the page pointers match.
> >
> >Essentially, using folio_page() to calculate the expected next page.
> >
> >nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP, 
> >so that might be rather fast.
> >
> >
> >So we'd obtain
> >
> >start_idx = folio_idx(folio, batch->pages[batch->offset]);  
> 
> Do you mean using folio_page_idx()?
> 
> >and then check for
> >
> >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)  
> 
> Thank you for your reminder. This is indeed a better solution.
> The updated code might look like this:
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index bd46ed9361fe..f9a11b1d8433 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>         return ret;
>  }
>  
> +static inline long continuous_pages_num(struct folio *folio,
> +               struct vfio_batch *batch, long npage)

Note this becomes long enough that we should just let the compiler
decide whether to inline or not.

> +{
> +       long i;
> +       unsigned long start_idx =
> +                       folio_page_idx(folio, batch->pages[batch->offset]);
> +
> +       for (i = 1; i < npage; ++i)
> +               if (batch->pages[batch->offset + i] !=
> +                               folio_page(folio, start_idx + i))
> +                       break;
> +       return i;
> +}
> +
>  /*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>                          */
>                         nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
>                                                 folio_page_idx(folio, batch->pages[batch->offset]));
> -                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> -                               nr_pages = 1;
> +                       if (nr_pages > 1) {
> +                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
> +                                       nr_pages = 1;
> +                               else
> +                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
> +                       }


I think we can refactor this a bit better, and maybe, if we're going to
the trouble of comparing pages, we can be a bit more resilient to pages
already accounted as vpfns.  I took a shot at it (compile tested only);
is there still a worthwhile gain?

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0ac56072af9f..e8bba32148f7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
 /*
  * Helper Functions for host iova-pfn list
  */
-static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+
+/*
+ * Find the first vfio_pfn that overlaps the range
+ * [iova_start, iova_end) in the rb tree.
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+		dma_addr_t iova_start, dma_addr_t iova_end)
 {
 	struct vfio_pfn *vpfn;
 	struct rb_node *node = dma->pfn_list.rb_node;
@@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	while (node) {
 		vpfn = rb_entry(node, struct vfio_pfn, node);
 
-		if (iova < vpfn->iova)
+		if (iova_end <= vpfn->iova)
 			node = node->rb_left;
-		else if (iova > vpfn->iova)
+		else if (iova_start > vpfn->iova)
 			node = node->rb_right;
 		else
 			return vpfn;
@@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	return NULL;
 }
 
+static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+{
+	return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
+}
+
 static void vfio_link_pfn(struct vfio_dma *dma,
 			  struct vfio_pfn *new)
 {
@@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	return ret;
 }
 
+static long contig_pages(struct vfio_dma *dma,
+			 struct vfio_batch *batch, dma_addr_t iova)
+{
+	struct page *page = batch->pages[batch->offset];
+	struct folio *folio = page_folio(page);
+	long idx = folio_page_idx(folio, page);
+	long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
+	long nr_pages;
+
+	for (nr_pages = 1; nr_pages < max; nr_pages++) {
+		if (batch->pages[batch->offset + nr_pages] !=
+		    folio_page(folio, idx + nr_pages))
+			break;
+	}
+
+	return nr_pages;
+}
+
+static long vpfn_pages(struct vfio_dma *dma,
+		       dma_addr_t iova_start, long nr_pages)
+{
+	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
+	struct vfio_pfn *vpfn;
+	long count = 0;
+
+	do {
+		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);
+		if (likely(!vpfn))
+			break;
+
+		count++;
+		iova_start = vpfn->iova + PAGE_SIZE;
+	} while (iova_start < iova_end);
+
+	return count;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 		 * and rsvd here, and therefore continues to use the batch.
 		 */
 		while (true) {
+			long nr_pages, acct_pages = 0;
+
 			if (pfn != *pfn_base + pinned ||
 			    rsvd != is_invalid_reserved_pfn(pfn))
 				goto out;
 
+			nr_pages = contig_pages(dma, batch, iova);
+			if (!rsvd) {
+				acct_pages = nr_pages;
+				acct_pages -= vpfn_pages(dma, iova, nr_pages);
+			}
+
 			/*
 			 * Reserved pages aren't counted against the user,
 			 * externally pinned pages are already counted against
 			 * the user.
 			 */
-			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
+			if (acct_pages) {
 				if (!dma->lock_cap &&
-				    mm->locked_vm + lock_acct + 1 > limit) {
+				    mm->locked_vm + lock_acct + acct_pages > limit) {
 					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 						__func__, limit << PAGE_SHIFT);
 					ret = -ENOMEM;
 					goto unpin_out;
 				}
-				lock_acct++;
+				lock_acct += acct_pages;
 			}
 
-			pinned++;
-			npage--;
-			vaddr += PAGE_SIZE;
-			iova += PAGE_SIZE;
-			batch->offset++;
-			batch->size--;
+			pinned += nr_pages;
+			npage -= nr_pages;
+			vaddr += PAGE_SIZE * nr_pages;
+			iova += PAGE_SIZE * nr_pages;
+			batch->offset += nr_pages;
+			batch->size -= nr_pages;
 
 			if (!batch->size)
 				break;



^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-22 20:52         ` Alex Williamson
@ 2025-05-23  3:42           ` lizhe.67
  2025-05-23 14:54             ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: lizhe.67 @ 2025-05-23  3:42 UTC (permalink / raw)
  To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song, peterx

On Thu, 22 May 2025 14:52:07 -0600, alex.williamson@redhat.com wrote: 

> On Thu, 22 May 2025 16:25:24 +0800
> lizhe.67@bytedance.com wrote:
> 
> > On Thu, 22 May 2025 09:22:50 +0200, david@redhat.com wrote:
> > 
> > >On 22.05.25 05:49, lizhe.67@bytedance.com wrote:  
> > >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
> > >>   
> > >>>> From: Li Zhe <lizhe.67@bytedance.com>
> > >>>>
> > >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> > >>>> includes large folios, the function currently performs individual
> > >>>> statistics counting operations for each page. This can lead to significant
> > >>>> performance overheads, especially when dealing with large ranges of pages.
> > >>>>
> > >>>> This patch optimize this process by batching the statistics counting
> > >>>> operations.
> > >>>>
> > >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > >>>> address space has been mapped to physical memory using hugetlbfs with
> > >>>> pagesize=2M.
> > >>>>
> > >>>> Before this patch:
> > >>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> > >>>>
> > >>>> After this patch:
> > >>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> > >>>>
> > >>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > >>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> > >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > >>>> ---  
> > >>>
> > >>> Given the discussion on v3, this is currently a Nak.  Follow-up in that
> > >>> thread if there are further ideas how to salvage this.  Thanks,  
> > >> 
> > >> How about considering the solution David mentioned to check whether the
> > >> pages or PFNs are actually consecutive?
> > >> 
> > >> I have conducted a preliminary attempt, and the performance testing
> > >> revealed that the time consumption is approximately 18,000 microseconds.
> > >> Compared to the previous 33,000 microseconds, this also represents a
> > >> significant improvement.
> > >> 
> > >> The modification is quite straightforward. The code below reflects the
> > >> changes I have made based on this patch.
> > >> 
> > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > >> index bd46ed9361fe..1cc1f76d4020 100644
> > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > >>          return ret;
> > >>   }
> > >>   
> > >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> > >> +{
> > >> +       long i;
> > >> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> > >> +
> > >> +       for (i = 1; i < npage; ++i) {
> > >> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> > >> +                       break;
> > >> +               next_pfn++;
> > >> +       }
> > >> +       return i;
> > >> +}  
> > >
> > >
> > >What might be faster is obtaining the folio, and then calculating the 
> > >next expected page pointer, comparing whether the page pointers match.
> > >
> > >Essentially, using folio_page() to calculate the expected next page.
> > >
> > >nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP, 
> > >so that might be rather fast.
> > >
> > >
> > >So we'd obtain
> > >
> > >start_idx = folio_idx(folio, batch->pages[batch->offset]);  
> > 
> > Do you mean using folio_page_idx()?
> > 
> > >and then check for
> > >
> > >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)  
> > 
> > Thank you for your reminder. This is indeed a better solution.
> > The updated code might look like this:
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index bd46ed9361fe..f9a11b1d8433 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >         return ret;
> >  }
> >  
> > +static inline long continuous_pages_num(struct folio *folio,
> > +               struct vfio_batch *batch, long npage)
> 
> Note this becomes long enough that we should just let the compiler
> decide whether to inline or not.

Thank you! The 'inline' here indeed needs to be removed.

> > +{
> > +       long i;
> > +       unsigned long start_idx =
> > +                       folio_page_idx(folio, batch->pages[batch->offset]);
> > +
> > +       for (i = 1; i < npage; ++i)
> > +               if (batch->pages[batch->offset + i] !=
> > +                               folio_page(folio, start_idx + i))
> > +                       break;
> > +       return i;
> > +}
> > +
> >  /*
> >   * Attempt to pin pages.  We really don't want to track all the pfns and
> >   * the iommu can only map chunks of consecutive pfns anyway, so get the
> > @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >                          */
> >                         nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
> >                                                 folio_page_idx(folio, batch->pages[batch->offset]));
> > -                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > -                               nr_pages = 1;
> > +                       if (nr_pages > 1) {
> > +                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
> > +                                       nr_pages = 1;
> > +                               else
> > +                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
> > +                       }
> 
> 
> I think we can refactor this a bit better and maybe if we're going to
> the trouble of comparing pages we can be a bit more resilient to pages
> already accounted as vpfns.  I took a shot at it, compile tested only,
> is there still a worthwhile gain?
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..e8bba32148f7 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>  /*
>   * Helper Functions for host iova-pfn list
>   */
> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +
> +/*
> + * Find the first vfio_pfn that overlapping the range
> + * [iova_start, iova_end) in rb tree.
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> +		dma_addr_t iova_start, dma_addr_t iova_end)
>  {
>  	struct vfio_pfn *vpfn;
>  	struct rb_node *node = dma->pfn_list.rb_node;
> @@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	while (node) {
>  		vpfn = rb_entry(node, struct vfio_pfn, node);
>  
> -		if (iova < vpfn->iova)
> +		if (iova_end <= vpfn->iova)
>  			node = node->rb_left;
> -		else if (iova > vpfn->iova)
> +		else if (iova_start > vpfn->iova)
>  			node = node->rb_right;
>  		else
>  			return vpfn;
> @@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	return NULL;
>  }
>  
> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +{
> +	return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
> +}
> +
>  static void vfio_link_pfn(struct vfio_dma *dma,
>  			  struct vfio_pfn *new)
>  {
> @@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	return ret;
>  }
>  
> +static long contig_pages(struct vfio_dma *dma,
> +			 struct vfio_batch *batch, dma_addr_t iova)
> +{
> +	struct page *page = batch->pages[batch->offset];
> +	struct folio *folio = page_folio(page);
> +	long idx = folio_page_idx(folio, page);
> +	long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
> +	long nr_pages;
> +
> +	for (nr_pages = 1; nr_pages < max; nr_pages++) {
> +		if (batch->pages[batch->offset + nr_pages] !=
> +		    folio_page(folio, idx + nr_pages))
> +			break;
> +	}
> +
> +	return nr_pages;
> +}
> +
> +static long vpfn_pages(struct vfio_dma *dma,
> +		       dma_addr_t iova_start, long nr_pages)
> +{
> +	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> +	struct vfio_pfn *vpfn;
> +	long count = 0;
> +
> +	do {
> +		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);

I am somewhat confused here. Function vfio_find_vpfn_range() is designed
to find, through the rb tree, the node that is closest to the root node
and satisfies the overlap condition within the range [iova_start, iova_end),
rather than the node closest to iova_start? Or perhaps I have
misunderstood something?

> +		if (likely(!vpfn))
> +			break;
> +
> +		count++;
> +		iova_start = vpfn->iova + PAGE_SIZE;
> +	} while (iova_start < iova_end);
> +
> +	return count;
> +}
> +
>  /*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  		 * and rsvd here, and therefore continues to use the batch.
>  		 */
>  		while (true) {
> +			long nr_pages, acct_pages = 0;
> +
>  			if (pfn != *pfn_base + pinned ||
>  			    rsvd != is_invalid_reserved_pfn(pfn))
>  				goto out;
>  
> +			nr_pages = contig_pages(dma, batch, iova);
> +			if (!rsvd) {
> +				acct_pages = nr_pages;
> +				acct_pages -= vpfn_pages(dma, iova, nr_pages);
> +			}
> +
>  			/*
>  			 * Reserved pages aren't counted against the user,
>  			 * externally pinned pages are already counted against
>  			 * the user.
>  			 */
> -			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> +			if (acct_pages) {
>  				if (!dma->lock_cap &&
> -				    mm->locked_vm + lock_acct + 1 > limit) {
> +				    mm->locked_vm + lock_acct + acct_pages > limit) {
>  					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
>  						__func__, limit << PAGE_SHIFT);
>  					ret = -ENOMEM;
>  					goto unpin_out;
>  				}
> -				lock_acct++;
> +				lock_acct += acct_pages;
>  			}
>  
> -			pinned++;
> -			npage--;
> -			vaddr += PAGE_SIZE;
> -			iova += PAGE_SIZE;
> -			batch->offset++;
> -			batch->size--;
> +			pinned += nr_pages;
> +			npage -= nr_pages;
> +			vaddr += PAGE_SIZE * nr_pages;
> +			iova += PAGE_SIZE * nr_pages;
> +			batch->offset += nr_pages;
> +			batch->size -= nr_pages;
>  
>  			if (!batch->size)
>  				break;

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-23  3:42           ` lizhe.67
@ 2025-05-23 14:54             ` Alex Williamson
  2025-05-26  3:37               ` lizhe.67
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2025-05-23 14:54 UTC (permalink / raw)
  To: lizhe.67; +Cc: david, kvm, linux-kernel, muchun.song, peterx

On Fri, 23 May 2025 11:42:38 +0800
lizhe.67@bytedance.com wrote:

> On Thu, 22 May 2025 14:52:07 -0600, alex.williamson@redhat.com wrote: 
> 
> > On Thu, 22 May 2025 16:25:24 +0800
> > lizhe.67@bytedance.com wrote:
> >   
> > > On Thu, 22 May 2025 09:22:50 +0200, david@redhat.com wrote:
> > >   
> > > >On 22.05.25 05:49, lizhe.67@bytedance.com wrote:    
> > > >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
> > > >>     
> > > >>>> From: Li Zhe <lizhe.67@bytedance.com>
> > > >>>>
> > > >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> > > >>>> includes large folios, the function currently performs individual
> > > >>>> statistics counting operations for each page. This can lead to significant
> > > >>>> performance overheads, especially when dealing with large ranges of pages.
> > > >>>>
> > > >>>> This patch optimize this process by batching the statistics counting
> > > >>>> operations.
> > > >>>>
> > > >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > >>>> address space has been mapped to physical memory using hugetlbfs with
> > > >>>> pagesize=2M.
> > > >>>>
> > > >>>> Before this patch:
> > > >>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> > > >>>>
> > > >>>> After this patch:
> > > >>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> > > >>>>
> > > >>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > > >>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> > > >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > >>>> ---    
> > > >>>
> > > >>> Given the discussion on v3, this is currently a Nak.  Follow-up in that
> > > >>> thread if there are further ideas how to salvage this.  Thanks,    
> > > >> 
> > > >> How about considering the solution David mentioned to check whether the
> > > >> pages or PFNs are actually consecutive?
> > > >> 
> > > >> I have conducted a preliminary attempt, and the performance testing
> > > >> revealed that the time consumption is approximately 18,000 microseconds.
> > > >> Compared to the previous 33,000 microseconds, this also represents a
> > > >> significant improvement.
> > > >> 
> > > >> The modification is quite straightforward. The code below reflects the
> > > >> changes I have made based on this patch.
> > > >> 
> > > >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > >> index bd46ed9361fe..1cc1f76d4020 100644
> > > >> --- a/drivers/vfio/vfio_iommu_type1.c
> > > >> +++ b/drivers/vfio/vfio_iommu_type1.c
> > > >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > > >>          return ret;
> > > >>   }
> > > >>   
> > > >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> > > >> +{
> > > >> +       long i;
> > > >> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> > > >> +
> > > >> +       for (i = 1; i < npage; ++i) {
> > > >> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> > > >> +                       break;
> > > >> +               next_pfn++;
> > > >> +       }
> > > >> +       return i;
> > > >> +}    
> > > >
> > > >
> > > >What might be faster is obtaining the folio, and then calculating the 
> > > >next expected page pointer, comparing whether the page pointers match.
> > > >
> > > >Essentially, using folio_page() to calculate the expected next page.
> > > >
> > > >nth_page() is a simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP, 
> > > >so that might be rather fast.
> > > >
> > > >
> > > >So we'd obtain
> > > >
> > > >start_idx = folio_idx(folio, batch->pages[batch->offset]);    
> > > 
> > > Do you mean using folio_page_idx()?
> > >   
> > > >and then check for
> > > >
> > > >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)    
> > > 
> > > Thank you for your reminder. This is indeed a better solution.
> > > The updated code might look like this:
> > > 
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index bd46ed9361fe..f9a11b1d8433 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> > >         return ret;
> > >  }
> > >  
> > > +static inline long continuous_pages_num(struct folio *folio,
> > > +               struct vfio_batch *batch, long npage)  
> > 
> > Note this becomes long enough that we should just let the compiler
> > decide whether to inline or not.  
> 
> Thank you! The 'inline' here indeed needs to be removed.
> 
> > > +{
> > > +       long i;
> > > +       unsigned long start_idx =
> > > +                       folio_page_idx(folio, batch->pages[batch->offset]);
> > > +
> > > +       for (i = 1; i < npage; ++i)
> > > +               if (batch->pages[batch->offset + i] !=
> > > +                               folio_page(folio, start_idx + i))
> > > +                       break;
> > > +       return i;
> > > +}
> > > +
> > >  /*
> > >   * Attempt to pin pages.  We really don't want to track all the pfns and
> > >   * the iommu can only map chunks of consecutive pfns anyway, so get the
> > > @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > >                          */
> > >                         nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
> > >                                                 folio_page_idx(folio, batch->pages[batch->offset]));
> > > -                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > -                               nr_pages = 1;
> > > +                       if (nr_pages > 1) {
> > > +                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
> > > +                                       nr_pages = 1;
> > > +                               else
> > > +                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
> > > +                       }  
> > 
> > 
> > I think we can refactor this a bit better and maybe if we're going to
> > the trouble of comparing pages we can be a bit more resilient to pages
> > already accounted as vpfns.  I took a shot at it, compile tested only,
> > is there still a worthwhile gain?
> > 
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 0ac56072af9f..e8bba32148f7 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> >  /*
> >   * Helper Functions for host iova-pfn list
> >   */
> > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +
> > +/*
> > + * Find the first vfio_pfn that overlapping the range
> > + * [iova_start, iova_end) in rb tree.
> > + */
> > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > +		dma_addr_t iova_start, dma_addr_t iova_end)
> >  {
> >  	struct vfio_pfn *vpfn;
> >  	struct rb_node *node = dma->pfn_list.rb_node;
> > @@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> >  	while (node) {
> >  		vpfn = rb_entry(node, struct vfio_pfn, node);
> >  
> > -		if (iova < vpfn->iova)
> > +		if (iova_end <= vpfn->iova)
> >  			node = node->rb_left;
> > -		else if (iova > vpfn->iova)
> > +		else if (iova_start > vpfn->iova)
> >  			node = node->rb_right;
> >  		else
> >  			return vpfn;
> > @@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> >  	return NULL;
> >  }
> >  
> > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +{
> > +	return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
> > +}
> > +
> >  static void vfio_link_pfn(struct vfio_dma *dma,
> >  			  struct vfio_pfn *new)
> >  {
> > @@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >  	return ret;
> >  }
> >  
> > +static long contig_pages(struct vfio_dma *dma,
> > +			 struct vfio_batch *batch, dma_addr_t iova)
> > +{
> > +	struct page *page = batch->pages[batch->offset];
> > +	struct folio *folio = page_folio(page);
> > +	long idx = folio_page_idx(folio, page);
> > +	long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
> > +	long nr_pages;
> > +
> > +	for (nr_pages = 1; nr_pages < max; nr_pages++) {
> > +		if (batch->pages[batch->offset + nr_pages] !=
> > +		    folio_page(folio, idx + nr_pages))
> > +			break;
> > +	}
> > +
> > +	return nr_pages;
> > +}
> > +
> > +static long vpfn_pages(struct vfio_dma *dma,
> > +		       dma_addr_t iova_start, long nr_pages)
> > +{
> > +	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> > +	struct vfio_pfn *vpfn;
> > +	long count = 0;
> > +
> > +	do {
> > +		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);  
> 
> I am somewhat confused here. Isn't vfio_find_vpfn_range() designed
> to find, through the rb tree, the node closest to the root that
> satisfies the condition within the range [iova_start, iova_end),
> rather than the node closest to iova_start? Or perhaps I have
> misunderstood something?

Sorry, that's an oversight on my part.  We might forego the _range
version and just do an inline walk of the tree counting the number of
already accounted pfns within the range.  Thanks,

Alex
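
For illustration, a minimal untested sketch of such an inline walk over
the iova-ordered pfn_list rb tree might look like the following (this is
not part of the posted patch; the helper name and exact form are only
illustrative):

static long vpfn_pages(struct vfio_dma *dma, dma_addr_t iova, long npage)
{
	dma_addr_t end = iova + (dma_addr_t)npage * PAGE_SIZE;
	struct rb_node *node = dma->pfn_list.rb_node;
	struct rb_node *first = NULL;
	long count = 0;

	/* Descend to the leftmost node with vpfn->iova >= iova. */
	while (node) {
		struct vfio_pfn *vpfn = rb_entry(node, struct vfio_pfn, node);

		if (vpfn->iova >= iova) {
			first = node;
			node = node->rb_left;
		} else {
			node = node->rb_right;
		}
	}

	/* Walk forward, counting entries until we leave the range. */
	for (node = first; node; node = rb_next(node)) {
		struct vfio_pfn *vpfn = rb_entry(node, struct vfio_pfn, node);

		if (vpfn->iova >= end)
			break;
		count++;
	}

	return count;
}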

> > +		if (likely(!vpfn))
> > +			break;
> > +
> > +		count++;
> > +		iova_start = vpfn->iova + PAGE_SIZE;
> > +	} while (iova_start < iova_end);
> > +
> > +	return count;
> > +}
> > +
> >  /*
> >   * Attempt to pin pages.  We really don't want to track all the pfns and
> >   * the iommu can only map chunks of consecutive pfns anyway, so get the
> > @@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >  		 * and rsvd here, and therefore continues to use the batch.
> >  		 */
> >  		while (true) {
> > +			long nr_pages, acct_pages = 0;
> > +
> >  			if (pfn != *pfn_base + pinned ||
> >  			    rsvd != is_invalid_reserved_pfn(pfn))
> >  				goto out;
> >  
> > +			nr_pages = contig_pages(dma, batch, iova);
> > +			if (!rsvd) {
> > +				acct_pages = nr_pages;
> > +				acct_pages -= vpfn_pages(dma, iova, nr_pages);
> > +			}
> > +
> >  			/*
> >  			 * Reserved pages aren't counted against the user,
> >  			 * externally pinned pages are already counted against
> >  			 * the user.
> >  			 */
> > -			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> > +			if (acct_pages) {
> >  				if (!dma->lock_cap &&
> > -				    mm->locked_vm + lock_acct + 1 > limit) {
> > +				    mm->locked_vm + lock_acct + acct_pages > limit) {
> >  					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> >  						__func__, limit << PAGE_SHIFT);
> >  					ret = -ENOMEM;
> >  					goto unpin_out;
> >  				}
> > -				lock_acct++;
> > +				lock_acct += acct_pages;
> >  			}
> >  
> > -			pinned++;
> > -			npage--;
> > -			vaddr += PAGE_SIZE;
> > -			iova += PAGE_SIZE;
> > -			batch->offset++;
> > -			batch->size--;
> > +			pinned += nr_pages;
> > +			npage -= nr_pages;
> > +			vaddr += PAGE_SIZE * nr_pages;
> > +			iova += PAGE_SIZE * nr_pages;
> > +			batch->offset += nr_pages;
> > +			batch->size -= nr_pages;
> >  
> >  			if (!batch->size)
> >  				break;  
> 
> Thanks,
> Zhe
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-23 14:54             ` Alex Williamson
@ 2025-05-26  3:37               ` lizhe.67
  2025-05-27 19:14                 ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: lizhe.67 @ 2025-05-26  3:37 UTC (permalink / raw)
  To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song, peterx

On Fri, 23 May 2025 08:54:15 -0600, alex.williamson@redhat.com wrote: 

> > > +static long vpfn_pages(struct vfio_dma *dma,
> > > +		       dma_addr_t iova_start, long nr_pages)
> > > +{
> > > +	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> > > +	struct vfio_pfn *vpfn;
> > > +	long count = 0;
> > > +
> > > +	do {
> > > +		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);
> > 
> > I am somewhat confused here. Isn't vfio_find_vpfn_range() designed
> > to find, through the rb tree, the node closest to the root that
> > satisfies the condition within the range [iova_start, iova_end),
> > rather than the node closest to iova_start? Or perhaps I have
> > misunderstood something?
> 
> Sorry, that's an oversight on my part.  We might forego the _range
> version and just do an inline walk of the tree counting the number of
> already accounted pfns within the range.  Thanks,
> 
> Alex
> 
> > > +		if (likely(!vpfn))
> > > +			break;
> > > +
> > > +		count++;
> > > +		iova_start = vpfn->iova + PAGE_SIZE;
> > > +	} while (iova_start < iova_end);
> > > +
> > > +	return count;
> > > +}

Using a vpfn_pages() helper is undoubtedly a good idea. It can
quickly determine the number of vpfn pages within a specified
range, which should speed up vfio_pin_pages_remote(). Given that
vfio_find_vpfn_range() returns the "top" node in the rb tree that
satisfies the condition within the range [iova_start, iova_end),
might we consider implementing vpfn_pages() with the following
approach?

+static long _vpfn_pages(struct vfio_pfn *vpfn,
+               dma_addr_t iova_start, dma_addr_t iova_end)
+{
+       struct vfio_pfn *left;
+       struct vfio_pfn *right;
+
+       if (!vpfn)
+               return 0;
+
+       left = vpfn->node.rb_left ?
+               rb_entry(vpfn->node.rb_left, struct vfio_pfn, node) : NULL;
+       right = vpfn->node.rb_right ?
+               rb_entry(vpfn->node.rb_right, struct vfio_pfn, node) : NULL;
+
+       if ((vpfn->iova >= iova_start) && (vpfn->iova < iova_end))
+               return 1 + _vpfn_pages(left, iova_start, iova_end) +
+                               _vpfn_pages(right, iova_start, iova_end);
+
+       if (vpfn->iova >= iova_end)
+               return _vpfn_pages(left, iova_start, iova_end);
+
+       return _vpfn_pages(right, iova_start, iova_end);
+}
+
+static long vpfn_pages(struct vfio_dma *dma,
+               dma_addr_t iova_start, long nr_pages)
+{
+       dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
+       struct vfio_pfn *top = vfio_find_vpfn_range(dma, iova_start, iova_end);
+
+       return _vpfn_pages(top, iova_start, iova_end);
+}

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-26  3:37               ` lizhe.67
@ 2025-05-27 19:14                 ` Alex Williamson
  2025-05-28  4:21                   ` lizhe.67
  0 siblings, 1 reply; 13+ messages in thread
From: Alex Williamson @ 2025-05-27 19:14 UTC (permalink / raw)
  To: lizhe.67; +Cc: david, kvm, linux-kernel, muchun.song, peterx

On Mon, 26 May 2025 11:37:37 +0800
lizhe.67@bytedance.com wrote:

> On Fri, 23 May 2025 08:54:15 -0600, alex.williamson@redhat.com wrote: 
> 
> > > > +static long vpfn_pages(struct vfio_dma *dma,
> > > > +		       dma_addr_t iova_start, long nr_pages)
> > > > +{
> > > > +	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> > > > +	struct vfio_pfn *vpfn;
> > > > +	long count = 0;
> > > > +
> > > > +	do {
> > > > +		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);  
> > > 
> > > I am somewhat confused here. Isn't vfio_find_vpfn_range() designed
> > > to find, through the rb tree, the node closest to the root that
> > > satisfies the condition within the range [iova_start, iova_end),
> > > rather than the node closest to iova_start? Or perhaps I have
> > > misunderstood something?
> > 
> > Sorry, that's an oversight on my part.  We might forego the _range
> > version and just do an inline walk of the tree counting the number of
> > already accounted pfns within the range.  Thanks,
> > 
> > Alex
> >   
> > > > +		if (likely(!vpfn))
> > > > +			break;
> > > > +
> > > > +		count++;
> > > > +		iova_start = vpfn->iova + PAGE_SIZE;
> > > > +	} while (iova_start < iova_end);
> > > > +
> > > > +	return count;
> > > > +}  
> 
> Using a vpfn_pages() helper is undoubtedly a good idea. It can
> quickly determine the number of vpfn pages within a specified
> range, which should speed up vfio_pin_pages_remote(). Given that
> vfio_find_vpfn_range() returns the "top" node in the rb tree that
> satisfies the condition within the range [iova_start, iova_end),
> might we consider implementing vpfn_pages() with the following
> approach?
> 
> +static long _vpfn_pages(struct vfio_pfn *vpfn,
> +               dma_addr_t iova_start, dma_addr_t iova_end)
> +{
> +       struct vfio_pfn *left;
> +       struct vfio_pfn *right;
> +
> +       if (!vpfn)
> +               return 0;
> +
> +       left = vpfn->node.rb_left ?
> +               rb_entry(vpfn->node.rb_left, struct vfio_pfn, node) : NULL;
> +       right = vpfn->node.rb_right ?
> +               rb_entry(vpfn->node.rb_right, struct vfio_pfn, node) : NULL;
> +
> +       if ((vpfn->iova >= iova_start) && (vpfn->iova < iova_end))
> +               return 1 + _vpfn_pages(left, iova_start, iova_end) +
> +                               _vpfn_pages(right, iova_start, iova_end);
> +
> +       if (vpfn->iova >= iova_end)
> +               return _vpfn_pages(left, iova_start, iova_end);
> +
> +       return _vpfn_pages(right, iova_start, iova_end);
> +}

Recursion doesn't seem like a good fit here, the depth is practically
unbounded.  Why not just use the new range function to find the highest
point in the tree that intersects, then search each direction in
separate loops (rb_next/rb_prev), counting additional entries within
the range?  Thanks,

Alex

> +
> +static long vpfn_pages(struct vfio_dma *dma,
> +               dma_addr_t iova_start, long nr_pages)
> +{
> +       dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> +       struct vfio_pfn *top = vfio_find_vpfn_range(dma, iova_start, iova_end);
> +
> +       return _vpfn_pages(top, iova_start, iova_end);
> +}
> 
> Thanks,
> Zhe
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-27 19:14                 ` Alex Williamson
@ 2025-05-28  4:21                   ` lizhe.67
  2025-05-28 20:11                     ` Alex Williamson
  0 siblings, 1 reply; 13+ messages in thread
From: lizhe.67 @ 2025-05-28  4:21 UTC (permalink / raw)
  To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song, peterx

On Tue, 27 May 2025 13:14:50 -0600, alex.williamson@redhat.com wrote: 

> > Using a vpfn_pages() helper is undoubtedly a good idea. It can
> > quickly determine the number of vpfn pages within a specified
> > range, which should speed up vfio_pin_pages_remote(). Given that
> > vfio_find_vpfn_range() returns the "top" node in the rb tree that
> > satisfies the condition within the range [iova_start, iova_end),
> > might we consider implementing vpfn_pages() with the following
> > approach?
> > 
> > +static long _vpfn_pages(struct vfio_pfn *vpfn,
> > +               dma_addr_t iova_start, dma_addr_t iova_end)
> > +{
> > +       struct vfio_pfn *left;
> > +       struct vfio_pfn *right;
> > +
> > +       if (!vpfn)
> > +               return 0;
> > +
> > +       left = vpfn->node.rb_left ?
> > +               rb_entry(vpfn->node.rb_left, struct vfio_pfn, node) : NULL;
> > +       right = vpfn->node.rb_right ?
> > +               rb_entry(vpfn->node.rb_right, struct vfio_pfn, node) : NULL;
> > +
> > +       if ((vpfn->iova >= iova_start) && (vpfn->iova < iova_end))
> > +               return 1 + _vpfn_pages(left, iova_start, iova_end) +
> > +                               _vpfn_pages(right, iova_start, iova_end);
> > +
> > +       if (vpfn->iova >= iova_end)
> > +               return _vpfn_pages(left, iova_start, iova_end);
> > +
> > +       return _vpfn_pages(right, iova_start, iova_end);
> > +}
> 
> Recursion doesn't seem like a good fit here, the depth is practically
> unbounded.  Why not just use the new range function to find the highest
> point in the tree that intersects, then search each direction in
> separate loops (rb_next/rb_prev), counting additional entries within
> the range?  Thanks,
> 
> Alex

Oh, I see what you mean. So the implementation of vpfn_pages() might be
something like this.

+static long vpfn_pages(struct vfio_dma *dma,
+               dma_addr_t iova_start, long nr_pages)
+{
+       dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
+       struct vfio_pfn *top = vfio_find_vpfn_range(dma, iova_start, iova_end);
+       long ret = 1;
+       struct vfio_pfn *vpfn;
+       struct rb_node *prev;
+       struct rb_node *next;
+
+       if (likely(!top))
+               return 0;
+
+       prev = next = &top->node;
+
+       while ((prev = rb_prev(prev))) {
+               vpfn = rb_entry(prev, struct vfio_pfn, node);
+               if (vpfn->iova < iova_start)
+                       break;
+               ret++;
+       }
+
+       while ((next = rb_next(next))) {
+               vpfn = rb_entry(next, struct vfio_pfn, node);
+               if (vpfn->iova >= iova_end)
+                       break;
+               ret++;
+       }
+
+       return ret;
+}
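
For context, Alex's earlier diff in this thread would consume this
helper in vfio_pin_pages_remote() as:

			nr_pages = contig_pages(dma, batch, iova);
			if (!rsvd) {
				acct_pages = nr_pages;
				acct_pages -= vpfn_pages(dma, iova, nr_pages);
			}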

Thanks,
Zhe


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
  2025-05-28  4:21                   ` lizhe.67
@ 2025-05-28 20:11                     ` Alex Williamson
  0 siblings, 0 replies; 13+ messages in thread
From: Alex Williamson @ 2025-05-28 20:11 UTC (permalink / raw)
  To: lizhe.67; +Cc: david, kvm, linux-kernel, muchun.song, peterx

On Wed, 28 May 2025 12:21:24 +0800
lizhe.67@bytedance.com wrote:

> On Tue, 27 May 2025 13:14:50 -0600, alex.williamson@redhat.com wrote: 
> 
> > > Using a vpfn_pages() helper is undoubtedly a good idea. It can
> > > quickly determine the number of vpfn pages within a specified
> > > range, which should speed up vfio_pin_pages_remote(). Given that
> > > vfio_find_vpfn_range() returns the "top" node in the rb tree that
> > > satisfies the condition within the range [iova_start, iova_end),
> > > might we consider implementing vpfn_pages() with the following
> > > approach?
> > > 
> > > +static long _vpfn_pages(struct vfio_pfn *vpfn,
> > > +               dma_addr_t iova_start, dma_addr_t iova_end)
> > > +{
> > > +       struct vfio_pfn *left;
> > > +       struct vfio_pfn *right;
> > > +
> > > +       if (!vpfn)
> > > +               return 0;
> > > +
> > > +       left = vpfn->node.rb_left ?
> > > +               rb_entry(vpfn->node.rb_left, struct vfio_pfn, node) : NULL;
> > > +       right = vpfn->node.rb_right ?
> > > +               rb_entry(vpfn->node.rb_right, struct vfio_pfn, node) : NULL;
> > > +
> > > +       if ((vpfn->iova >= iova_start) && (vpfn->iova < iova_end))
> > > +               return 1 + _vpfn_pages(left, iova_start, iova_end) +
> > > +                               _vpfn_pages(right, iova_start, iova_end);
> > > +
> > > +       if (vpfn->iova >= iova_end)
> > > +               return _vpfn_pages(left, iova_start, iova_end);
> > > +
> > > +       return _vpfn_pages(right, iova_start, iova_end);
> > > +}  
> > 
> > Recursion doesn't seem like a good fit here, the depth is practically
> > unbounded.  Why not just use the new range function to find the highest
> > point in the tree that intersects, then search each direction in
> > separate loops (rb_next/rb_prev), counting additional entries within
> > the range?  Thanks,
> > 
> > Alex  
> 
> Oh, I see what you mean. So the implementation of vpfn_pages() might be
> something like this.
> 
> +static long vpfn_pages(struct vfio_dma *dma,
> +               dma_addr_t iova_start, long nr_pages)
> +{
> +       dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
> +       struct vfio_pfn *top = vfio_find_vpfn_range(dma, iova_start, iova_end);
> +       long ret = 1;
> +       struct vfio_pfn *vpfn;
> +       struct rb_node *prev;
> +       struct rb_node *next;
> +
> +       if (likely(!top))
> +               return 0;
> +
> +       prev = next = &top->node;
> +
> +       while ((prev = rb_prev(prev))) {
> +               vpfn = rb_entry(prev, struct vfio_pfn, node);
> +               if (vpfn->iova < iova_start)
> +                       break;
> +               ret++;
> +       }
> +
> +       while ((next = rb_next(next))) {
> +               vpfn = rb_entry(next, struct vfio_pfn, node);
> +               if (vpfn->iova >= iova_end)
> +                       break;
> +               ret++;
> +       }
> +
> +       return ret;
> +}

Yes, that looks like it should work to me.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-05-28 20:11 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21  4:25 [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folios lizhe.67
2025-05-21  6:36 ` David Hildenbrand
2025-05-21 19:17 ` Alex Williamson
2025-05-22  3:49   ` [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio lizhe.67
2025-05-22  7:22     ` David Hildenbrand
2025-05-22  8:25       ` lizhe.67
2025-05-22 20:52         ` Alex Williamson
2025-05-23  3:42           ` lizhe.67
2025-05-23 14:54             ` Alex Williamson
2025-05-26  3:37               ` lizhe.67
2025-05-27 19:14                 ` Alex Williamson
2025-05-28  4:21                   ` lizhe.67
2025-05-28 20:11                     ` Alex Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).