* [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
@ 2025-05-20 7:00 lizhe.67
2025-05-20 14:07 ` Alex Williamson
` (2 more replies)
0 siblings, 3 replies; 21+ messages in thread
From: lizhe.67 @ 2025-05-20 7:00 UTC (permalink / raw)
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67, muchun.song
From: Li Zhe <lizhe.67@bytedance.com>
When vfio_pin_pages_remote() is called with a range of addresses that
includes huge folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.
This patch optimizes this process by batching the statistics counting
operations.
The performance test results for completing the 8G VFIO IOMMU DMA mapping,
obtained through trace-cmd, are as follows. In this case, the 8G virtual
address space has been mapped to physical memory using hugetlbfs with
pagesize=2M.
Before this patch:
funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
After this patch:
funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
Changelogs:
v2->v3:
- Code simplification.
- Fix some issues in comments.
v1->v2:
- Fix some issues in comments and formatting.
- Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
- Move the processing logic for huge folio into the while(true) loop
and use a variable with a default value of 1 to indicate the number
of consecutive pages.
v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
1 file changed, 37 insertions(+), 11 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0ac56072af9f..48f06ce0e290 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
/*
* Helper Functions for host iova-pfn list
*/
-static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+
+/*
+ * Find the first vfio_pfn that overlapping the range
+ * [iova, iova + PAGE_SIZE * npage) in rb tree.
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+ dma_addr_t iova, unsigned long npage)
{
struct vfio_pfn *vpfn;
struct rb_node *node = dma->pfn_list.rb_node;
+ dma_addr_t end_iova = iova + PAGE_SIZE * npage;
while (node) {
vpfn = rb_entry(node, struct vfio_pfn, node);
- if (iova < vpfn->iova)
+ if (end_iova <= vpfn->iova)
node = node->rb_left;
else if (iova > vpfn->iova)
node = node->rb_right;
@@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
return NULL;
}
+static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+{
+ return vfio_find_vpfn_range(dma, iova, 1);
+}
+
static void vfio_link_pfn(struct vfio_dma *dma,
struct vfio_pfn *new)
{
@@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
* and rsvd here, and therefore continues to use the batch.
*/
while (true) {
+ struct folio *folio = page_folio(batch->pages[batch->offset]);
+ long nr_pages;
+
if (pfn != *pfn_base + pinned ||
rsvd != is_invalid_reserved_pfn(pfn))
goto out;
+ /*
+ * Note: The current nr_pages does not achieve the optimal
+ * performance in scenarios where folio_nr_pages() exceeds
+ * batch->capacity. It is anticipated that future enhancements
+ * will address this limitation.
+ */
+ nr_pages = min((long)batch->size, folio_nr_pages(folio) -
+ folio_page_idx(folio, batch->pages[batch->offset]));
+ if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
+ nr_pages = 1;
+
/*
* Reserved pages aren't counted against the user,
* externally pinned pages are already counted against
* the user.
*/
- if (!rsvd && !vfio_find_vpfn(dma, iova)) {
+ if (!rsvd && (nr_pages > 1 || !vfio_find_vpfn(dma, iova))) {
if (!dma->lock_cap &&
- mm->locked_vm + lock_acct + 1 > limit) {
+ mm->locked_vm + lock_acct + nr_pages > limit) {
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
ret = -ENOMEM;
goto unpin_out;
}
- lock_acct++;
+ lock_acct += nr_pages;
}
- pinned++;
- npage--;
- vaddr += PAGE_SIZE;
- iova += PAGE_SIZE;
- batch->offset++;
- batch->size--;
+ pinned += nr_pages;
+ npage -= nr_pages;
+ vaddr += PAGE_SIZE * nr_pages;
+ iova += PAGE_SIZE * nr_pages;
+ batch->offset += nr_pages;
+ batch->size -= nr_pages;
if (!batch->size)
break;
--
2.20.1
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 7:00 [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio lizhe.67
@ 2025-05-20 14:07 ` Alex Williamson
2025-05-20 16:59 ` Peter Xu
2025-05-21 3:34 ` lizhe.67
2025-05-20 17:38 ` David Hildenbrand
2025-05-20 23:37 ` kernel test robot
2 siblings, 2 replies; 21+ messages in thread
From: Alex Williamson @ 2025-05-20 14:07 UTC (permalink / raw)
To: lizhe.67; +Cc: kvm, linux-kernel, muchun.song, Peter Xu, David Hildenbrand
On Tue, 20 May 2025 15:00:20 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes huge folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimize this process by batching the statistics counting
> operations.
>
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
>
> Before this patch:
> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>
> After this patch:
> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
It looks like we're using the same numbers since the initial
implementation; have these results changed?
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Appreciate the credit, this should probably be Co-developed-by: though.
In general a sign-off is something that needs to be explicitly given.
> ---
> Changelogs:
>
> v2->v3:
> - Code simplification.
> - Fix some issues in comments.
>
> v1->v2:
> - Fix some issues in comments and formatting.
> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> - Move the processing logic for huge folio into the while(true) loop
> and use a variable with a default value of 1 to indicate the number
> of consecutive pages.
>
> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
>
> drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> 1 file changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..48f06ce0e290 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> /*
> * Helper Functions for host iova-pfn list
> */
> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +
> +/*
> + * Find the first vfio_pfn that overlapping the range
> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> + dma_addr_t iova, unsigned long npage)
> {
> struct vfio_pfn *vpfn;
> struct rb_node *node = dma->pfn_list.rb_node;
> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>
> while (node) {
> vpfn = rb_entry(node, struct vfio_pfn, node);
>
> - if (iova < vpfn->iova)
> + if (end_iova <= vpfn->iova)
> node = node->rb_left;
> else if (iova > vpfn->iova)
> node = node->rb_right;
> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> return NULL;
> }
>
> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +{
> + return vfio_find_vpfn_range(dma, iova, 1);
> +}
> +
> static void vfio_link_pfn(struct vfio_dma *dma,
> struct vfio_pfn *new)
> {
> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> * and rsvd here, and therefore continues to use the batch.
> */
> while (true) {
> + struct folio *folio = page_folio(batch->pages[batch->offset]);
> + long nr_pages;
> +
> if (pfn != *pfn_base + pinned ||
> rsvd != is_invalid_reserved_pfn(pfn))
> goto out;
>
> + /*
> + * Note: The current nr_pages does not achieve the optimal
> + * performance in scenarios where folio_nr_pages() exceeds
> + * batch->capacity. It is anticipated that future enhancements
> + * will address this limitation.
> + */
> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> + folio_page_idx(folio, batch->pages[batch->offset]));
We should use min_t() here, otherwise it looks good to me.
Peter, David, if you wouldn't mind double checking the folio usage
here, I'd appreciate it. The underlying assumption used here is that
folios always have physically contiguous pages, so we can increment at
the remainder of the folio_nr_pages() rather than iterate each page.
Thanks,
Alex
> + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> + nr_pages = 1;
> +
> /*
> * Reserved pages aren't counted against the user,
> * externally pinned pages are already counted against
> * the user.
> */
> - if (!rsvd && !vfio_find_vpfn(dma, iova)) {
> + if (!rsvd && (nr_pages > 1 || !vfio_find_vpfn(dma, iova))) {
> if (!dma->lock_cap &&
> - mm->locked_vm + lock_acct + 1 > limit) {
> + mm->locked_vm + lock_acct + nr_pages > limit) {
> pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> __func__, limit << PAGE_SHIFT);
> ret = -ENOMEM;
> goto unpin_out;
> }
> - lock_acct++;
> + lock_acct += nr_pages;
> }
>
> - pinned++;
> - npage--;
> - vaddr += PAGE_SIZE;
> - iova += PAGE_SIZE;
> - batch->offset++;
> - batch->size--;
> + pinned += nr_pages;
> + npage -= nr_pages;
> + vaddr += PAGE_SIZE * nr_pages;
> + iova += PAGE_SIZE * nr_pages;
> + batch->offset += nr_pages;
> + batch->size -= nr_pages;
>
> if (!batch->size)
> break;
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 14:07 ` Alex Williamson
@ 2025-05-20 16:59 ` Peter Xu
2025-05-20 17:41 ` David Hildenbrand
2025-05-21 3:34 ` lizhe.67
1 sibling, 1 reply; 21+ messages in thread
From: Peter Xu @ 2025-05-20 16:59 UTC (permalink / raw)
To: Alex Williamson
Cc: lizhe.67, kvm, linux-kernel, muchun.song, David Hildenbrand
Hi, Alex,
On Tue, May 20, 2025 at 08:07:19AM -0600, Alex Williamson wrote:
> Peter, David, if you wouldn't mind double checking the folio usage
> here, I'd appreciate it. The underlying assumption used here is that
> folios always have physically contiguous pages, so we can increment at
> the remainder of the folio_nr_pages() rather than iterate each page.
Yes I think so. E.g., there's comment above folio definition too:
/**
* struct folio - Represents a contiguous set of bytes.
* ...
* A folio is a physically, virtually and logically contiguous set
* of bytes...
*/
For 1G, I wonder if in the future vfio can also use memfd_pin_folios()
internally when possible, e.g. after stumbling on top of a hugetlb folio
when filling the batch.
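For reference, its prototype (as it exists for the udmabuf use case;
quoted from memory, so please double-check) is roughly:

long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
                      struct folio **folios, unsigned int max_folios,
                      pgoff_t *offset);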
Thanks,
--
Peter Xu
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 7:00 [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio lizhe.67
2025-05-20 14:07 ` Alex Williamson
@ 2025-05-20 17:38 ` David Hildenbrand
2025-05-20 22:21 ` Alex Williamson
2025-05-21 3:45 ` lizhe.67
2025-05-20 23:37 ` kernel test robot
2 siblings, 2 replies; 21+ messages in thread
From: David Hildenbrand @ 2025-05-20 17:38 UTC (permalink / raw)
To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, muchun.song
On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
Subject: "huge folio" -> "large folios"
>
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes huge folios, the function currently performs individual
Similar, we call it a "large" folio.
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
>
> This patch optimize this process by batching the statistics counting
> operations.
>
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
>
> Before this patch:
> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>
> After this patch:
> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> Changelogs:
>
> v2->v3:
> - Code simplification.
> - Fix some issues in comments.
>
> v1->v2:
> - Fix some issues in comments and formatting.
> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> - Move the processing logic for huge folio into the while(true) loop
> and use a variable with a default value of 1 to indicate the number
> of consecutive pages.
>
> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
>
> drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> 1 file changed, 37 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..48f06ce0e290 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> /*
> * Helper Functions for host iova-pfn list
> */
> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +
> +/*
> + * Find the first vfio_pfn that overlapping the range
> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> + dma_addr_t iova, unsigned long npage)
> {
> struct vfio_pfn *vpfn;
> struct rb_node *node = dma->pfn_list.rb_node;
> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>
> while (node) {
> vpfn = rb_entry(node, struct vfio_pfn, node);
>
> - if (iova < vpfn->iova)
> + if (end_iova <= vpfn->iova)
> node = node->rb_left;
> else if (iova > vpfn->iova)
> node = node->rb_right;
> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> return NULL;
> }
>
> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> +{
> + return vfio_find_vpfn_range(dma, iova, 1);
> +}
> +
> static void vfio_link_pfn(struct vfio_dma *dma,
> struct vfio_pfn *new)
> {
> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> * and rsvd here, and therefore continues to use the batch.
> */
> while (true) {
> + struct folio *folio = page_folio(batch->pages[batch->offset]);
> + long nr_pages;
> +
> if (pfn != *pfn_base + pinned ||
> rsvd != is_invalid_reserved_pfn(pfn))
> goto out;
>
> + /*
> + * Note: The current nr_pages does not achieve the optimal
> + * performance in scenarios where folio_nr_pages() exceeds
> + * batch->capacity. It is anticipated that future enhancements
> + * will address this limitation.
> + */
> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> + folio_page_idx(folio, batch->pages[batch->offset]));
> + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> + nr_pages = 1;
You seem to assume that the batch really contains the consecutive pages
of that folio.
This is not the case if we obtained the pages through GUP and we have
(a) A MAP_PRIVATE mapping
(b) We span multiple different VMAs
Are we sure we can rule out (a) and (b)?
A more future-proof approach would be at least looking whether the
pages/pfns are actually consecutive.
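Something along these lines, for example (an untested sketch against the
vfio_batch used in this patch; the helper name is made up):

/*
 * Count how many pages in the batch, starting at batch->offset, really
 * have consecutive PFNs, instead of trusting the folio geometry alone.
 */
static long contig_pages(struct vfio_batch *batch, long max)
{
        unsigned long pfn = page_to_pfn(batch->pages[batch->offset]);
        long i;

        for (i = 1; i < max; i++) {
                if (page_to_pfn(batch->pages[batch->offset + i]) != pfn + i)
                        break;
        }
        return i;
}

The nr_pages derived from folio_nr_pages()/folio_page_idx() could then be
clamped by such a check.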
--
Cheers,
David / dhildenb
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 16:59 ` Peter Xu
@ 2025-05-20 17:41 ` David Hildenbrand
2025-05-20 21:56 ` Alex Williamson
0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2025-05-20 17:41 UTC (permalink / raw)
To: Peter Xu, Alex Williamson; +Cc: lizhe.67, kvm, linux-kernel, muchun.song
On 20.05.25 18:59, Peter Xu wrote:
> Hi, Alex,
>
> On Tue, May 20, 2025 at 08:07:19AM -0600, Alex Williamson wrote:
>> Peter, David, if you wouldn't mind double checking the folio usage
>> here, I'd appreciate it. The underlying assumption used here is that
>> folios always have physically contiguous pages, so we can increment at
>> the remainder of the folio_nr_pages() rather than iterate each page.
>
> Yes I think so. E.g., there's comment above folio definition too:
It has consecutive PFNs, yes (i.e., pfn++). The "struct page" might not
be consecutive (i.e., page++ does not work for larger folios).
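For illustration (just a sketch, not code from this patch):

/*
 * PFNs within a large folio are consecutive, but the struct pages
 * backing them may not be, so step by pfn (or folio_page()), never
 * by "page++".
 */
unsigned long pfn = folio_pfn(folio);
long i;

for (i = 0; i < folio_nr_pages(folio); i++) {
        struct page *page = pfn_to_page(pfn + i); /* == folio_page(folio, i) */

        /* ... use page ... */
}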
>
> /**
> * struct folio - Represents a contiguous set of bytes.
> * ...
> * A folio is a physically, virtually and logically contiguous set
> * of bytes...
> */
>
> For 1G, I wonder if in the future vfio can also use memfd_pin_folios()
> internally when possible, e.g. after stumbled on top of a hugetlb folio
> when filling the batch.
Yeah, or have a better GUP interface that gives us folio ranges instead
of individual pages.
Using memfd directly is obviously better where possible.
--
Cheers,
David / dhildenb
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 17:41 ` David Hildenbrand
@ 2025-05-20 21:56 ` Alex Williamson
0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2025-05-20 21:56 UTC (permalink / raw)
To: David Hildenbrand; +Cc: Peter Xu, lizhe.67, kvm, linux-kernel, muchun.song
On Tue, 20 May 2025 19:41:19 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 20.05.25 18:59, Peter Xu wrote:
> > Hi, Alex,
> >
> > On Tue, May 20, 2025 at 08:07:19AM -0600, Alex Williamson wrote:
> >> Peter, David, if you wouldn't mind double checking the folio usage
> >> here, I'd appreciate it. The underlying assumption used here is that
> >> folios always have physically contiguous pages, so we can increment at
> >> the remainder of the folio_nr_pages() rather than iterate each page.
> >
> > Yes I think so. E.g., there's comment above folio definition too:
>
> It has consecutive PFNs, yes (i.e., pfn++). The "struct page" might not
> be consecutive (i.e., page++ does not work for larger folios).
The former, contiguous PFNs is all we need here. We're feeding the
IOMMU mapping, so we're effectively just looking for the largest extent
of contiguous PFNs for mapping a given IOVA. The struct page is really
just for GUP, finding the next pfn, and with this, finding the offset
into the large folio.
> > /**
> > * struct folio - Represents a contiguous set of bytes.
> > * ...
> > * A folio is a physically, virtually and logically contiguous set
> > * of bytes...
> > */
> >
> > For 1G, I wonder if in the future vfio can also use memfd_pin_folios()
> > internally when possible, e.g. after stumbled on top of a hugetlb folio
> > when filling the batch.
>
> Yeah, or have a better GUP interface that gives us folio ranges instead
> of individual pages.
>
> Using memfd directly is obviously better where possible.
Yeah, we brought up some of these issues during previous reviews.
Ultimately we want to move to IOMMUFD, which already has these
features, but it still lacks P2P DMA mapping and isn't as well
supported by various management stacks. This leaves some performance
on the table, but has a pretty high return for a relatively trivial
change. Thanks,
Alex
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 17:38 ` David Hildenbrand
@ 2025-05-20 22:21 ` Alex Williamson
2025-05-21 6:35 ` David Hildenbrand
2025-05-21 3:45 ` lizhe.67
1 sibling, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2025-05-20 22:21 UTC (permalink / raw)
To: David Hildenbrand; +Cc: lizhe.67, kvm, linux-kernel, muchun.song
On Tue, 20 May 2025 19:38:45 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
> > From: Li Zhe <lizhe.67@bytedance.com>
>
> Subject: "huge folio" -> "large folios"
>
> >
> > When vfio_pin_pages_remote() is called with a range of addresses that
> > includes huge folios, the function currently performs individual
>
> Similar, we call it a "large" f
>
> > statistics counting operations for each page. This can lead to significant
> > performance overheads, especially when dealing with large ranges of pages.
> >
> > This patch optimize this process by batching the statistics counting
> > operations.
> >
> > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > address space has been mapped to physical memory using hugetlbfs with
> > pagesize=2M.
> >
> > Before this patch:
> > funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> >
> > After this patch:
> > funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> >
> > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > Changelogs:
> >
> > v2->v3:
> > - Code simplification.
> > - Fix some issues in comments.
> >
> > v1->v2:
> > - Fix some issues in comments and formatting.
> > - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> > - Move the processing logic for huge folio into the while(true) loop
> > and use a variable with a default value of 1 to indicate the number
> > of consecutive pages.
> >
> > v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> > v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
> >
> > drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> > 1 file changed, 37 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 0ac56072af9f..48f06ce0e290 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > /*
> > * Helper Functions for host iova-pfn list
> > */
> > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +
> > +/*
> > + * Find the first vfio_pfn that overlapping the range
> > + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> > + */
> > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > + dma_addr_t iova, unsigned long npage)
> > {
> > struct vfio_pfn *vpfn;
> > struct rb_node *node = dma->pfn_list.rb_node;
> > + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> >
> > while (node) {
> > vpfn = rb_entry(node, struct vfio_pfn, node);
> >
> > - if (iova < vpfn->iova)
> > + if (end_iova <= vpfn->iova)
> > node = node->rb_left;
> > else if (iova > vpfn->iova)
> > node = node->rb_right;
> > @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > return NULL;
> > }
> >
> > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > +{
> > + return vfio_find_vpfn_range(dma, iova, 1);
> > +}
> > +
> > static void vfio_link_pfn(struct vfio_dma *dma,
> > struct vfio_pfn *new)
> > {
> > @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > * and rsvd here, and therefore continues to use the batch.
> > */
> > while (true) {
> > + struct folio *folio = page_folio(batch->pages[batch->offset]);
> > + long nr_pages;
> > +
> > if (pfn != *pfn_base + pinned ||
> > rsvd != is_invalid_reserved_pfn(pfn))
> > goto out;
> >
> > + /*
> > + * Note: The current nr_pages does not achieve the optimal
> > + * performance in scenarios where folio_nr_pages() exceeds
> > + * batch->capacity. It is anticipated that future enhancements
> > + * will address this limitation.
> > + */
> > + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> > + folio_page_idx(folio, batch->pages[batch->offset]));
> > + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > + nr_pages = 1;
>
>
> You seem to assume that the batch really contains the consecutive pages
> of that folio.
I don't think we are. We're iterating through our batch of pages from
GUP to find consecutive pfns. We use the page to get the pfn, the
folio, and immediately above, the offset into the folio. batch->size is
the remaining length of the page array from GUP and batch->offset is our
current index into that array.
> This is not the case if we obtained the pages through GUP and we have
>
> (a) A MAP_PRIVATE mapping
>
> (b) We span multiple different VMAs
>
>
> Are we sure we can rule out (a) and (b)?
>
> A more future-proof approach would be at least looking whether the
> pages/pfns are actually consecutive.
The unmodified (pfn != *pfn_base + pinned) test is where we verify we
have the next consecutive pfn. Maybe I'm not catching the dependency
you're seeing on consecutive pages, I think there isn't one unless
we're somehow misusing folio_page_idx() to get the offset into the
folio. Thanks,
Alex
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 7:00 [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio lizhe.67
2025-05-20 14:07 ` Alex Williamson
2025-05-20 17:38 ` David Hildenbrand
@ 2025-05-20 23:37 ` kernel test robot
2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2025-05-20 23:37 UTC (permalink / raw)
To: lizhe.67, alex.williamson
Cc: oe-kbuild-all, kvm, linux-kernel, lizhe.67, muchun.song
Hi,
kernel test robot noticed the following build errors:
[auto build test ERROR on awilliam-vfio/next]
[also build test ERROR on awilliam-vfio/for-linus linus/master v6.15-rc7 next-20250516]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/lizhe-67-bytedance-com/vfio-type1-optimize-vfio_pin_pages_remote-for-huge-folio/20250520-150123
base: https://github.com/awilliam/linux-vfio.git next
patch link: https://lore.kernel.org/r/20250520070020.6181-1-lizhe.67%40bytedance.com
patch subject: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
config: i386-randconfig-013-20250521 (https://download.01.org/0day-ci/archive/20250521/202505210701.WY7sKXwU-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250521/202505210701.WY7sKXwU-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202505210701.WY7sKXwU-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from <command-line>:
drivers/vfio/vfio_iommu_type1.c: In function 'vfio_pin_pages_remote':
>> include/linux/compiler_types.h:542:45: error: call to '__compiletime_assert_499' declared with attribute error: min((long)batch->size, folio_nr_pages(folio) - (({ const struct page *__pg = (batch->pages[batch->offset]); int __sec = page_to_section(__pg); (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); }) - folio_pfn(folio))) signedness error
542 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:523:25: note: in definition of macro '__compiletime_assert'
523 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
542 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:93:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
93 | BUILD_BUG_ON_MSG(!__types_ok(ux, uy), \
| ^~~~~~~~~~~~~~~~
include/linux/minmax.h:98:9: note: in expansion of macro '__careful_cmp_once'
98 | __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:105:25: note: in expansion of macro '__careful_cmp'
105 | #define min(x, y) __careful_cmp(min, x, y)
| ^~~~~~~~~~~~~
drivers/vfio/vfio_iommu_type1.c:709:36: note: in expansion of macro 'min'
709 | nr_pages = min((long)batch->size, folio_nr_pages(folio) -
| ^~~
--
In file included from <command-line>:
vfio_iommu_type1.c: In function 'vfio_pin_pages_remote':
>> include/linux/compiler_types.h:542:45: error: call to '__compiletime_assert_499' declared with attribute error: min((long)batch->size, folio_nr_pages(folio) - (({ const struct page *__pg = (batch->pages[batch->offset]); int __sec = page_to_section(__pg); (unsigned long)(__pg - __section_mem_map_addr(__nr_to_section(__sec))); }) - folio_pfn(folio))) signedness error
542 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^
include/linux/compiler_types.h:523:25: note: in definition of macro '__compiletime_assert'
523 | prefix ## suffix(); \
| ^~~~~~
include/linux/compiler_types.h:542:9: note: in expansion of macro '_compiletime_assert'
542 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
| ^~~~~~~~~~~~~~~~~~~
include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:93:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
93 | BUILD_BUG_ON_MSG(!__types_ok(ux, uy), \
| ^~~~~~~~~~~~~~~~
include/linux/minmax.h:98:9: note: in expansion of macro '__careful_cmp_once'
98 | __careful_cmp_once(op, x, y, __UNIQUE_ID(x_), __UNIQUE_ID(y_))
| ^~~~~~~~~~~~~~~~~~
include/linux/minmax.h:105:25: note: in expansion of macro '__careful_cmp'
105 | #define min(x, y) __careful_cmp(min, x, y)
| ^~~~~~~~~~~~~
vfio_iommu_type1.c:709:36: note: in expansion of macro 'min'
709 | nr_pages = min((long)batch->size, folio_nr_pages(folio) -
| ^~~
vim +/__compiletime_assert_499 +542 include/linux/compiler_types.h
eb5c2d4b45e3d2 Will Deacon 2020-07-21 528
eb5c2d4b45e3d2 Will Deacon 2020-07-21 529 #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 530 __compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 531
eb5c2d4b45e3d2 Will Deacon 2020-07-21 532 /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21 533 * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 534 * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21 535 * @msg: a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21 536 *
eb5c2d4b45e3d2 Will Deacon 2020-07-21 537 * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 538 * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21 539 * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21 540 */
eb5c2d4b45e3d2 Will Deacon 2020-07-21 541 #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @542 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21 543
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 14:07 ` Alex Williamson
2025-05-20 16:59 ` Peter Xu
@ 2025-05-21 3:34 ` lizhe.67
1 sibling, 0 replies; 21+ messages in thread
From: lizhe.67 @ 2025-05-21 3:34 UTC (permalink / raw)
To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song, peterx
On Tue, 20 May 2025 08:07:19 -0600, alex.williamson@redhat.com wrote:
>> Before this patch:
>> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>>
>> After this patch:
>> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
>
>It looks like we're using the same numbers since the initial
>implementation, have these results changed?
Before releasing each version of the patch, I have run the performance
test, and the results have consistently been close to this value, so I
decided there was no need to update them. I will include the latest test
results in the next version.
>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>
>Appreciate the credit, this should probably be Co-developed-by: though.
>In general a sign-off is something that needs to be explicitly given.
Thank you for the reminder. I will correct this error in the next
version.
>> + /*
>> + * Note: The current nr_pages does not achieve the optimal
>> + * performance in scenarios where folio_nr_pages() exceeds
>> + * batch->capacity. It is anticipated that future enhancements
>> + * will address this limitation.
>> + */
>> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
>> + folio_page_idx(folio, batch->pages[batch->offset]));
>
>We should use min_t() here, otherwise it looks good to me.
Thank you once again for your review! I will correct this error in
the next version.
By the way, using min_t() also resolved the build error
reported by the kernel test robot[1].
[1]: https://lore.kernel.org/all/202505210701.WY7sKXwU-lkp@intel.com/
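For clarity, the computation would then look roughly like this (a sketch
of the planned change, not the final code):

        nr_pages = min_t(long, batch->size,
                         folio_nr_pages(folio) -
                         folio_page_idx(folio, batch->pages[batch->offset]));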
Thanks,
Zhe
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 17:38 ` David Hildenbrand
2025-05-20 22:21 ` Alex Williamson
@ 2025-05-21 3:45 ` lizhe.67
1 sibling, 0 replies; 21+ messages in thread
From: lizhe.67 @ 2025-05-21 3:45 UTC (permalink / raw)
To: david; +Cc: alex.williamson, kvm, linux-kernel, lizhe.67, muchun.song
On Tue, 20 May 2025 19:38:45 +0200, david@redhat.com wrote:
>Subject: "huge folio" -> "large folios"
>
>>
>> When vfio_pin_pages_remote() is called with a range of addresses that
>> includes huge folios, the function currently performs individual
>
>Similar, we call it a "large" f
Thank you for the reminder. I will correct this error in the
next version.
Thanks,
Zhe
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-20 22:21 ` Alex Williamson
@ 2025-05-21 6:35 ` David Hildenbrand
2025-05-21 15:40 ` Peter Xu
2025-05-21 16:55 ` Alex Williamson
0 siblings, 2 replies; 21+ messages in thread
From: David Hildenbrand @ 2025-05-21 6:35 UTC (permalink / raw)
To: Alex Williamson; +Cc: lizhe.67, kvm, linux-kernel, muchun.song
On 21.05.25 00:21, Alex Williamson wrote:
> On Tue, 20 May 2025 19:38:45 +0200
> David Hildenbrand <david@redhat.com> wrote:
>
>> On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
>>> From: Li Zhe <lizhe.67@bytedance.com>
>>
>> Subject: "huge folio" -> "large folios"
>>
>>>
>>> When vfio_pin_pages_remote() is called with a range of addresses that
>>> includes huge folios, the function currently performs individual
>>
>> Similar, we call it a "large" f
>>
>>> statistics counting operations for each page. This can lead to significant
>>> performance overheads, especially when dealing with large ranges of pages.
>>>
>>> This patch optimize this process by batching the statistics counting
>>> operations.
>>>
>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>>> address space has been mapped to physical memory using hugetlbfs with
>>> pagesize=2M.
>>>
>>> Before this patch:
>>> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>>>
>>> After this patch:
>>> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
>>>
>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>> Changelogs:
>>>
>>> v2->v3:
>>> - Code simplification.
>>> - Fix some issues in comments.
>>>
>>> v1->v2:
>>> - Fix some issues in comments and formatting.
>>> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
>>> - Move the processing logic for huge folio into the while(true) loop
>>> and use a variable with a default value of 1 to indicate the number
>>> of consecutive pages.
>>>
>>> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
>>> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
>>>
>>> drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
>>> 1 file changed, 37 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 0ac56072af9f..48f06ce0e290 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>>> /*
>>> * Helper Functions for host iova-pfn list
>>> */
>>> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>> +
>>> +/*
>>> + * Find the first vfio_pfn that overlapping the range
>>> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
>>> + */
>>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
>>> + dma_addr_t iova, unsigned long npage)
>>> {
>>> struct vfio_pfn *vpfn;
>>> struct rb_node *node = dma->pfn_list.rb_node;
>>> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>>>
>>> while (node) {
>>> vpfn = rb_entry(node, struct vfio_pfn, node);
>>>
>>> - if (iova < vpfn->iova)
>>> + if (end_iova <= vpfn->iova)
>>> node = node->rb_left;
>>> else if (iova > vpfn->iova)
>>> node = node->rb_right;
>>> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>> return NULL;
>>> }
>>>
>>> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>> +{
>>> + return vfio_find_vpfn_range(dma, iova, 1);
>>> +}
>>> +
>>> static void vfio_link_pfn(struct vfio_dma *dma,
>>> struct vfio_pfn *new)
>>> {
>>> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>> * and rsvd here, and therefore continues to use the batch.
>>> */
>>> while (true) {
>>> + struct folio *folio = page_folio(batch->pages[batch->offset]);
>>> + long nr_pages;
>>> +
>>> if (pfn != *pfn_base + pinned ||
>>> rsvd != is_invalid_reserved_pfn(pfn))
>>> goto out;
>>>
>>> + /*
>>> + * Note: The current nr_pages does not achieve the optimal
>>> + * performance in scenarios where folio_nr_pages() exceeds
>>> + * batch->capacity. It is anticipated that future enhancements
>>> + * will address this limitation.
>>> + */
>>> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
>>> + folio_page_idx(folio, batch->pages[batch->offset]));
>>> + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
>>> + nr_pages = 1;
>>
>>
>> You seem to assume that the batch really contains the consecutive pages
>> of that folio.
>
> I don't think we are. We're iterating through our batch of pages from
> GUP to find consecutive pfns. We use the page to get the pfn, the
> folio, and immediately above, the offset into the folio. batch->size is
> the remaining length of the page array from GUP and batch->offset is our
> current index into that array.
Let me try again using an example below ....
>
>> This is not the case if we obtained the pages through GUP and we have
>>
>> (a) A MAP_PRIVATE mapping
>>
>> (b) We span multiple different VMAs
>>
>>
>> Are we sure we can rule out (a) and (b)?
>>
>> A more future-proof approach would be at least looking whether the
>> pages/pfns are actually consecutive.
>
> The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> have the next consecutive pfn. Maybe I'm not catching the dependency
> you're seeing on consecutive pages, I think there isn't one unless
> we're somehow misusing folio_page_idx() to get the offset into the
> folio.
Assume our page tables look like this (case (a), a partially mapped
large pagecache folio mixed with COW'ed anonymous folios):
+ page[0] of folio 0
|              + COWed anonymous folio (folio 1)
|              |    + page[4] of folio 0
|              |    |
v              v    v
F0P0 F0P1 F0P2 F1P0 F0P4 F0P5 F0P6 F0P7
If we GUP that range, we get exactly these pages, except that the PFNs
are not consecutive, because F0P3 was replaced by another page. The
large folio is partially mapped.
Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0,
because we assume that slot in the batch would contain F0P3, where it
really contains F1P0?
--
Cheers,
David / dhildenb
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-21 6:35 ` David Hildenbrand
@ 2025-05-21 15:40 ` Peter Xu
2025-05-21 16:02 ` David Hildenbrand
2025-05-21 16:55 ` Alex Williamson
1 sibling, 1 reply; 21+ messages in thread
From: Peter Xu @ 2025-05-21 15:40 UTC (permalink / raw)
To: David Hildenbrand
Cc: Alex Williamson, lizhe.67, kvm, linux-kernel, muchun.song
On Wed, May 21, 2025 at 08:35:47AM +0200, David Hildenbrand wrote:
> On 21.05.25 00:21, Alex Williamson wrote:
> > On Tue, 20 May 2025 19:38:45 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >
> > > On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
> > > > From: Li Zhe <lizhe.67@bytedance.com>
> > >
> > > Subject: "huge folio" -> "large folios"
> > >
> > > >
> > > > When vfio_pin_pages_remote() is called with a range of addresses that
> > > > includes huge folios, the function currently performs individual
> > >
> > > Similar, we call it a "large" f
> > >
> > > > statistics counting operations for each page. This can lead to significant
> > > > performance overheads, especially when dealing with large ranges of pages.
> > > >
> > > > This patch optimize this process by batching the statistics counting
> > > > operations.
> > > >
> > > > The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> > > > obtained through trace-cmd, are as follows. In this case, the 8G virtual
> > > > address space has been mapped to physical memory using hugetlbfs with
> > > > pagesize=2M.
> > > >
> > > > Before this patch:
> > > > funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> > > >
> > > > After this patch:
> > > > funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> > > >
> > > > Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> > > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > > ---
> > > > Changelogs:
> > > >
> > > > v2->v3:
> > > > - Code simplification.
> > > > - Fix some issues in comments.
> > > >
> > > > v1->v2:
> > > > - Fix some issues in comments and formatting.
> > > > - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> > > > - Move the processing logic for huge folio into the while(true) loop
> > > > and use a variable with a default value of 1 to indicate the number
> > > > of consecutive pages.
> > > >
> > > > v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> > > > v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
> > > >
> > > > drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> > > > 1 file changed, 37 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > > index 0ac56072af9f..48f06ce0e290 100644
> > > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > > @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> > > > /*
> > > > * Helper Functions for host iova-pfn list
> > > > */
> > > > -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +
> > > > +/*
> > > > + * Find the first vfio_pfn that overlapping the range
> > > > + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> > > > + */
> > > > +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> > > > + dma_addr_t iova, unsigned long npage)
> > > > {
> > > > struct vfio_pfn *vpfn;
> > > > struct rb_node *node = dma->pfn_list.rb_node;
> > > > + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> > > > while (node) {
> > > > vpfn = rb_entry(node, struct vfio_pfn, node);
> > > > - if (iova < vpfn->iova)
> > > > + if (end_iova <= vpfn->iova)
> > > > node = node->rb_left;
> > > > else if (iova > vpfn->iova)
> > > > node = node->rb_right;
> > > > @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > return NULL;
> > > > }
> > > > +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> > > > +{
> > > > + return vfio_find_vpfn_range(dma, iova, 1);
> > > > +}
> > > > +
> > > > static void vfio_link_pfn(struct vfio_dma *dma,
> > > > struct vfio_pfn *new)
> > > > {
> > > > @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> > > > * and rsvd here, and therefore continues to use the batch.
> > > > */
> > > > while (true) {
> > > > + struct folio *folio = page_folio(batch->pages[batch->offset]);
> > > > + long nr_pages;
> > > > +
> > > > if (pfn != *pfn_base + pinned ||
> > > > rsvd != is_invalid_reserved_pfn(pfn))
> > > > goto out;
> > > > + /*
> > > > + * Note: The current nr_pages does not achieve the optimal
> > > > + * performance in scenarios where folio_nr_pages() exceeds
> > > > + * batch->capacity. It is anticipated that future enhancements
> > > > + * will address this limitation.
> > > > + */
> > > > + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> > > > + folio_page_idx(folio, batch->pages[batch->offset]));
> > > > + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> > > > + nr_pages = 1;
> > >
> > >
> > > You seem to assume that the batch really contains the consecutive pages
> > > of that folio.
> >
> > I don't think we are. We're iterating through our batch of pages from
> > GUP to find consecutive pfns. We use the page to get the pfn, the
> > folio, and immediately above, the offset into the folio. batch->size is
> > the remaining length of the page array from GUP and batch->offset is our
> > current index into that array.
>
> Let me try again using an example below ....
>
> > > This is not the case if we obtained the pages through GUP and we have
> > >
> > > (a) A MAP_PRIVATE mapping
> > >
> > > (b) We span multiple different VMAs
> > >
> > >
> > > Are we sure we can rule out (a) and (b)?
> > >
> > > A more future-proof approach would be at least looking whether the
> > > pages/pfns are actually consecutive.
> >
> > The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> > have the next consecutive pfn. Maybe I'm not catching the dependency
> > you're seeing on consecutive pages, I think there isn't one unless
> > we're somehow misusing folio_page_idx() to get the offset into the
> > folio.
>
> Assume our page tables look like this (case (a), a partially mapped large
> pagecache folio mixed with COW'ed anonymous folios):
>
> + page[0] of folio 0
> | + COWed anonymous folio (folio 1)
> | | + page[4] of folio 0
> | | |
> v v v
> F0P0 F0P1 F0P2 F1P0 F0P4 P0P5 F0P6 F0P7
>
> If we GUP that range, we get exactly these pages, except that the PFNs are
> not consecutive, because F0P3 was replaced by another page. The large folio
> is partially mapped.
>
>
> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
> because we assume the batch would contain F1P0, where it really contains
> F0P4?
Looks like a real issue (even if unlikely setup)..
Before a next-gen GUP.. Maybe we should stick with memfd_pin_folios(),
that'll require mmap read lock taken though when seeing a hugetlb folio, so
it'll be a fast path to try to pin hugetlb-only vmas.
VFIO will also need to check in the fast path on: (1) double check it's a
hugetlb VMA (in case vma layout changed after GUP and before mmap read
lock), (2) VMA boundary check, making sure it's not out-of-bound (3) stick
with VM_SHARED only for now (I don't think anyone uses MAP_PRIVATE anyway
that will also care about how fast it pins..). Then 1G will also work
there..
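Roughly something like this (a very rough, untested sketch; names and the
exact memfd_pin_folios() calling convention are from memory and should be
double-checked):

/* Sketch: try to pin a hugetlb-backed, VM_SHARED range via memfd_pin_folios(). */
static long vfio_pin_hugetlb_fast(struct mm_struct *mm, unsigned long vaddr,
                                  long npage, struct folio **folios,
                                  unsigned int max_folios)
{
        struct vm_area_struct *vma;
        pgoff_t off;
        loff_t start, end;
        long ret = -EOPNOTSUPP;

        mmap_read_lock(mm);
        vma = vma_lookup(mm, vaddr);
        /* (1) double check it is (still) a hugetlb VMA, (3) VM_SHARED only */
        if (!vma || !is_vm_hugetlb_page(vma) || !(vma->vm_flags & VM_SHARED))
                goto unlock;
        /* (2) stay within the VMA boundary */
        npage = min_t(long, npage, (vma->vm_end - vaddr) >> PAGE_SHIFT);

        start = (loff_t)linear_page_index(vma, vaddr) << PAGE_SHIFT;
        end = start + ((loff_t)npage << PAGE_SHIFT) - 1;  /* end is inclusive, IIRC */
        ret = memfd_pin_folios(vma->vm_file, start, end, folios, max_folios, &off);
unlock:
        mmap_read_unlock(mm);
        return ret;
}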
Thanks,
--
Peter Xu
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-21 15:40 ` Peter Xu
@ 2025-05-21 16:02 ` David Hildenbrand
0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2025-05-21 16:02 UTC (permalink / raw)
To: Peter Xu; +Cc: Alex Williamson, lizhe.67, kvm, linux-kernel, muchun.song
On 21.05.25 17:40, Peter Xu wrote:
> On Wed, May 21, 2025 at 08:35:47AM +0200, David Hildenbrand wrote:
>> On 21.05.25 00:21, Alex Williamson wrote:
>>> On Tue, 20 May 2025 19:38:45 +0200
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
>>>>> From: Li Zhe <lizhe.67@bytedance.com>
>>>>
>>>> Subject: "huge folio" -> "large folios"
>>>>
>>>>>
>>>>> When vfio_pin_pages_remote() is called with a range of addresses that
>>>>> includes huge folios, the function currently performs individual
>>>>
>>>> Similar, we call it a "large" f
>>>>
>>>>> statistics counting operations for each page. This can lead to significant
>>>>> performance overheads, especially when dealing with large ranges of pages.
>>>>>
>>>>> This patch optimize this process by batching the statistics counting
>>>>> operations.
>>>>>
>>>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
>>>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
>>>>> address space has been mapped to physical memory using hugetlbfs with
>>>>> pagesize=2M.
>>>>>
>>>>> Before this patch:
>>>>> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
>>>>>
>>>>> After this patch:
>>>>> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
>>>>>
>>>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> ---
>>>>> Changelogs:
>>>>>
>>>>> v2->v3:
>>>>> - Code simplification.
>>>>> - Fix some issues in comments.
>>>>>
>>>>> v1->v2:
>>>>> - Fix some issues in comments and formatting.
>>>>> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
>>>>> - Move the processing logic for huge folio into the while(true) loop
>>>>> and use a variable with a default value of 1 to indicate the number
>>>>> of consecutive pages.
>>>>>
>>>>> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
>>>>> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
>>>>>
>>>>> drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
>>>>> 1 file changed, 37 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 0ac56072af9f..48f06ce0e290 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
>>>>> /*
>>>>> * Helper Functions for host iova-pfn list
>>>>> */
>>>>> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>>>> +
>>>>> +/*
>>>>> + * Find the first vfio_pfn that overlapping the range
>>>>> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
>>>>> + */
>>>>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
>>>>> + dma_addr_t iova, unsigned long npage)
>>>>> {
>>>>> struct vfio_pfn *vpfn;
>>>>> struct rb_node *node = dma->pfn_list.rb_node;
>>>>> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
>>>>> while (node) {
>>>>> vpfn = rb_entry(node, struct vfio_pfn, node);
>>>>> - if (iova < vpfn->iova)
>>>>> + if (end_iova <= vpfn->iova)
>>>>> node = node->rb_left;
>>>>> else if (iova > vpfn->iova)
>>>>> node = node->rb_right;
>>>>> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>>>> return NULL;
>>>>> }
>>>>> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>>>>> +{
>>>>> + return vfio_find_vpfn_range(dma, iova, 1);
>>>>> +}
>>>>> +
>>>>> static void vfio_link_pfn(struct vfio_dma *dma,
>>>>> struct vfio_pfn *new)
>>>>> {
>>>>> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>>>>> * and rsvd here, and therefore continues to use the batch.
>>>>> */
>>>>> while (true) {
>>>>> + struct folio *folio = page_folio(batch->pages[batch->offset]);
>>>>> + long nr_pages;
>>>>> +
>>>>> if (pfn != *pfn_base + pinned ||
>>>>> rsvd != is_invalid_reserved_pfn(pfn))
>>>>> goto out;
>>>>> + /*
>>>>> + * Note: The current nr_pages does not achieve the optimal
>>>>> + * performance in scenarios where folio_nr_pages() exceeds
>>>>> + * batch->capacity. It is anticipated that future enhancements
>>>>> + * will address this limitation.
>>>>> + */
>>>>> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
>>>>> + folio_page_idx(folio, batch->pages[batch->offset]));
>>>>> + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
>>>>> + nr_pages = 1;
>>>>
>>>>
>>>> You seem to assume that the batch really contains the consecutive pages
>>>> of that folio.
>>>
>>> I don't think we are. We're iterating through our batch of pages from
>>> GUP to find consecutive pfns. We use the page to get the pfn, the
>>> folio, and immediately above, the offset into the folio. batch->size is
>>> the remaining length of the page array from GUP and batch->offset is our
>>> current index into that array.
>>
>> Let me try again using an example below ....
>>
>>>> This is not the case if we obtained the pages through GUP and we have
>>>>
>>>> (a) A MAP_PRIVATE mapping
>>>>
>>>> (b) We span multiple different VMAs
>>>>
>>>>
>>>> Are we sure we can rule out (a) and (b)?
>>>>
>>>> A more future-proof approach would be at least looking whether the
>>>> pages/pfns are actually consecutive.
>>>
>>> The unmodified (pfn != *pfn_base + pinned) test is where we verify we
>>> have the next consecutive pfn. Maybe I'm not catching the dependency
>>> you're seeing on consecutive pages, I think there isn't one unless
>>> we're somehow misusing folio_page_idx() to get the offset into the
>>> folio.
>>
>> Assume our page tables look like this (case (a), a partially mapped large
>> pagecache folio mixed with COW'ed anonymous folios):
>>
>> + page[0] of folio 0
>> | + COWed anonymous folio (folio 1)
>> | | + page[4] of folio 0
>> | | |
>> v v v
>> F0P0 F0P1 F0P2 F1P0 F0P4 P0P5 F0P6 F0P7
>>
>> If we GUP that range, we get exactly these pages, except that the PFNs are
>> not consecutive, because F0P3 was replaced by another page. The large folio
>> is partially mapped.
>>
>>
>> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
>> because we assume the batch would contain F1P0, where it really contains
>> F0P4?
>
> Looks like a real issue (even if unlikely setup)..
There are other ways to achieve that (e.g., simply mapping parts of
multiple files etc.).
Of course, for hugetlb (as you note below) such partial folio mappings
are impossible.
--
Cheers,
David / dhildenb
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-21 6:35 ` David Hildenbrand
2025-05-21 15:40 ` Peter Xu
@ 2025-05-21 16:55 ` Alex Williamson
2025-05-26 20:19 ` Jason Gunthorpe
1 sibling, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2025-05-21 16:55 UTC (permalink / raw)
To: David Hildenbrand; +Cc: lizhe.67, kvm, linux-kernel, muchun.song
On Wed, 21 May 2025 08:35:47 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 21.05.25 00:21, Alex Williamson wrote:
> > On Tue, 20 May 2025 19:38:45 +0200
> > David Hildenbrand <david@redhat.com> wrote:
> >
> >> On 20.05.25 09:00, lizhe.67@bytedance.com wrote:
> >>> From: Li Zhe <lizhe.67@bytedance.com>
> >>
> >> Subject: "huge folio" -> "large folios"
> >>
> >>>
> >>> When vfio_pin_pages_remote() is called with a range of addresses that
> >>> includes huge folios, the function currently performs individual
> >>
> >> Similar, we call it a "large" folio.
> >>
> >>> statistics counting operations for each page. This can lead to significant
> >>> performance overheads, especially when dealing with large ranges of pages.
> >>>
> >>> This patch optimizes this process by batching the statistics counting
> >>> operations.
> >>>
> >>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> >>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> >>> address space has been mapped to physical memory using hugetlbfs with
> >>> pagesize=2M.
> >>>
> >>> Before this patch:
> >>> funcgraph_entry: # 33813.703 us | vfio_pin_map_dma();
> >>>
> >>> After this patch:
> >>> funcgraph_entry: # 15635.055 us | vfio_pin_map_dma();
> >>>
> >>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>> Changelogs:
> >>>
> >>> v2->v3:
> >>> - Code simplification.
> >>> - Fix some issues in comments.
> >>>
> >>> v1->v2:
> >>> - Fix some issues in comments and formatting.
> >>> - Consolidate vfio_find_vpfn_range() and vfio_find_vpfn().
> >>> - Move the processing logic for huge folio into the while(true) loop
> >>> and use a variable with a default value of 1 to indicate the number
> >>> of consecutive pages.
> >>>
> >>> v2 patch: https://lore.kernel.org/all/20250519070419.25827-1-lizhe.67@bytedance.com/
> >>> v1 patch: https://lore.kernel.org/all/20250513035730.96387-1-lizhe.67@bytedance.com/
> >>>
> >>> drivers/vfio/vfio_iommu_type1.c | 48 +++++++++++++++++++++++++--------
> >>> 1 file changed, 37 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index 0ac56072af9f..48f06ce0e290 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -319,15 +319,22 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
> >>> /*
> >>> * Helper Functions for host iova-pfn list
> >>> */
> >>> -static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> >>> +
> >>> +/*
> >>> + * Find the first vfio_pfn that overlapping the range
> >>> + * [iova, iova + PAGE_SIZE * npage) in rb tree.
> >>> + */
> >>> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> >>> + dma_addr_t iova, unsigned long npage)
> >>> {
> >>> struct vfio_pfn *vpfn;
> >>> struct rb_node *node = dma->pfn_list.rb_node;
> >>> + dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> >>>
> >>> while (node) {
> >>> vpfn = rb_entry(node, struct vfio_pfn, node);
> >>>
> >>> - if (iova < vpfn->iova)
> >>> + if (end_iova <= vpfn->iova)
> >>> node = node->rb_left;
> >>> else if (iova > vpfn->iova)
> >>> node = node->rb_right;
> >>> @@ -337,6 +344,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> >>> return NULL;
> >>> }
> >>>
> >>> +static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
> >>> +{
> >>> + return vfio_find_vpfn_range(dma, iova, 1);
> >>> +}
> >>> +
> >>> static void vfio_link_pfn(struct vfio_dma *dma,
> >>> struct vfio_pfn *new)
> >>> {
> >>> @@ -681,32 +693,46 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
> >>> * and rsvd here, and therefore continues to use the batch.
> >>> */
> >>> while (true) {
> >>> + struct folio *folio = page_folio(batch->pages[batch->offset]);
> >>> + long nr_pages;
> >>> +
> >>> if (pfn != *pfn_base + pinned ||
> >>> rsvd != is_invalid_reserved_pfn(pfn))
> >>> goto out;
> >>>
> >>> + /*
> >>> + * Note: The current nr_pages does not achieve the optimal
> >>> + * performance in scenarios where folio_nr_pages() exceeds
> >>> + * batch->capacity. It is anticipated that future enhancements
> >>> + * will address this limitation.
> >>> + */
> >>> + nr_pages = min((long)batch->size, folio_nr_pages(folio) -
> >>> + folio_page_idx(folio, batch->pages[batch->offset]));
> >>> + if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> >>> + nr_pages = 1;
> >>
> >>
> >> You seem to assume that the batch really contains the consecutive pages
> >> of that folio.
> >
> > I don't think we are. We're iterating through our batch of pages from
> > GUP to find consecutive pfns. We use the page to get the pfn, the
> > folio, and immediately above, the offset into the folio. batch->size is
> > the remaining length of the page array from GUP and batch->offset is our
> > current index into that array.
>
> Let me try again using an example below ....
>
> >
> >> This is not the case if we obtained the pages through GUP and we have
> >>
> >> (a) A MAP_PRIVATE mapping
> >>
> >> (b) We span multiple different VMAs
> >>
> >>
> >> Are we sure we can rule out (a) and (b)?
> >>
> >> A more future-proof approach would be at least looking whether the
> >> pages/pfns are actually consecutive.
> >
> > The unmodified (pfn != *pfn_base + pinned) test is where we verify we
> > have the next consecutive pfn. Maybe I'm not catching the dependency
> > you're seeing on consecutive pages, I think there isn't one unless
> > we're somehow misusing folio_page_idx() to get the offset into the
> > folio.
>
> Assume our page tables look like this (case (a), a partially mapped
> large pagecache folio mixed with COW'ed anonymous folios):
>
> + page[0] of folio 0
> | + COWed anonymous folio (folio 1)
> | | + page[4] of folio 0
> | | |
> v v v
> F0P0 F0P1 F0P2 F1P0 F0P4 F0P5 F0P6 F0P7
>
> If we GUP that range, we get exactly these pages, except that the PFNs
> are not consecutive, because F0P3 was replaced by another page. The
> large folio is partially mapped.
Ok, I asked the wrong question; this seemed like a good optimization,
but I think you've identified the key misunderstanding that makes it
appear that way. Thank you.
This optimization does rely on an assumption of consecutive _pages_ in
the array returned from GUP. If we cannot assume the next array index
is the next page from the same folio (which afaict we have no basis to
do), we cannot use the folio as the basis for any optimization.
I expect this assumption works for effectively all QEMU use cases, but
regardless, it's bogus for the general case. I guess that using
something like memfd would provide that guarantee of the linear mapping
in user virtual address space. Without memfd we'd need to be able to
get at the extent of the actual page table entry, like we added for
pfnmap.
> Maybe I misunderstand that code, but wouldn't we just "jump" over F1P0
> because we assume the batch would contain F1P0, where it really contains
> F0P4?
This optimization would incorrectly assume that consecutive GUP page
array indexes hold consecutive pages of the same folio, and it'd end up
mapping page_to_pfn(F0P3) (or really page_to_pfn(F0P0) + 3) rather than
page_to_pfn(F1P0) at that index.
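Just to illustrate the point (this is not the posted patch), a
future-proof variant along the lines David suggests would have to look
at the GUP array itself before trusting a folio-sized step; a rough
sketch, with a hypothetical helper name:

        /*
         * Hypothetical sketch only; relies on page_folio()/nth_page()
         * from <linux/mm.h>. Count how many entries in the GUP page
         * array really are the next pages of the same folio, so a hole
         * like F0P3 -> F1P0 above stops the batch after a single page.
         */
        static long contig_folio_pages(struct page **pages, long offset,
                                       long remaining)
        {
                struct folio *folio = page_folio(pages[offset]);
                long i;

                for (i = 1; i < remaining; i++) {
                        if (pages[offset + i] != nth_page(pages[offset], i) ||
                            page_folio(pages[offset + i]) != folio)
                                break;
                }
                return i;
        }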
Thanks for catching this!
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-21 16:55 ` Alex Williamson
@ 2025-05-26 20:19 ` Jason Gunthorpe
2025-05-27 19:52 ` Alex Williamson
0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-05-26 20:19 UTC (permalink / raw)
To: Alex Williamson
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Wed, May 21, 2025 at 10:55:12AM -0600, Alex Williamson wrote:
> This optimization does rely on an assumption of consecutive _pages_ in
> the array returned from GUP. If we cannot assume the next array index
> is the next page from the same folio (which afaict we have no basis to
> do), we cannot use the folio as the basis for any optimization.
Right! I was wondering why this code was messing with folios; it
really can't learn anything from folios. The only advantage to folios
is during unpinning where we can batch the atomics for all the folio
sub pages, which the core mm helpers are doing.
Which brings me back to my first remark - this is all solved in
iommufd, in a much better way :( I continue to think we should just
leave this type1 stuff as-is upstream and encourage people to move
forward.
Lots of CSPs are running iommufd now. There is a commonly used OOT
patch to add the insecure P2P support like VFIO. I know lots of folks
have backported iommufd.. No idea about libvirt, but you can run it in
compatibility mode and then you don't need to change libvirt.
Jason
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-26 20:19 ` Jason Gunthorpe
@ 2025-05-27 19:52 ` Alex Williamson
2025-05-27 23:46 ` Jason Gunthorpe
0 siblings, 1 reply; 21+ messages in thread
From: Alex Williamson @ 2025-05-27 19:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Mon, 26 May 2025 17:19:55 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, May 21, 2025 at 10:55:12AM -0600, Alex Williamson wrote:
>
> > This optimization does rely on an assumption of consecutive _pages_ in
> > the array returned from GUP. If we cannot assume the next array index
> > is the next page from the same folio (which afaict we have no basis to
> > do), we cannot use the folio as the basis for any optimization.
>
> Right! I was wondering why this code was messing with folios, it
> really can't learn anything from folios. The only advantage to folios
> is during unpinning where we can batch the atomics for all the folio
> sub pages, which the core mm helpers are doing.
I *think* all we're gaining is that comparing page pointers is
slightly more lightweight than iterating page_to_pfn() and that we can
skip checking whether we've crossed an inflection in pages considered
reserved within a folio. I'm curious to see to what extent this
optimization is still worthwhile.
> Which brings me back to my first remark - this is all solved in
> iommufd, in a much better way :( I continue to think we should just
> leave this type1 stuff as-is upstream and encourage people to move
> forward.
>
> Lots of CSPs are running iommufd now. There is a commonly used OOT
> patch to add the insecure P2P support like VFIO. I know lots of folks
> have backported iommufd.. No idea about libvirt, but you can run it in
> compatibility mode and then you don't need to change libvirt.
For distributions that don't have an upstream first policy, sure, they
can patch whatever they like. I can't recommend that solution though.
Otherwise the problem with compatibility mode is that it's a compile
time choice. A single kernel binary cannot interchangeably provide
either P2P DMA with legacy vfio or better IOMMUFD improvements without
P2P DMA. If libvirt had IOMMUFD support, XML could specify the
interface on a per-device basis, and maybe even allow opt-in to a
system-wide policy choice. Thanks,
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-27 19:52 ` Alex Williamson
@ 2025-05-27 23:46 ` Jason Gunthorpe
2025-05-28 20:09 ` Alex Williamson
0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-05-27 23:46 UTC (permalink / raw)
To: Alex Williamson
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Tue, May 27, 2025 at 01:52:52PM -0600, Alex Williamson wrote:
> > Lots of CSPs are running iommufd now. There is a commonly used OOT
> > patch to add the insecure P2P support like VFIO. I know lots of folks
> > have backported iommufd.. No idea about libvirt, but you can run it in
> > compatibility mode and then you don't need to change libvirt.
>
> For distributions that don't have an upstream first policy, sure, they
> can patch whatever they like. I can't recommend that solution though.
I appreciate that, and we are working on it. The first round of
patches for DMA API improvements that Christoph asked for was sent as
a PR yesterday.
> Otherwise the problem with compatibility mode is that it's a compile
> time choice.
The compile time choice is not the compatibility mode.
Any iommufd, even if opened from /dev/iommu, is usable as a VFIO
container in the classic group based ioctls.
The group path in VFIO calls vfio_group_ioctl_set_container() ->
iommufd_ctx_from_file() which works with iommufd from any source.
The type 1 emulation ioctls are also always available on any iommufd.
After set container, VFIO does iommufd_vfio_compat_ioas_create() to
set up the default compatibility stuff.
All the compile time option does is replace /dev/vfio/vfio with
/dev/iommu, but they have *exactly* the same fops:
static struct miscdevice iommu_misc_dev = {
.minor = MISC_DYNAMIC_MINOR,
.name = "iommu",
.fops = &iommufd_fops,
static struct miscdevice vfio_misc_dev = {
.minor = VFIO_MINOR,
.name = "vfio",
.fops = &iommufd_fops,
So you can have libvirt open /dev/iommu, or you can have the admin
symlink /dev/iommu to /dev/vfio/vfio and opt in on a case by case
basis.
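For illustration, a minimal userspace sketch of that path (the helper
name and the group path are just examples, error handling omitted):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/vfio.h>

        /* Classic group-based flow, but the "container" fd is an iommufd. */
        static int attach_group_to_iommufd(const char *group_path)
        {
                int iommufd = open("/dev/iommu", O_RDWR); /* or /dev/vfio/vfio via the symlink */
                int group = open(group_path, O_RDWR);     /* e.g. "/dev/vfio/26" */

                ioctl(group, VFIO_GROUP_SET_CONTAINER, &iommufd);
                ioctl(iommufd, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);
                return iommufd;
        }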
The compile time choice is really just a way to make testing easier,
and down the road, if a distro decides they don't want to support both
code bases, they can choose to disable the type 1 code entirely and
still be uAPI compatible, but I think that is down the road a ways
still.
> A single kernel binary cannot interchangeably provide
> either P2P DMA with legacy vfio or better IOMMUFD improvements without
> P2P DMA.
See above, it can, and it was deliberately made easy to do without
having to change any applications.
The idea was you can sort of incrementally decide which things to move
over. For instance you can keep all the type 1 code and vfio
group/container stuff unchanged but use a combination of
IOMMU_VFIO_IOAS_GET and then IOMMUFD_CMD_IOAS_MAP_FILE to map a memfd.
Jason
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-27 23:46 ` Jason Gunthorpe
@ 2025-05-28 20:09 ` Alex Williamson
2025-05-29 0:51 ` Jason Gunthorpe
2025-05-29 4:31 ` lizhe.67
0 siblings, 2 replies; 21+ messages in thread
From: Alex Williamson @ 2025-05-28 20:09 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Tue, 27 May 2025 20:46:27 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, May 27, 2025 at 01:52:52PM -0600, Alex Williamson wrote:
>
> > > Lots of CSPs are running iommufd now. There is a commonly used OOT
> > > patch to add the insecure P2P support like VFIO. I know lots of folks
> > > have backported iommufd.. No idea about libvirt, but you can run it in
> > > compatibility mode and then you don't need to change libvirt.
> >
> > For distributions that don't have an upstream first policy, sure, they
> > can patch whatever they like. I can't recommend that solution though.
>
> I appreciate that, and we are working on it.. The first round of
> patches for DMA API improvements that Christoph asked for were sent as
> a PR yesterday.
>
> > Otherwise the problem with compatibility mode is that it's a compile
> > time choice.
>
> The compile time choice is not the compatability mode.
>
> Any iommufd, even if opened from /dev/iommu, is usable as a VFIO
> container in the classic group based ioctls.
>
> The group path in VFIO calls vfio_group_ioctl_set_container() ->
> iommufd_ctx_from_file() which works with iommufd from any source.
>
> The type 1 emulation ioctls are also always available on any iommufd.
> After set container VFIO does iommufd_vfio_compat_ioas_create() to
> setup the default compatability stuff.
>
> All the compile time option does is replace /dev/vfio/vfio with
> /dev/iommu, but they have *exactly* the same fops:
>
> static struct miscdevice iommu_misc_dev = {
> .minor = MISC_DYNAMIC_MINOR,
> .name = "iommu",
> .fops = &iommufd_fops,
>
> static struct miscdevice vfio_misc_dev = {
> .minor = VFIO_MINOR,
> .name = "vfio",
> .fops = &iommufd_fops,
>
> So you can have libvirt open /dev/iommu, or you can have the admin
> symlink /dev/iommu to /dev/vfio/vfio and opt in on a case by case
> basis.
Yes, I'd forgotten we added this. It's QEMU opening /dev/vfio/vfio and
QEMU already has native iommufd support, so a per VM hack could be done
via qemu:args or a QEMU wrapper script to instantiate an iommufd object
in the VM xml, or as noted, a system-wide change could be done
transparently via 'ln -sf /dev/iommu /dev/vfio/vfio'.
To be fair to libvirt, we'd really like libvirt to make use of iommufd
whenever it's available, but without feature parity this would break
users. And without feature parity, it's not clear how libvirt should
probe for feature parity. Things get a lot easier for libvirt if we
can switch the default at a point where we expect no regressions.
> The compile time choice is really just a way to make testing easier
> and down the road if a distro decides they don't want to support both
> code bases then can choose to disable the type 1 code entirely and
> still be uAPI compatible, but I think that is down the road a ways
> still.
Yep.
> > A single kernel binary cannot interchangeably provide
> > either P2P DMA with legacy vfio or better IOMMUFD improvements without
> > P2P DMA.
>
> See above, it can, and it was deliberately made easy to do without
> having to change any applications.
>
> The idea was you can sort of incrementally decide which things to move
> over. For instance you can keep all the type 1 code and vfio
> group/container stuff unchanged but use a combination of
> IOMMU_VFIO_IOAS_GET and then IOMMUFD_CMD_IOAS_MAP_FILE to map a memfd.
Right, sorry, it'd slipped my mind that we'd created the "soft"
compatibility mode too.
Zhe, so if you have no dependencies on P2P DMA within your device
assignment VMs, the options above may be useful or at least a data
point for comparison of type1 vs IOMMUFD performance. Thanks,
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-28 20:09 ` Alex Williamson
@ 2025-05-29 0:51 ` Jason Gunthorpe
2025-05-29 2:58 ` Alex Williamson
2025-05-29 4:31 ` lizhe.67
1 sibling, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2025-05-29 0:51 UTC (permalink / raw)
To: Alex Williamson
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Wed, May 28, 2025 at 02:09:41PM -0600, Alex Williamson wrote:
> To be fair to libvirt, we'd really like libvirt to make use of iommufd
> whenever it's available, but without feature parity this would break
> users.
If people ask there should be no issue with making API functionality
discoverable with a query command of some kind. A lot of new stuff is
already discoverable by invoking an ioctl with bogus arguments and
checking for EOPNOTSUPP/ENOTTY.
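As a rough sketch of that probing idiom (the helper name is made up,
and it assumes a zeroed, undersized payload counts as "bogus" for the
command being probed):

        #include <errno.h>
        #include <stdbool.h>
        #include <sys/ioctl.h>

        /*
         * Sketch only: an ioctl number the kernel has never heard of fails
         * with ENOTTY, while a known command typically rejects the bogus
         * payload with EINVAL/EOPNOTSUPP, which is enough to distinguish
         * "kernel too old" from "bad arguments".
         */
        static bool ioctl_cmd_known(int fd, unsigned long cmd)
        {
                char bogus[128] = { 0 };

                if (ioctl(fd, cmd, bogus) == 0)
                        return true;
                return errno != ENOTTY;
        }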
But most likely P2P will use the same ioctl as memfd so it will not
work that way.
So for now libvirt could assume no P2P support in iommufd. A simple
algorithm would be to look for more than 1 VFIO device. Or add an XML
"disable P2P" option, which is a useful thing anyhow.
> Zhe, so if you have no dependencies on P2P DMA within your device
> assignment VMs, the options above may be useful or at least a data
> point for comparison of type1 vs IOMMUFD performance. Thanks,
Even if Zhe does have P2P DMA I have a feeling the OOT P2P patch may
be workable <shrug>
Jason
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-29 0:51 ` Jason Gunthorpe
@ 2025-05-29 2:58 ` Alex Williamson
0 siblings, 0 replies; 21+ messages in thread
From: Alex Williamson @ 2025-05-29 2:58 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: David Hildenbrand, lizhe.67, kvm, linux-kernel, muchun.song
On Wed, 28 May 2025 21:51:47 -0300
Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Wed, May 28, 2025 at 02:09:41PM -0600, Alex Williamson wrote:
>
> > To be fair to libvirt, we'd really like libvirt to make use of iommufd
> > whenever it's available, but without feature parity this would break
> > users.
>
> If people ask there should be no issue with making API functionality
> discoverable with a query command of some kind. Alot of new stuff is
> already discoverable by invoking an ioctl with bogus arguments and
> checking for EOPNOTSUPP/ENOTTY.
>
> But most likely P2P will use the same ioctl as memfd so it will not
> work that way.
>
> So for now libvirt could assume no P2P support in iommufd. A simple
> algorithm would be to look for more than 1 VFIO device. Or add an xml
> "disable P2P" which is a useful thing anyhow.
Hotplug is always a problem if we make assumptions based on device
counts. Thanks,
Alex
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio
2025-05-28 20:09 ` Alex Williamson
2025-05-29 0:51 ` Jason Gunthorpe
@ 2025-05-29 4:31 ` lizhe.67
1 sibling, 0 replies; 21+ messages in thread
From: lizhe.67 @ 2025-05-29 4:31 UTC (permalink / raw)
To: alex.williamson, jgg; +Cc: david, kvm, linux-kernel, lizhe.67, muchun.song
On Wed, 28 May 2025 14:09:41 -0600, alex.williamson@redhat.com wrote:
> On Tue, 27 May 2025 20:46:27 -0300
> Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> > On Tue, May 27, 2025 at 01:52:52PM -0600, Alex Williamson wrote:
> >
> > > > Lots of CSPs are running iommufd now. There is a commonly used OOT
> > > > patch to add the insecure P2P support like VFIO. I know lots of folks
> > > > have backported iommufd.. No idea about libvirt, but you can run it in
> > > > compatibility mode and then you don't need to change libvirt.
> > >
> > > For distributions that don't have an upstream first policy, sure, they
> > > can patch whatever they like. I can't recommend that solution though.
> >
> > I appreciate that, and we are working on it.. The first round of
> > patches for DMA API improvements that Christoph asked for were sent as
> > a PR yesterday.
> >
> > > Otherwise the problem with compatibility mode is that it's a compile
> > > time choice.
> >
> > The compile time choice is not the compatability mode.
> >
> > Any iommufd, even if opened from /dev/iommu, is usable as a VFIO
> > container in the classic group based ioctls.
> >
> > The group path in VFIO calls vfio_group_ioctl_set_container() ->
> > iommufd_ctx_from_file() which works with iommufd from any source.
> >
> > The type 1 emulation ioctls are also always available on any iommufd.
> > After set container VFIO does iommufd_vfio_compat_ioas_create() to
> > setup the default compatability stuff.
> >
> > All the compile time option does is replace /dev/vfio/vfio with
> > /dev/iommu, but they have *exactly* the same fops:
> >
> > static struct miscdevice iommu_misc_dev = {
> > .minor = MISC_DYNAMIC_MINOR,
> > .name = "iommu",
> > .fops = &iommufd_fops,
> >
> > static struct miscdevice vfio_misc_dev = {
> > .minor = VFIO_MINOR,
> > .name = "vfio",
> > .fops = &iommufd_fops,
> >
> > So you can have libvirt open /dev/iommu, or you can have the admin
> > symlink /dev/iommu to /dev/vfio/vfio and opt in on a case by case
> > basis.
>
> Yes, I'd forgotten we added this. It's QEMU opening /dev/vfio/vfio and
> QEMU already has native iommufd support, so a per VM hack could be done
> via qemu:args or a QEMU wrapper script to instantiate an iommufd object
> in the VM xml, or as noted, a system-wide change could be done
> transparently via 'ln -sf /dev/iommu /dev/vfio/vfio'.
>
> To be fair to libvirt, we'd really like libvirt to make use of iommufd
> whenever it's available, but without feature parity this would break
> users. And without feature parity, it's not clear how libvirt should
> probe for feature parity. Things get a lot easier for libvirt if we
> can switch the default at a point where we expect no regressions.
>
> > The compile time choice is really just a way to make testing easier
> > and down the road if a distro decides they don't want to support both
> > code bases then can choose to disable the type 1 code entirely and
> > still be uAPI compatible, but I think that is down the road a ways
> > still.
>
> Yep.
>
> > > A single kernel binary cannot interchangeably provide
> > > either P2P DMA with legacy vfio or better IOMMUFD improvements without
> > > P2P DMA.
> >
> > See above, it can, and it was deliberately made easy to do without
> > having to change any applications.
> >
> > The idea was you can sort of incrementally decide which things to move
> > over. For instance you can keep all the type 1 code and vfio
> > group/container stuff unchanged but use a combination of
> > IOMMU_VFIO_IOAS_GET and then IOMMUFD_CMD_IOAS_MAP_FILE to map a memfd.
>
> Right, sorry, it'd slipped my mind that we'd created the "soft"
> compatibility mode too.
>
> Zhe, so if you have no dependencies on P2P DMA within your device
> assignment VMs, the options above may be useful or at least a data
> point for comparison of type1 vs IOMMUFD performance. Thanks,
>
> Alex
Hi Alex, Jason, thanks a lot for your suggestions. I will try to explore
the usage of iommufd in our scenario. I also hope that the P2P DMA patches
will be merged into the mainline soon.
Thanks,
Zhe
^ permalink raw reply [flat|nested] 21+ messages in thread
Thread overview: 21+ messages
2025-05-20 7:00 [PATCH v3] vfio/type1: optimize vfio_pin_pages_remote() for huge folio lizhe.67
2025-05-20 14:07 ` Alex Williamson
2025-05-20 16:59 ` Peter Xu
2025-05-20 17:41 ` David Hildenbrand
2025-05-20 21:56 ` Alex Williamson
2025-05-21 3:34 ` lizhe.67
2025-05-20 17:38 ` David Hildenbrand
2025-05-20 22:21 ` Alex Williamson
2025-05-21 6:35 ` David Hildenbrand
2025-05-21 15:40 ` Peter Xu
2025-05-21 16:02 ` David Hildenbrand
2025-05-21 16:55 ` Alex Williamson
2025-05-26 20:19 ` Jason Gunthorpe
2025-05-27 19:52 ` Alex Williamson
2025-05-27 23:46 ` Jason Gunthorpe
2025-05-28 20:09 ` Alex Williamson
2025-05-29 0:51 ` Jason Gunthorpe
2025-05-29 2:58 ` Alex Williamson
2025-05-29 4:31 ` lizhe.67
2025-05-21 3:45 ` lizhe.67
2025-05-20 23:37 ` kernel test robot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).