From: Alex Williamson <alex.williamson@redhat.com>
To: lizhe.67@bytedance.com
Cc: david@redhat.com, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, muchun.song@linux.dev,
	peterx@redhat.com
Subject: Re: [PATCH v4] vfio/type1: optimize vfio_pin_pages_remote() for large folio
Date: Thu, 22 May 2025 14:52:07 -0600	[thread overview]
Message-ID: <20250522145207.01734386.alex.williamson@redhat.com> (raw)
In-Reply-To: <20250522082524.75076-1-lizhe.67@bytedance.com>

On Thu, 22 May 2025 16:25:24 +0800
lizhe.67@bytedance.com wrote:

> On Thu, 22 May 2025 09:22:50 +0200, david@redhat.com wrote:
> 
> >On 22.05.25 05:49, lizhe.67@bytedance.com wrote:  
> >> On Wed, 21 May 2025 13:17:11 -0600, alex.williamson@redhat.com wrote:
> >>   
> >>>> From: Li Zhe <lizhe.67@bytedance.com>
> >>>>
> >>>> When vfio_pin_pages_remote() is called with a range of addresses that
> >>>> includes large folios, the function currently performs individual
> >>>> statistics counting operations for each page. This can lead to significant
> >>>> performance overhead, especially when dealing with large ranges of pages.
> >>>>
> >>>> This patch optimizes this process by batching the statistics counting
> >>>> operations.
> >>>>
> >>>> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> >>>> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> >>>> address space has been mapped to physical memory using hugetlbfs with
> >>>> pagesize=2M.
> >>>>
> >>>> Before this patch:
> >>>> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> >>>>
> >>>> After this patch:
> >>>> funcgraph_entry:      # 16071.378 us |  vfio_pin_map_dma();
> >>>>
> >>>> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> >>>> Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
> >>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>>> ---  
> >>>
> >>> Given the discussion on v3, this is currently a Nak.  Follow up in that
> >>> thread if there are further ideas on how to salvage this.  Thanks,
> >> 
> >> How about the solution David mentioned, checking whether the pages or
> >> PFNs are actually consecutive?
> >> 
> >> I have made a preliminary attempt, and performance testing revealed a
> >> time consumption of approximately 18,000 microseconds. Compared to the
> >> previous 33,000 microseconds, this also represents a significant
> >> improvement.
> >> 
> >> The modification is quite straightforward. The code below reflects the
> >> changes I have made based on this patch.
> >> 
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index bd46ed9361fe..1cc1f76d4020 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -627,6 +627,19 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >>          return ret;
> >>   }
> >>   
> >> +static inline long continuous_page_num(struct vfio_batch *batch, long npage)
> >> +{
> >> +       long i;
> >> +       unsigned long next_pfn = page_to_pfn(batch->pages[batch->offset]) + 1;
> >> +
> >> +       for (i = 1; i < npage; ++i) {
> >> +               if (page_to_pfn(batch->pages[batch->offset + i]) != next_pfn)
> >> +                       break;
> >> +               next_pfn++;
> >> +       }
> >> +       return i;
> >> +}  
> >
> >
> >What might be faster is obtaining the folio, then calculating the next
> >expected page pointer and comparing whether the page pointers match.
> >
> >Essentially, using folio_page() to calculate the expected next page.
> >
> >nth_page() is simple pointer arithmetic with CONFIG_SPARSEMEM_VMEMMAP,
> >so that might be rather fast.
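> >
> >For reference, with CONFIG_SPARSEMEM_VMEMMAP these helpers essentially
> >reduce to plain pointer math (a rough sketch of the mm.h definitions,
> >not copied verbatim):
> >
> >	/* struct pages are virtually contiguous in the vmemmap */
> >	#define nth_page(page, n)	((page) + (n))
> >	#define folio_page(folio, n)	nth_page(&(folio)->page, n)
> >
> >so each check is a single pointer comparison, no pfn conversion needed.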
> >
> >
> >So we'd obtain
> >
> >start_idx = folio_idx(folio, batch->pages[batch->offset]);  
> 
> Do you mean using folio_page_idx()?
> 
> >and then check for
> >
> >batch->pages[batch->offset + i] == folio_page(folio, start_idx + i)  
> 
> Thank you for your reminder. This is indeed a better solution.
> The updated code might look like this:
> 
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index bd46ed9361fe..f9a11b1d8433 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -627,6 +627,20 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>         return ret;
>  }
>  
> +static inline long continuous_pages_num(struct folio *folio,
> +               struct vfio_batch *batch, long npage)

Note this becomes long enough that we should just let the compiler
decide whether to inline or not.

> +{
> +       long i;
> +       unsigned long start_idx =
> +                       folio_page_idx(folio, batch->pages[batch->offset]);
> +
> +       for (i = 1; i < npage; ++i)
> +               if (batch->pages[batch->offset + i] !=
> +                               folio_page(folio, start_idx + i))
> +                       break;
> +       return i;
> +}
> +
>  /*
>   * Attempt to pin pages.  We really don't want to track all the pfns and
>   * the iommu can only map chunks of consecutive pfns anyway, so get the
> @@ -708,8 +722,12 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>                          */
>                         nr_pages = min_t(long, batch->size, folio_nr_pages(folio) -
>                                                 folio_page_idx(folio, batch->pages[batch->offset]));
> -                       if (nr_pages > 1 && vfio_find_vpfn_range(dma, iova, nr_pages))
> -                               nr_pages = 1;
> +                       if (nr_pages > 1) {
> +                               if (vfio_find_vpfn_range(dma, iova, nr_pages))
> +                                       nr_pages = 1;
> +                               else
> +                                       nr_pages = continuous_pages_num(folio, batch, nr_pages);
> +                       }


I think we can refactor this a bit better, and if we're going to the
trouble of comparing pages, we can also be a bit more resilient to pages
already accounted as vpfns.  I took a shot at it (compile tested only);
is there still a worthwhile gain?

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 0ac56072af9f..e8bba32148f7 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -319,7 +319,13 @@ static void vfio_dma_bitmap_free_all(struct vfio_iommu *iommu)
 /*
  * Helper Functions for host iova-pfn list
  */
-static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+
+/*
+ * Find the first vfio_pfn overlapping the range
+ * [iova_start, iova_end) in the rb tree.
+ */
+static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
+		dma_addr_t iova_start, dma_addr_t iova_end)
 {
 	struct vfio_pfn *vpfn;
 	struct rb_node *node = dma->pfn_list.rb_node;
@@ -327,9 +333,9 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	while (node) {
 		vpfn = rb_entry(node, struct vfio_pfn, node);
 
-		if (iova < vpfn->iova)
+		if (iova_end <= vpfn->iova)
 			node = node->rb_left;
-		else if (iova > vpfn->iova)
+		else if (iova_start > vpfn->iova)
 			node = node->rb_right;
 		else
 			return vpfn;
@@ -337,6 +343,11 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
 	return NULL;
 }
 
+static inline struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
+{
+	return vfio_find_vpfn_range(dma, iova, iova + PAGE_SIZE);
+}
+
 static void vfio_link_pfn(struct vfio_dma *dma,
 			  struct vfio_pfn *new)
 {
@@ -615,6 +626,43 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	return ret;
 }
 
+static long contig_pages(struct vfio_dma *dma,
+			 struct vfio_batch *batch, dma_addr_t iova)
+{
+	struct page *page = batch->pages[batch->offset];
+	struct folio *folio = page_folio(page);
+	long idx = folio_page_idx(folio, page);
+	long max = min_t(long, batch->size, folio_nr_pages(folio) - idx);
+	long nr_pages;
+
+	for (nr_pages = 1; nr_pages < max; nr_pages++) {
+		if (batch->pages[batch->offset + nr_pages] !=
+		    folio_page(folio, idx + nr_pages))
+			break;
+	}
+
+	return nr_pages;
+}
+
+static long vpfn_pages(struct vfio_dma *dma,
+		       dma_addr_t iova_start, long nr_pages)
+{
+	dma_addr_t iova_end = iova_start + (nr_pages << PAGE_SHIFT);
+	struct vfio_pfn *vpfn;
+	long count = 0;
+
+	do {
+		vpfn = vfio_find_vpfn_range(dma, iova_start, iova_end);
+		if (likely(!vpfn))
+			break;
+
+		count++;
+		iova_start = vpfn->iova + PAGE_SIZE;
+	} while (iova_start < iova_end);
+
+	return count;
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -681,32 +729,40 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 		 * and rsvd here, and therefore continues to use the batch.
 		 */
 		while (true) {
+			long nr_pages, acct_pages = 0;
+
 			if (pfn != *pfn_base + pinned ||
 			    rsvd != is_invalid_reserved_pfn(pfn))
 				goto out;
 
+			nr_pages = contig_pages(dma, batch, iova);
+			if (!rsvd) {
+				acct_pages = nr_pages;
+				acct_pages -= vpfn_pages(dma, iova, nr_pages);
+			}
+
 			/*
 			 * Reserved pages aren't counted against the user,
 			 * externally pinned pages are already counted against
 			 * the user.
 			 */
-			if (!rsvd && !vfio_find_vpfn(dma, iova)) {
+			if (acct_pages) {
 				if (!dma->lock_cap &&
-				    mm->locked_vm + lock_acct + 1 > limit) {
+				    mm->locked_vm + lock_acct + acct_pages > limit) {
 					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
 						__func__, limit << PAGE_SHIFT);
 					ret = -ENOMEM;
 					goto unpin_out;
 				}
-				lock_acct++;
+				lock_acct += acct_pages;
 			}
 
-			pinned++;
-			npage--;
-			vaddr += PAGE_SIZE;
-			iova += PAGE_SIZE;
-			batch->offset++;
-			batch->size--;
+			pinned += nr_pages;
+			npage -= nr_pages;
+			vaddr += PAGE_SIZE * nr_pages;
+			iova += PAGE_SIZE * nr_pages;
+			batch->offset += nr_pages;
+			batch->size -= nr_pages;
 
 			if (!batch->size)
 				break;
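
To put rough numbers on the new accounting (illustrative only, not from
the test above): a batch fully covering a 2M hugetlb folio gives
contig_pages() == 512; if vpfn_pages() finds, say, two externally pinned
iovas in that range, acct_pages is 510, charged against the memlock
limit in a single step rather than via 512 individual vfio_find_vpfn()
lookups.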


