From: Alex Williamson <alex.williamson@redhat.com>
To: lizhe.67@bytedance.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	muchun.song@linux.dev, Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH] vfio/type1: optimize vfio_pin_pages_remote() for hugetlbfs folio
Date: Thu, 15 May 2025 15:19:46 -0600
Message-ID: <20250515151946.1e6edf8b.alex.williamson@redhat.com>
In-Reply-To: <20250513035730.96387-1-lizhe.67@bytedance.com>

On Tue, 13 May 2025 11:57:30 +0800
lizhe.67@bytedance.com wrote:

> From: Li Zhe <lizhe.67@bytedance.com>
> 
> When vfio_pin_pages_remote() is called with a range of addresses that
> includes hugetlbfs folios, the function currently performs individual
> statistics counting operations for each page. This can lead to significant
> performance overheads, especially when dealing with large ranges of pages.
> 
> This patch optimizes this process by batching the statistics counting
> operations.
> 
> The performance test results for completing the 8G VFIO IOMMU DMA mapping,
> obtained through trace-cmd, are as follows. In this case, the 8G virtual
> address space has been mapped to physical memory using hugetlbfs with
> pagesize=2M.
> 
> Before this patch:
> funcgraph_entry:      # 33813.703 us |  vfio_pin_map_dma();
> 
> After this patch:
> funcgraph_entry:      # 15635.055 us |  vfio_pin_map_dma();
> 
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
>  drivers/vfio/vfio_iommu_type1.c | 49 +++++++++++++++++++++++++++++++++
>  1 file changed, 49 insertions(+)

Hi,

Thanks for looking at improvements in this area...

> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 0ac56072af9f..bafa7f8c4cc6 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -337,6 +337,30 @@ static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma, dma_addr_t iova)
>  	return NULL;
>  }
>  
> +/*
> + * Find a random vfio_pfn that belongs to the range
> + * [iova, iova + PAGE_SIZE * npage)
> + */
> +static struct vfio_pfn *vfio_find_vpfn_range(struct vfio_dma *dma,
> +		dma_addr_t iova, unsigned long npage)
> +{
> +	struct vfio_pfn *vpfn;
> +	struct rb_node *node = dma->pfn_list.rb_node;
> +	dma_addr_t end_iova = iova + PAGE_SIZE * npage;
> +
> +	while (node) {
> +		vpfn = rb_entry(node, struct vfio_pfn, node);
> +
> +		if (end_iova <= vpfn->iova)
> +			node = node->rb_left;
> +		else if (iova > vpfn->iova)
> +			node = node->rb_right;
> +		else
> +			return vpfn;
> +	}
> +	return NULL;
> +}

This essentially duplicates vfio_find_vpfn(), where the existing
function only finds a single page.  The existing function should be
extended for this new use case and callers updated.  Also the vfio_pfn
is not "random", it's the first vfio_pfn overlapping the range.

> +
>  static void vfio_link_pfn(struct vfio_dma *dma,
>  			  struct vfio_pfn *new)
>  {
> @@ -670,6 +694,31 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
>  				iova += (PAGE_SIZE * ret);
>  				continue;
>  			}
> +

Spurious new blank line.

> +		}

A new blank line here would be appreciated.

> +		/* Handle hugetlbfs page */
> +		if (likely(!disable_hugepages) &&

Isn't this already accounted for with npage = 1?

> +				folio_test_hugetlb(page_folio(batch->pages[batch->offset]))) {

I don't follow how this guarantees the entire batch->size is
contiguous.  Isn't it possible that a batch could contain multiple
hugetlb folios?  Is the assumption here only true if folio_nr_pages()
(or specifically the pages remaining) is >= batch->capacity?  What
happens if we try to map the last half of one 2MB hugetlb page and the
first half of a non-physically-contiguous next page?  Or what if the
hugetlb size is 64KB and the batch contains multiple folios that are
not physically contiguous?
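
FWIW, the only contiguity we actually know about comes from the folio
itself, so I'd expect any batched count to be clamped to the end of the
current folio, something like the below (untested sketch; this also
assumes consecutive batch entries within a hugetlb folio are consecutive
pages of that folio, which still needs to be justified, ex. via the
existing pfn == *pfn_base + pinned test):

	struct page *page = batch->pages[batch->offset];
	struct folio *folio = page_folio(page);
	long nr_pages = 1;

	if (folio_test_hugetlb(folio))
		/* pages from here to the end of the folio are contiguous */
		nr_pages = min3((long)folio_nr_pages(folio) -
				(long)folio_page_idx(folio, page),
				(long)batch->size, npage);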

> +			if (pfn != *pfn_base + pinned)
> +				goto out;
> +
> +			if (!rsvd && !vfio_find_vpfn_range(dma, iova, batch->size)) {
> +				if (!dma->lock_cap &&
> +				    mm->locked_vm + lock_acct + batch->size > limit) {
> +					pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> +						__func__, limit << PAGE_SHIFT);
> +					ret = -ENOMEM;
> +					goto unpin_out;
> +				}
> +				pinned += batch->size;
> +				npage -= batch->size;
> +				vaddr += PAGE_SIZE * batch->size;
> +				iova += PAGE_SIZE * batch->size;
> +				lock_acct += batch->size;
> +				batch->offset += batch->size;
> +				batch->size = 0;
> +				continue;
> +			}

There's a lot of duplication with the existing page-iterative loop.  I
think they could be consolidated if we extract the number of known
contiguous pages based on the folio into a variable, default 1.
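
ie. the accounting path might look roughly like this, with nr_pages
defaulting to 1 (untested sketch; a vpfn already present somewhere in
the range would still need to fall back to stepping a page at a time):

		if (!rsvd && !vfio_find_vpfn_range(dma, iova, nr_pages)) {
			if (!dma->lock_cap &&
			    mm->locked_vm + lock_acct + nr_pages > limit) {
				pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
					__func__, limit << PAGE_SHIFT);
				ret = -ENOMEM;
				goto unpin_out;
			}
			lock_acct += nr_pages;
		}

		pinned += nr_pages;
		npage -= nr_pages;
		vaddr += PAGE_SIZE * nr_pages;
		iova += PAGE_SIZE * nr_pages;
		batch->offset += nr_pages;
		batch->size -= nr_pages;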

Also, while this approach is an improvement, it leaves a lot on the
table in scenarios where folio_nr_pages() exceeds batch->capacity.  For
example, we're at best incrementing 1GB hugetlb pages in 2MB increments.
We're also wasting a lot of cycles filling page pointers we mostly don't
use.  Thanks,

Alex



Thread overview: 6+ messages
2025-05-13  3:57 [PATCH] vfio/type1: optimize vfio_pin_pages_remote() for hugetlbfs folio lizhe.67
2025-05-15 21:19 ` Alex Williamson [this message]
2025-05-16  8:16   ` lizhe.67
2025-05-16 14:17     ` Alex Williamson
2025-05-16 14:18   ` Jason Gunthorpe
2025-05-16 14:55     ` Alex Williamson
