From: Alex Williamson <alex.williamson@redhat.com>
To: lizhe.67@bytedance.com
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Peter Xu <peterx@redhat.com>,
David Hildenbrand <david@redhat.com>
Subject: Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
Date: Thu, 12 Jun 2025 16:32:39 -0600
Message-ID: <20250612163239.5e45afc6.alex.williamson@redhat.com>
In-Reply-To: <20250610045753.6405-1-lizhe.67@bytedance.com>
[Cc+ David, Peter]
On Tue, 10 Jun 2025 12:57:53 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs individual
> put_pfn() operations for each page. This can lead to significant
> performance overhead, especially when dealing with large ranges of pages.
>
> This patch optimizes this process by batching the put_pfn() operations.
>
> The performance test results for completing 16G of VFIO IOMMU DMA
> unmapping, based on v6.15 and obtained through unit test[2] with slight
> modifications[3], are as follows.
>
> Base(v6.15):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.048 s (331.3 GB/s)
> VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.281 s (57.0 GB/s)
> VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.053 s (301.2 GB/s)
> VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
>
> Map patch[1] + this patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.028 s (578.4 GB/s)
> VFIO UNMAP DMA in 0.049 s (324.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.293 s (54.6 GB/s)
> VFIO UNMAP DMA in 0.308 s (51.9 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.032 s (494.5 GB/s)
> VFIO UNMAP DMA in 0.050 s (322.8 GB/s)
>
> For large folios, we achieve an approximate 64% performance improvement
> in VFIO UNMAP DMA. For small folios, the performance test results show
> no significant change.
>
> [1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
> [2]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
> [3]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> Changelogs:
>
> v1->v2:
> - Refactor the implementation of the optimized code
>
> v1 patch: https://lore.kernel.org/all/20250605124923.21896-1-lizhe.67@bytedance.com/
>
> drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
> 1 file changed, 41 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 28ee4b8d39ae..2f6c0074d7b3 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> return true;
> }
>
> -static int put_pfn(unsigned long pfn, int prot)
> +static inline void _put_pfns(struct page *page, int npages, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> +}
>
> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> - return 1;
> +/*
> + * The caller must ensure that these npages PFNs belong to the same folio.
> + */
> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + _put_pfns(pfn_to_page(pfn), npages, prot);
> + return npages;
> }
> return 0;
> }
>
> +static inline int put_pfn(unsigned long pfn, int prot)
> +{
> + return put_pfns(pfn, 1, prot);
> +}
> +
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> unsigned long pfn, unsigned long npage,
> bool do_accounting)
> {
> - long unlocked = 0, locked = 0;
> - long i;
> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> - if (put_pfn(pfn++, dma->prot)) {
> - unlocked++;
> - if (vfio_find_vpfn(dma, iova))
> - locked++;
> + while (npage) {
> + struct folio *folio;
> + struct page *page;
> + long step = 1;
> +
> + if (is_invalid_reserved_pfn(pfn))
> + goto next;
> +
> + page = pfn_to_page(pfn);
> + folio = page_folio(page);
> +
> + if (!folio_test_large(folio)) {
> + _put_pfns(page, 1, dma->prot);
> + } else {
> + step = min_t(long, npage,
> + folio_nr_pages(folio) -
> + folio_page_idx(folio, page));
> + _put_pfns(page, step, dma->prot);
> }
> +
> + unlocked += step;
> +next:
Usage of @step is inconsistent, and the goto isn't really necessary
either. How about:
	while (npage) {
		unsigned long step = 1;

		if (!is_invalid_reserved_pfn(pfn)) {
			struct page *page = pfn_to_page(pfn);
			struct folio *folio = page_folio(page);
			long nr_pages = folio_nr_pages(folio);

			if (nr_pages > 1)
				step = min_t(long, npage,
					     nr_pages -
					     folio_page_idx(folio, page));

			_put_pfns(page, step, dma->prot);
			unlocked += step;
		}
> + pfn += step;
> + iova += PAGE_SIZE * step;
> + npage -= step;
> }
>
> if (do_accounting)
AIUI, the idea is that we know we have npage contiguous pfns and we
currently test invalid/reserved, call pfn_to_page(), call
unpin_user_pages_dirty_lock(), and test vpfn for each individually.
This instead wants to batch the vpfn-accounted pfns using the range
helper added for the mapping patch, infer that contiguous pfns have the
same invalid/reserved state and sequential pages, and use the end of
the folio to mark any inflections in those assumptions. Do I have that
correct?
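
If so, a concrete case to check the step math (hypothetical numbers,
not from the patch):

	/* 2M folio: 512 pages, entered 2 pages from the end, npage = 100 */
	long nr_pages = 512, idx = 510, npage = 100;
	long step = min_t(long, npage, nr_pages - idx);	/* step == 2 */
	/*
	 * The next iteration then starts exactly at the folio boundary,
	 * where invalid/reserved is re-tested, catching any inflection.
	 */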
I think this could be split into two patches, one simply batching the
vpfn accounting and the next introducing the folio dependency. The
contributions of each to the overall performance improvement would be
interesting.
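
For the first of those, I'd imagine something along these lines
(untested sketch, relying on the vpfn_pages() range helper from the
mapping patch):

	static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
					    unsigned long pfn, unsigned long npage,
					    bool do_accounting)
	{
		/* Batch the vpfn accounting across the whole range up front */
		long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
		long i;

		/* Still one put_pfn() per page; folio batching comes later */
		for (i = 0; i < npage; i++)
			unlocked += put_pfn(pfn++, dma->prot);

		if (do_accounting)
			vfio_lock_acct(dma, locked - unlocked, true);

		return unlocked;
	}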
I'll Cc some mm folks to see if they can punch holes in it. Thanks,
Alex