* [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-10 4:57 UTC
To: alex.williamson; +Cc: kvm, linux-kernel, lizhe.67
From: Li Zhe <lizhe.67@bytedance.com>
This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
for large folios'[1].
When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs an individual
put_pfn() operation for each page. This can lead to significant
performance overhead, especially when dealing with large ranges of pages.

This patch optimizes the process by batching the put_pfn() operations.
The performance test results for completing a 16G VFIO IOMMU DMA
unmapping, based on v6.15 and obtained with the unit test[2] (with slight
modifications[3]), are as follows.
Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (331.3 GB/s)
VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.281 s (57.0 GB/s)
VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.053 s (301.2 GB/s)
VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
Map[1] + this patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.028 s (578.4 GB/s)
VFIO UNMAP DMA in 0.049 s (324.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.293 s (54.6 GB/s)
VFIO UNMAP DMA in 0.308 s (51.9 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (494.5 GB/s)
VFIO UNMAP DMA in 0.050 s (322.8 GB/s)
For large folios, we achieve an approximately 64% reduction in VFIO
UNMAP DMA time (roughly a 2.8x throughput improvement). For small
folios, the performance test results show no significant changes.
[1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
[2]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
[3]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
---
Changelogs:
v1->v2:
- Refactor the implementation of the optimized code
v1 patch: https://lore.kernel.org/all/20250605124923.21896-1-lizhe.67@bytedance.com/
drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
1 file changed, 41 insertions(+), 12 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..2f6c0074d7b3 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
return true;
}
-static int put_pfn(unsigned long pfn, int prot)
+static inline void _put_pfns(struct page *page, int npages, int prot)
{
- if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
+ unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
+}
- unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
- return 1;
+/*
+ * The caller must ensure that these npages PFNs belong to the same folio.
+ */
+static inline int put_pfns(unsigned long pfn, int npages, int prot)
+{
+ if (!is_invalid_reserved_pfn(pfn)) {
+ _put_pfns(pfn_to_page(pfn), npages, prot);
+ return npages;
}
return 0;
}
+static inline int put_pfn(unsigned long pfn, int prot)
+{
+ return put_pfns(pfn, 1, prot);
+}
+
#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
static void __vfio_batch_init(struct vfio_batch *batch, bool single)
@@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
unsigned long pfn, unsigned long npage,
bool do_accounting)
{
- long unlocked = 0, locked = 0;
- long i;
+ long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
- for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
- if (put_pfn(pfn++, dma->prot)) {
- unlocked++;
- if (vfio_find_vpfn(dma, iova))
- locked++;
+ while (npage) {
+ struct folio *folio;
+ struct page *page;
+ long step = 1;
+
+ if (is_invalid_reserved_pfn(pfn))
+ goto next;
+
+ page = pfn_to_page(pfn);
+ folio = page_folio(page);
+
+ if (!folio_test_large(folio)) {
+ _put_pfns(page, 1, dma->prot);
+ } else {
+ step = min_t(long, npage,
+ folio_nr_pages(folio) -
+ folio_page_idx(folio, page));
+ _put_pfns(page, step, dma->prot);
}
+
+ unlocked += step;
+next:
+ pfn += step;
+ iova += PAGE_SIZE * step;
+ npage -= step;
}
if (do_accounting)
--
2.20.1
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: Alex Williamson @ 2025-06-12 22:32 UTC
To: lizhe.67; +Cc: kvm, linux-kernel, Peter Xu, David Hildenbrand
[Cc+ David, Peter]
On Tue, 10 Jun 2025 12:57:53 +0800
lizhe.67@bytedance.com wrote:
> From: Li Zhe <lizhe.67@bytedance.com>
>
> This patch is based on patch 'vfio/type1: optimize vfio_pin_pages_remote()
> for large folios'[1].
>
> When vfio_unpin_pages_remote() is called with a range of addresses that
> includes large folios, the function currently performs an individual
> put_pfn() operation for each page. This can lead to significant
> performance overhead, especially when dealing with large ranges of pages.
>
> This patch optimizes the process by batching the put_pfn() operations.
>
> The performance test results for completing a 16G VFIO IOMMU DMA
> unmapping, based on v6.15 and obtained with the unit test[2] (with slight
> modifications[3]), are as follows.
>
> Base(v6.15):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.048 s (331.3 GB/s)
> VFIO UNMAP DMA in 0.138 s (116.1 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.281 s (57.0 GB/s)
> VFIO UNMAP DMA in 0.313 s (51.1 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.053 s (301.2 GB/s)
> VFIO UNMAP DMA in 0.139 s (115.2 GB/s)
>
> Map[1] + this patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.028 s (578.4 GB/s)
> VFIO UNMAP DMA in 0.049 s (324.8 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.293 s (54.6 GB/s)
> VFIO UNMAP DMA in 0.308 s (51.9 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.032 s (494.5 GB/s)
> VFIO UNMAP DMA in 0.050 s (322.8 GB/s)
>
> For large folios, we achieve an approximately 64% reduction in VFIO
> UNMAP DMA time (roughly a 2.8x throughput improvement). For small
> folios, the performance test results show no significant changes.
>
> [1]: https://lore.kernel.org/all/20250529064947.38433-1-lizhe.67@bytedance.com/
> [2]: https://github.com/awilliam/tests/blob/vfio-pci-mem-dma-map/vfio-pci-mem-dma-map.c
> [3]: https://lore.kernel.org/all/20250610031013.98556-1-lizhe.67@bytedance.com/
>
> Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
> ---
> Changelogs:
>
> v1->v2:
> - Refactor the implementation of the optimized code
>
> v1 patch: https://lore.kernel.org/all/20250605124923.21896-1-lizhe.67@bytedance.com/
>
> drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
> 1 file changed, 41 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 28ee4b8d39ae..2f6c0074d7b3 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> return true;
> }
>
> -static int put_pfn(unsigned long pfn, int prot)
> +static inline void _put_pfns(struct page *page, int npages, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> +}
>
> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> - return 1;
> +/*
> + * The caller must ensure that these npages PFNs belong to the same folio.
> + */
> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + _put_pfns(pfn_to_page(pfn), npages, prot);
> + return npages;
> }
> return 0;
> }
>
> +static inline int put_pfn(unsigned long pfn, int prot)
> +{
> + return put_pfns(pfn, 1, prot);
> +}
> +
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> unsigned long pfn, unsigned long npage,
> bool do_accounting)
> {
> - long unlocked = 0, locked = 0;
> - long i;
> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> - if (put_pfn(pfn++, dma->prot)) {
> - unlocked++;
> - if (vfio_find_vpfn(dma, iova))
> - locked++;
> + while (npage) {
> + struct folio *folio;
> + struct page *page;
> + long step = 1;
> +
> + if (is_invalid_reserved_pfn(pfn))
> + goto next;
> +
> + page = pfn_to_page(pfn);
> + folio = page_folio(page);
> +
> + if (!folio_test_large(folio)) {
> + _put_pfns(page, 1, dma->prot);
> + } else {
> + step = min_t(long, npage,
> + folio_nr_pages(folio) -
> + folio_page_idx(folio, page));
> + _put_pfns(page, step, dma->prot);
> }
> +
> + unlocked += step;
> +next:
Usage of @step is inconsistent, goto isn't really necessary either, how
about:
	while (npage) {
		unsigned long step = 1;

		if (!is_invalid_reserved_pfn(pfn)) {
			struct page *page = pfn_to_page(pfn);
			struct folio *folio = page_folio(page);
			long nr_pages = folio_nr_pages(folio);

			if (nr_pages > 1)
				step = min_t(long, npage,
					     nr_pages -
					     folio_page_idx(folio, page));

			_put_pfns(page, step, dma->prot);
			unlocked += step;
		}
> + pfn += step;
> + iova += PAGE_SIZE * step;
> + npage -= step;
> }
>
> if (do_accounting)
AIUI, the idea is that we know we have npage contiguous pfns and we
currently test invalid/reserved, call pfn_to_page(), call
unpin_user_pages_dirty_lock(), and test vpfn for each individually.
This instead wants to batch the vpfn-accounted pfns using the range
helper added for the mapping patch, and to infer that contiguous pfns
have the same invalid/reserved state and sequential pages, with the end
of the folio marking any inflection in those assumptions. Do I have
that correct?
I think this could be split into two patches, one simply batching the
vpfn accounting and the next introducing the folio dependency. The
contributions of each to the overall performance improvement would be
interesting.
I'll Cc some mm folks to see if they can punch holes in it. Thanks,
Alex
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: lizhe.67 @ 2025-06-13 6:29 UTC
To: alex.williamson; +Cc: david, kvm, linux-kernel, lizhe.67, peterx
On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@redhat.com wrote:
> > drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
> > 1 file changed, 41 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 28ee4b8d39ae..2f6c0074d7b3 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> > return true;
> > }
> >
> > -static int put_pfn(unsigned long pfn, int prot)
> > +static inline void _put_pfns(struct page *page, int npages, int prot)
> > {
> > - if (!is_invalid_reserved_pfn(pfn)) {
> > - struct page *page = pfn_to_page(pfn);
> > + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> > +}
> >
> > - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> > - return 1;
> > +/*
> > + * The caller must ensure that these npages PFNs belong to the same folio.
> > + */
> > +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> > +{
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + _put_pfns(pfn_to_page(pfn), npages, prot);
> > + return npages;
> > }
> > return 0;
> > }
> >
> > +static inline int put_pfn(unsigned long pfn, int prot)
> > +{
> > + return put_pfns(pfn, 1, prot);
> > +}
> > +
> > #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> >
> > static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> > @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > unsigned long pfn, unsigned long npage,
> > bool do_accounting)
> > {
> > - long unlocked = 0, locked = 0;
> > - long i;
> > + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> >
> > - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > - if (put_pfn(pfn++, dma->prot)) {
> > - unlocked++;
> > - if (vfio_find_vpfn(dma, iova))
> > - locked++;
> > + while (npage) {
> > + struct folio *folio;
> > + struct page *page;
> > + long step = 1;
> > +
> > + if (is_invalid_reserved_pfn(pfn))
> > + goto next;
> > +
> > + page = pfn_to_page(pfn);
> > + folio = page_folio(page);
> > +
> > + if (!folio_test_large(folio)) {
> > + _put_pfns(page, 1, dma->prot);
> > + } else {
> > + step = min_t(long, npage,
> > + folio_nr_pages(folio) -
> > + folio_page_idx(folio, page));
> > + _put_pfns(page, step, dma->prot);
> > }
> > +
> > + unlocked += step;
> > +next:
>
> Usage of @step is inconsistent, goto isn't really necessary either, how
> about:
>
> while (npage) {
> unsigned long step = 1;
>
> if (!is_invalid_reserved_pfn(pfn)) {
> struct page *page = pfn_to_page(pfn);
> struct folio *folio = page_folio(page);
> long nr_pages = folio_nr_pages(folio);
>
> if (nr_pages > 1)
> step = min_t(long, npage,
> nr_pages -
> folio_page_idx(folio, page));
>
> _put_pfns(page, step, dma->prot);
> unlocked += step;
> }
>
That's great. This implementation is much better.
I'm a bit uncertain about the best type to use for the 'step' variable
here. I've been trying to keep things consistent with the put_pfn()
function, so I set the type of the second parameter of _put_pfns() to
'int' (we pass 'step' as the second argument to _put_pfns()).
Using unsigned long for 'step' should definitely work here, as the
number of pages in a large folio currently falls within the range that
can be represented by an int. However, there is still a potential risk
of truncation, when the wider 'step' is narrowed to that int parameter,
that we need to be mindful of.
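If this ever becomes a concern, a defensive clamp at the call site would
make the narrowing explicit. A minimal sketch (a hypothetical guard, not
part of the patch):

	/* _put_pfns() takes an int, so make the long->int narrowing
	 * explicit; today folio_nr_pages() is far below INT_MAX, so
	 * this only guards against future growth. */
	step = min_t(long, step, INT_MAX);
	_put_pfns(page, step, dma->prot);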
> > + pfn += step;
> > + iova += PAGE_SIZE * step;
> > + npage -= step;
> > }
> >
> > if (do_accounting)
>
> AIUI, the idea is that we know we have npage contiguous pfns and we
> currently test invalid/reserved, call pfn_to_page(), call
> unpin_user_pages_dirty_lock(), and test vpfn for each individually.
>
> This instead wants to batch the vpfn accounted pfns using the range
> helper added for the mapping patch,
Yes. We use vpfn_pages() just to track the locked pages.
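For reference, the semantics assumed here for vpfn_pages() (the range
helper added by the map-side patch [1]) are equivalent to the per-page
lookups it replaces. A sketch of those assumed semantics, not the actual
helper:

	/* Sketch only: count the vpfn-accounted pages in the range; the
	 * real helper from [1] can presumably avoid doing one rbtree
	 * lookup per page. */
	static long vpfn_pages(struct vfio_dma *dma, dma_addr_t iova, long npage)
	{
		long i, ret = 0;

		for (i = 0; i < npage; i++, iova += PAGE_SIZE)
			if (vfio_find_vpfn(dma, iova))
				ret++;
		return ret;
	}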
> infer that continuous pfns have the
> same invalid/reserved state, the pages are sequential, and that we can
> use the end of the folio to mark any inflections in those assumptions
> otherwise. Do I have that correct?
Yes. I think we're definitely on the same page here.
> I think this could be split into two patches, one simply batching the
> vpfn accounting and the next introducing the folio dependency. The
> contributions of each to the overall performance improvement would be
> interesting.
I've made an initial attempt, and here are the two patches after
splitting them up.
1. batch-vpfn-accounting-patch:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 28ee4b8d39ae..c8ddcee5aa68 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
unsigned long pfn, unsigned long npage,
bool do_accounting)
{
- long unlocked = 0, locked = 0;
+ long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
long i;
- for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
- if (put_pfn(pfn++, dma->prot)) {
+ for (i = 0; i < npage; i++, iova += PAGE_SIZE)
+ if (put_pfn(pfn++, dma->prot))
unlocked++;
- if (vfio_find_vpfn(dma, iova))
- locked++;
- }
- }
if (do_accounting)
vfio_lock_acct(dma, locked - unlocked, true);
-----------------
2. large-folio-optimization-patch:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c8ddcee5aa68..48c2ba4ba4eb 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
return true;
}
-static int put_pfn(unsigned long pfn, int prot)
+static inline void _put_pfns(struct page *page, int npages, int prot)
{
- if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
+ unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
+}
- unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
- return 1;
+/*
+ * The caller must ensure that these npages PFNs belong to the same folio.
+ */
+static inline int put_pfns(unsigned long pfn, int npages, int prot)
+{
+ if (!is_invalid_reserved_pfn(pfn)) {
+ _put_pfns(pfn_to_page(pfn), npages, prot);
+ return npages;
}
return 0;
}
+static inline int put_pfn(unsigned long pfn, int prot)
+{
+ return put_pfns(pfn, 1, prot);
+}
+
#define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
static void __vfio_batch_init(struct vfio_batch *batch, bool single)
@@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
bool do_accounting)
{
long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
- long i;
- for (i = 0; i < npage; i++, iova += PAGE_SIZE)
- if (put_pfn(pfn++, dma->prot))
- unlocked++;
+ while (npage) {
+ long step = 1;
+
+ if (!is_invalid_reserved_pfn(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+ struct folio *folio = page_folio(page);
+ long nr_pages = folio_nr_pages(folio);
+
+ if (nr_pages > 1)
+ step = min_t(long, npage,
+ nr_pages -
+ folio_page_idx(folio, page));
+
+ _put_pfns(page, step, dma->prot);
+ unlocked += step;
+ }
+
+ pfn += step;
+ iova += PAGE_SIZE * step;
+ npage -= step;
+ }
if (do_accounting)
vfio_lock_acct(dma, locked - unlocked, true);
-----------------
Here are the results of the performance tests.
Base(v6.15):
./vfio-pci-mem-dma-map 0000:03:00.0 16
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.048 s (333.5 GB/s)
VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.273 s (58.6 GB/s)
VFIO UNMAP DMA in 0.302 s (52.9 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.052 s (305.3 GB/s)
VFIO UNMAP DMA in 0.141 s (113.8 GB/s)
Base + Map + batch-vpfn-accounting-patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (591.1 GB/s)
VFIO UNMAP DMA in 0.138 s (115.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.292 s (54.8 GB/s)
VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (505.5 GB/s)
VFIO UNMAP DMA in 0.140 s (114.1 GB/s)
Base + Map + batch-vpfn-accounting-patch + large-folio-optimization-patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.027 s (591.2 GB/s)
VFIO UNMAP DMA in 0.049 s (327.6 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.291 s (55.0 GB/s)
VFIO UNMAP DMA in 0.306 s (52.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.032 s (498.3 GB/s)
VFIO UNMAP DMA in 0.049 s (326.2 GB/s)
Batching the vpfn accounting doesn't seem to have much of an impact in
my environment. Perhaps this is because the rbtree for vpfn is empty,
allowing vfio_find_vpfn() to execute quickly?
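For context, vfio_find_vpfn() is a plain rbtree search, roughly along
these lines (a from-memory sketch, not a quote of the source); with an
empty pfn_list the loop body never runs:

	static struct vfio_pfn *vfio_find_vpfn(struct vfio_dma *dma,
					       dma_addr_t iova)
	{
		/* rb_node is NULL for an empty tree: immediate return */
		struct rb_node *node = dma->pfn_list.rb_node;

		while (node) {
			struct vfio_pfn *vpfn = rb_entry(node,
							 struct vfio_pfn, node);

			if (iova < vpfn->iova)
				node = node->rb_left;
			else if (iova > vpfn->iova)
				node = node->rb_right;
			else
				return vpfn;
		}
		return NULL;
	}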
Thanks,
Zhe
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: David Hildenbrand @ 2025-06-13 13:37 UTC
To: lizhe.67, alex.williamson; +Cc: kvm, linux-kernel, peterx
On 13.06.25 08:29, lizhe.67@bytedance.com wrote:
> On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@redhat.com wrote:
>
>>> drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
>>> 1 file changed, 41 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 28ee4b8d39ae..2f6c0074d7b3 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>> return true;
>>> }
>>>
>>> -static int put_pfn(unsigned long pfn, int prot)
>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>> {
>>> - if (!is_invalid_reserved_pfn(pfn)) {
>>> - struct page *page = pfn_to_page(pfn);
>>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>> +}
>>>
>>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>> - return 1;
>>> +/*
>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>> + */
>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>> +{
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>>> + return npages;
>>> }
>>> return 0;
>>> }
>>>
>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>> +{
>>> + return put_pfns(pfn, 1, prot);
>>> +}
>>> +
>>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>
>>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>> unsigned long pfn, unsigned long npage,
>>> bool do_accounting)
>>> {
>>> - long unlocked = 0, locked = 0;
>>> - long i;
>>> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>>
>>> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>>> - if (put_pfn(pfn++, dma->prot)) {
>>> - unlocked++;
>>> - if (vfio_find_vpfn(dma, iova))
>>> - locked++;
>>> + while (npage) {
>>> + struct folio *folio;
>>> + struct page *page;
>>> + long step = 1;
>>> +
>>> + if (is_invalid_reserved_pfn(pfn))
>>> + goto next;
>>> +
>>> + page = pfn_to_page(pfn);
>>> + folio = page_folio(page);
>>> +
>>> + if (!folio_test_large(folio)) {
>>> + _put_pfns(page, 1, dma->prot);
>>> + } else {
>>> + step = min_t(long, npage,
>>> + folio_nr_pages(folio) -
>>> + folio_page_idx(folio, page));
>>> + _put_pfns(page, step, dma->prot);
>>> }
>>> +
>>> + unlocked += step;
>>> +next:
>>
>> Usage of @step is inconsistent, goto isn't really necessary either, how
>> about:
>>
>> while (npage) {
>> unsigned long step = 1;
>>
>> if (!is_invalid_reserved_pfn(pfn)) {
>> struct page *page = pfn_to_page(pfn);
>> struct folio *folio = page_folio(page);
>> long nr_pages = folio_nr_pages(folio);
>>
>> if (nr_pages > 1)
>> step = min_t(long, npage,
>> nr_pages -
>> folio_page_idx(folio, page));
>>
>> _put_pfns(page, step, dma->prot);
>> unlocked += step;
>> }
>>
>
> That's great. This implementation is much better.
>
> I'm a bit uncertain about the best type to use for the 'step'
> variable here. I've been trying to keep things consistent with the
> put_pfn() function, so I set the type of the second parameter in
> _put_pfns() to 'int'(we pass 'step' as the second argument to
> _put_pfns()).
>
> Using unsigned long for 'step' should definitely work here, as the
> number of pages in a large folio currently falls within the range
> that can be represented by an int. However, there is still a
> potential risk of truncation that we need to be mindful of.
>
>>> + pfn += step;
>>> + iova += PAGE_SIZE * step;
>>> + npage -= step;
>>> }
>>>
>>> if (do_accounting)
>>
>> AIUI, the idea is that we know we have npage contiguous pfns and we
>> currently test invalid/reserved, call pfn_to_page(), call
>> unpin_user_pages_dirty_lock(), and test vpfn for each individually.
>>
>> This instead wants to batch the vpfn accounted pfns using the range
>> helper added for the mapping patch,
>
> Yes. We use vpfn_pages() just to track the locked pages.
>
>> infer that continuous pfns have the
>> same invalid/reserved state, the pages are sequential, and that we can
>> use the end of the folio to mark any inflections in those assumptions
>> otherwise. Do I have that correct?
>
> Yes. I think we're definitely on the same page here.
>
>> I think this could be split into two patches, one simply batching the
>> vpfn accounting and the next introducing the folio dependency. The
>> contributions of each to the overall performance improvement would be
>> interesting.
>
> I've made an initial attempt, and here are the two patches after
> splitting them up.
>
> 1. batch-vpfn-accounting-patch:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 28ee4b8d39ae..c8ddcee5aa68 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> unsigned long pfn, unsigned long npage,
> bool do_accounting)
> {
> - long unlocked = 0, locked = 0;
> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> long i;
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> - if (put_pfn(pfn++, dma->prot)) {
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> + if (put_pfn(pfn++, dma->prot))
> unlocked++;
> - if (vfio_find_vpfn(dma, iova))
> - locked++;
> - }
> - }
>
> if (do_accounting)
> vfio_lock_acct(dma, locked - unlocked, true);
> -----------------
>
> 2. large-folio-optimization-patch:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c8ddcee5aa68..48c2ba4ba4eb 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> return true;
> }
>
> -static int put_pfn(unsigned long pfn, int prot)
> +static inline void _put_pfns(struct page *page, int npages, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> +}
>
> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> - return 1;
> +/*
> + * The caller must ensure that these npages PFNs belong to the same folio.
> + */
> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + _put_pfns(pfn_to_page(pfn), npages, prot);
> + return npages;
> }
> return 0;
> }
>
> +static inline int put_pfn(unsigned long pfn, int prot)
> +{
> + return put_pfns(pfn, 1, prot);
> +}
> +
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> @@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> bool do_accounting)
> {
> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> - long i;
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> - if (put_pfn(pfn++, dma->prot))
> - unlocked++;
> + while (npage) {
> + long step = 1;
> +
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = page_folio(page);
> + long nr_pages = folio_nr_pages(folio);
> +
> + if (nr_pages > 1)
> + step = min_t(long, npage,
> + nr_pages -
> + folio_page_idx(folio, page));
> +
> + _put_pfns(page, step, dma->prot);
I'm confused, why do we batch pages by looking at the folio, to then
pass the pages into unpin_user_page_range_dirty_lock?
Why does the folio relationship matter at all here?
Aren't we making the same mistake that we are jumping over pages we
shouldn't be jumping over, because we assume they belong to that folio?
--
Cheers,
David / dhildenb
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: Alex Williamson @ 2025-06-13 14:16 UTC
To: David Hildenbrand; +Cc: lizhe.67, kvm, linux-kernel, peterx
On Fri, 13 Jun 2025 15:37:40 +0200
David Hildenbrand <david@redhat.com> wrote:
> On 13.06.25 08:29, lizhe.67@bytedance.com wrote:
> > On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@redhat.com wrote:
> >
> >>> drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
> >>> 1 file changed, 41 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >>> index 28ee4b8d39ae..2f6c0074d7b3 100644
> >>> --- a/drivers/vfio/vfio_iommu_type1.c
> >>> +++ b/drivers/vfio/vfio_iommu_type1.c
> >>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>> return true;
> >>> }
> >>>
> >>> -static int put_pfn(unsigned long pfn, int prot)
> >>> +static inline void _put_pfns(struct page *page, int npages, int prot)
> >>> {
> >>> - if (!is_invalid_reserved_pfn(pfn)) {
> >>> - struct page *page = pfn_to_page(pfn);
> >>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> >>> +}
> >>>
> >>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> >>> - return 1;
> >>> +/*
> >>> + * The caller must ensure that these npages PFNs belong to the same folio.
> >>> + */
> >>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> >>> +{
> >>> + if (!is_invalid_reserved_pfn(pfn)) {
> >>> + _put_pfns(pfn_to_page(pfn), npages, prot);
> >>> + return npages;
> >>> }
> >>> return 0;
> >>> }
> >>>
> >>> +static inline int put_pfn(unsigned long pfn, int prot)
> >>> +{
> >>> + return put_pfns(pfn, 1, prot);
> >>> +}
> >>> +
> >>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> >>>
> >>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> >>> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> >>> unsigned long pfn, unsigned long npage,
> >>> bool do_accounting)
> >>> {
> >>> - long unlocked = 0, locked = 0;
> >>> - long i;
> >>> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> >>>
> >>> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> >>> - if (put_pfn(pfn++, dma->prot)) {
> >>> - unlocked++;
> >>> - if (vfio_find_vpfn(dma, iova))
> >>> - locked++;
> >>> + while (npage) {
> >>> + struct folio *folio;
> >>> + struct page *page;
> >>> + long step = 1;
> >>> +
> >>> + if (is_invalid_reserved_pfn(pfn))
> >>> + goto next;
> >>> +
> >>> + page = pfn_to_page(pfn);
> >>> + folio = page_folio(page);
> >>> +
> >>> + if (!folio_test_large(folio)) {
> >>> + _put_pfns(page, 1, dma->prot);
> >>> + } else {
> >>> + step = min_t(long, npage,
> >>> + folio_nr_pages(folio) -
> >>> + folio_page_idx(folio, page));
> >>> + _put_pfns(page, step, dma->prot);
> >>> }
> >>> +
> >>> + unlocked += step;
> >>> +next:
> >>
> >> Usage of @step is inconsistent, goto isn't really necessary either, how
> >> about:
> >>
> >> while (npage) {
> >> unsigned long step = 1;
> >>
> >> if (!is_invalid_reserved_pfn(pfn)) {
> >> struct page *page = pfn_to_page(pfn);
> >> struct folio *folio = page_folio(page);
> >> long nr_pages = folio_nr_pages(folio);
> >>
> >> if (nr_pages > 1)
> >> step = min_t(long, npage,
> >> nr_pages -
> >> folio_page_idx(folio, page));
> >>
> >> _put_pfns(page, step, dma->prot);
> >> unlocked += step;
> >> }
> >>
> >
> > That's great. This implementation is much better.
> >
> > I'm a bit uncertain about the best type to use for the 'step'
> > variable here. I've been trying to keep things consistent with the
> > put_pfn() function, so I set the type of the second parameter in
> > _put_pfns() to 'int'(we pass 'step' as the second argument to
> > _put_pfns()).
> >
> > Using unsigned long for 'step' should definitely work here, as the
> > number of pages in a large folio currently falls within the range
> > that can be represented by an int. However, there is still a
> > potential risk of truncation that we need to be mindful of.
> >
> >>> + pfn += step;
> >>> + iova += PAGE_SIZE * step;
> >>> + npage -= step;
> >>> }
> >>>
> >>> if (do_accounting)
> >>
> >> AIUI, the idea is that we know we have npage contiguous pfns and we
> >> currently test invalid/reserved, call pfn_to_page(), call
> >> unpin_user_pages_dirty_lock(), and test vpfn for each individually.
> >>
> >> This instead wants to batch the vpfn accounted pfns using the range
> >> helper added for the mapping patch,
> >
> > Yes. We use vpfn_pages() just to track the locked pages.
> >
> >> infer that continuous pfns have the
> >> same invalid/reserved state, the pages are sequential, and that we can
> >> use the end of the folio to mark any inflections in those assumptions
> >> otherwise. Do I have that correct?
> >
> > Yes. I think we're definitely on the same page here.
> >
> >> I think this could be split into two patches, one simply batching the
> >> vpfn accounting and the next introducing the folio dependency. The
> >> contributions of each to the overall performance improvement would be
> >> interesting.
> >
> > I've made an initial attempt, and here are the two patches after
> > splitting them up.
> >
> > 1. batch-vpfn-accounting-patch:
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index 28ee4b8d39ae..c8ddcee5aa68 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > unsigned long pfn, unsigned long npage,
> > bool do_accounting)
> > {
> > - long unlocked = 0, locked = 0;
> > + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> > long i;
> >
> > - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > - if (put_pfn(pfn++, dma->prot)) {
> > + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > + if (put_pfn(pfn++, dma->prot))
> > unlocked++;
> > - if (vfio_find_vpfn(dma, iova))
> > - locked++;
> > - }
> > - }
> >
> > if (do_accounting)
> > vfio_lock_acct(dma, locked - unlocked, true);
> > -----------------
> >
> > 2. large-folio-optimization-patch:
> >
> > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > index c8ddcee5aa68..48c2ba4ba4eb 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> > return true;
> > }
> >
> > -static int put_pfn(unsigned long pfn, int prot)
> > +static inline void _put_pfns(struct page *page, int npages, int prot)
> > {
> > - if (!is_invalid_reserved_pfn(pfn)) {
> > - struct page *page = pfn_to_page(pfn);
> > + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> > +}
> >
> > - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> > - return 1;
> > +/*
> > + * The caller must ensure that these npages PFNs belong to the same folio.
> > + */
> > +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> > +{
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + _put_pfns(pfn_to_page(pfn), npages, prot);
> > + return npages;
> > }
> > return 0;
> > }
> >
> > +static inline int put_pfn(unsigned long pfn, int prot)
> > +{
> > + return put_pfns(pfn, 1, prot);
> > +}
> > +
> > #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> >
> > static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> > @@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > bool do_accounting)
> > {
> > long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> > - long i;
> >
> > - for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > - if (put_pfn(pfn++, dma->prot))
> > - unlocked++;
> > + while (npage) {
> > + long step = 1;
> > +
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + struct page *page = pfn_to_page(pfn);
> > + struct folio *folio = page_folio(page);
> > + long nr_pages = folio_nr_pages(folio);
> > +
> > + if (nr_pages > 1)
> > + step = min_t(long, npage,
> > + nr_pages -
> > + folio_page_idx(folio, page));
> > +
> > + _put_pfns(page, step, dma->prot);
>
> I'm confused, why do we batch pages by looking at the folio, to then
> pass the pages into unpin_user_page_range_dirty_lock?
>
> Why does the folio relationship matter at all here?
>
> Aren't we making the same mistake that we are jumping over pages we
> shouldn't be jumping over, because we assume they belong to that folio?
That's my concern as well. On the mapping side we had an array of
pages from gup and we tested each page in the gup array relative to the
folio pages. I think that's because the gup array could have
non-sequential pages, but aiui the folio should have sequential
pages(?). Here I think we're trying to assume that sequential pfns
result in sequential pages, and that folios have sequential pages, so
the folio just gives us a point to look for changes in invalid/reserved.
Is that valid? Thanks,
Alex
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: David Hildenbrand @ 2025-06-13 16:08 UTC
To: Alex Williamson; +Cc: lizhe.67, kvm, linux-kernel, peterx
On 13.06.25 16:16, Alex Williamson wrote:
> On Fri, 13 Jun 2025 15:37:40 +0200
> David Hildenbrand <david@redhat.com> wrote:
>
>> On 13.06.25 08:29, lizhe.67@bytedance.com wrote:
>>> On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@redhat.com wrote:
>>>
>>>>> drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
>>>>> 1 file changed, 41 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index 28ee4b8d39ae..2f6c0074d7b3 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>>>> return true;
>>>>> }
>>>>>
>>>>> -static int put_pfn(unsigned long pfn, int prot)
>>>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>>>> {
>>>>> - if (!is_invalid_reserved_pfn(pfn)) {
>>>>> - struct page *page = pfn_to_page(pfn);
>>>>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>>>> +}
>>>>>
>>>>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>>>> - return 1;
>>>>> +/*
>>>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>>>> + */
>>>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>>>> +{
>>>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>>>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>>>>> + return npages;
>>>>> }
>>>>> return 0;
>>>>> }
>>>>>
>>>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>>>> +{
>>>>> + return put_pfns(pfn, 1, prot);
>>>>> +}
>>>>> +
>>>>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>>>
>>>>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>>>> @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>>>> unsigned long pfn, unsigned long npage,
>>>>> bool do_accounting)
>>>>> {
>>>>> - long unlocked = 0, locked = 0;
>>>>> - long i;
>>>>> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>>>>
>>>>> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>>>>> - if (put_pfn(pfn++, dma->prot)) {
>>>>> - unlocked++;
>>>>> - if (vfio_find_vpfn(dma, iova))
>>>>> - locked++;
>>>>> + while (npage) {
>>>>> + struct folio *folio;
>>>>> + struct page *page;
>>>>> + long step = 1;
>>>>> +
>>>>> + if (is_invalid_reserved_pfn(pfn))
>>>>> + goto next;
>>>>> +
>>>>> + page = pfn_to_page(pfn);
>>>>> + folio = page_folio(page);
>>>>> +
>>>>> + if (!folio_test_large(folio)) {
>>>>> + _put_pfns(page, 1, dma->prot);
>>>>> + } else {
>>>>> + step = min_t(long, npage,
>>>>> + folio_nr_pages(folio) -
>>>>> + folio_page_idx(folio, page));
>>>>> + _put_pfns(page, step, dma->prot);
>>>>> }
>>>>> +
>>>>> + unlocked += step;
>>>>> +next:
>>>>
>>>> Usage of @step is inconsistent, goto isn't really necessary either, how
>>>> about:
>>>>
>>>> while (npage) {
>>>> unsigned long step = 1;
>>>>
>>>> if (!is_invalid_reserved_pfn(pfn)) {
>>>> struct page *page = pfn_to_page(pfn);
>>>> struct folio *folio = page_folio(page);
>>>> long nr_pages = folio_nr_pages(folio);
>>>>
>>>> if (nr_pages > 1)
>>>> step = min_t(long, npage,
>>>> nr_pages -
>>>> folio_page_idx(folio, page));
>>>>
>>>> _put_pfns(page, step, dma->prot);
>>>> unlocked += step;
>>>> }
>>>>
>>>
>>> That's great. This implementation is much better.
>>>
>>> I'm a bit uncertain about the best type to use for the 'step'
>>> variable here. I've been trying to keep things consistent with the
>>> put_pfn() function, so I set the type of the second parameter in
>>> _put_pfns() to 'int'(we pass 'step' as the second argument to
>>> _put_pfns()).
>>>
>>> Using unsigned long for 'step' should definitely work here, as the
>>> number of pages in a large folio currently falls within the range
>>> that can be represented by an int. However, there is still a
>>> potential risk of truncation that we need to be mindful of.
>>>
>>>>> + pfn += step;
>>>>> + iova += PAGE_SIZE * step;
>>>>> + npage -= step;
>>>>> }
>>>>>
>>>>> if (do_accounting)
>>>>
>>>> AIUI, the idea is that we know we have npage contiguous pfns and we
>>>> currently test invalid/reserved, call pfn_to_page(), call
>>>> unpin_user_pages_dirty_lock(), and test vpfn for each individually.
>>>>
>>>> This instead wants to batch the vpfn accounted pfns using the range
>>>> helper added for the mapping patch,
>>>
>>> Yes. We use vpfn_pages() just to track the locked pages.
>>>
>>>> infer that continuous pfns have the
>>>> same invalid/reserved state, the pages are sequential, and that we can
>>>> use the end of the folio to mark any inflections in those assumptions
>>>> otherwise. Do I have that correct?
>>>
>>> Yes. I think we're definitely on the same page here.
>>>
>>>> I think this could be split into two patches, one simply batching the
>>>> vpfn accounting and the next introducing the folio dependency. The
>>>> contributions of each to the overall performance improvement would be
>>>> interesting.
>>>
>>> I've made an initial attempt, and here are the two patches after
>>> splitting them up.
>>>
>>> 1. batch-vpfn-accounting-patch:
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index 28ee4b8d39ae..c8ddcee5aa68 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>> unsigned long pfn, unsigned long npage,
>>> bool do_accounting)
>>> {
>>> - long unlocked = 0, locked = 0;
>>> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> long i;
>>>
>>> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
>>> - if (put_pfn(pfn++, dma->prot)) {
>>> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
>>> + if (put_pfn(pfn++, dma->prot))
>>> unlocked++;
>>> - if (vfio_find_vpfn(dma, iova))
>>> - locked++;
>>> - }
>>> - }
>>>
>>> if (do_accounting)
>>> vfio_lock_acct(dma, locked - unlocked, true);
>>> -----------------
>>>
>>> 2. large-folio-optimization-patch:
>>>
>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>> index c8ddcee5aa68..48c2ba4ba4eb 100644
>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>>> return true;
>>> }
>>>
>>> -static int put_pfn(unsigned long pfn, int prot)
>>> +static inline void _put_pfns(struct page *page, int npages, int prot)
>>> {
>>> - if (!is_invalid_reserved_pfn(pfn)) {
>>> - struct page *page = pfn_to_page(pfn);
>>> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
>>> +}
>>>
>>> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
>>> - return 1;
>>> +/*
>>> + * The caller must ensure that these npages PFNs belong to the same folio.
>>> + */
>>> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
>>> +{
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + _put_pfns(pfn_to_page(pfn), npages, prot);
>>> + return npages;
>>> }
>>> return 0;
>>> }
>>>
>>> +static inline int put_pfn(unsigned long pfn, int prot)
>>> +{
>>> + return put_pfns(pfn, 1, prot);
>>> +}
>>> +
>>> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>>>
>>> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
>>> @@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
>>> bool do_accounting)
>>> {
>>> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
>>> - long i;
>>>
>>> - for (i = 0; i < npage; i++, iova += PAGE_SIZE)
>>> - if (put_pfn(pfn++, dma->prot))
>>> - unlocked++;
>>> + while (npage) {
>>> + long step = 1;
>>> +
>>> + if (!is_invalid_reserved_pfn(pfn)) {
>>> + struct page *page = pfn_to_page(pfn);
>>> + struct folio *folio = page_folio(page);
>>> + long nr_pages = folio_nr_pages(folio);
>>> +
>>> + if (nr_pages > 1)
>>> + step = min_t(long, npage,
>>> + nr_pages -
>>> + folio_page_idx(folio, page));
>>> +
>>> + _put_pfns(page, step, dma->prot);
>>
>> I'm confused, why do we batch pages by looking at the folio, to then
>> pass the pages into unpin_user_page_range_dirty_lock?
>>
>> Why does the folio relationship matter at all here?
>>
>> Aren't we making the same mistake that we are jumping over pages we
>> shouldn't be jumping over, because we assume they belong to that folio?
>
> That's my concern as well. On the mapping side we had an array of
> pages from gup and we tested each page in the gup array relative to the
> folio pages. I think that's because the gup array could have
> non-sequential pages, but aiui the folio should have sequential
> pages(?). Here I think we're trying to assume that sequential pfns
> results in sequential pages and folios should have sequential pages, so
> the folio just gives us a point to look for changes in invalid/reserved.
>
> Is that valid? Thanks,
Oh, it is!
I thought we would be iterating pages, but if we are iterating PFNs it
should be fine.
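A debug-style check makes that invariant explicit. A minimal sketch,
assuming only that a folio's struct pages are sequential in the memmap
(this helper is hypothetical, not part of the patch):

	/* Assumption relied on above: every pfn in [pfn, pfn + step)
	 * still belongs to the folio found at the first pfn. */
	static void assert_pfn_range_in_folio(unsigned long pfn, long step)
	{
		struct folio *folio = page_folio(pfn_to_page(pfn));
		long i;

		for (i = 1; i < step; i++)
			WARN_ON(page_folio(pfn_to_page(pfn + i)) != folio);
	}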
--
Cheers,
David / dhildenb
* Re: [RFC v2] vfio/type1: optimize vfio_unpin_pages_remote() for large folio
From: Alex Williamson @ 2025-06-13 17:38 UTC
To: lizhe.67; +Cc: david, kvm, linux-kernel, peterx
On Fri, 13 Jun 2025 14:29:20 +0800
lizhe.67@bytedance.com wrote:
> On Thu, 12 Jun 2025 16:32:39 -0600, alex.williamson@redhat.com wrote:
>
> > > drivers/vfio/vfio_iommu_type1.c | 53 +++++++++++++++++++++++++--------
> > > 1 file changed, 41 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> > > index 28ee4b8d39ae..2f6c0074d7b3 100644
> > > --- a/drivers/vfio/vfio_iommu_type1.c
> > > +++ b/drivers/vfio/vfio_iommu_type1.c
> > > @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> > > return true;
> > > }
> > >
> > > -static int put_pfn(unsigned long pfn, int prot)
> > > +static inline void _put_pfns(struct page *page, int npages, int prot)
> > > {
> > > - if (!is_invalid_reserved_pfn(pfn)) {
> > > - struct page *page = pfn_to_page(pfn);
> > > + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> > > +}
> > >
> > > - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> > > - return 1;
> > > +/*
> > > + * The caller must ensure that these npages PFNs belong to the same folio.
> > > + */
> > > +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> > > +{
> > > + if (!is_invalid_reserved_pfn(pfn)) {
> > > + _put_pfns(pfn_to_page(pfn), npages, prot);
> > > + return npages;
> > > }
> > > return 0;
> > > }
> > >
> > > +static inline int put_pfn(unsigned long pfn, int prot)
> > > +{
> > > + return put_pfns(pfn, 1, prot);
> > > +}
> > > +
> > > #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
> > >
> > > static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> > > @@ -805,15 +816,33 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> > > unsigned long pfn, unsigned long npage,
> > > bool do_accounting)
> > > {
> > > - long unlocked = 0, locked = 0;
> > > - long i;
> > > + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> > >
> > > - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > > - if (put_pfn(pfn++, dma->prot)) {
> > > - unlocked++;
> > > - if (vfio_find_vpfn(dma, iova))
> > > - locked++;
> > > + while (npage) {
> > > + struct folio *folio;
> > > + struct page *page;
> > > + long step = 1;
> > > +
> > > + if (is_invalid_reserved_pfn(pfn))
> > > + goto next;
> > > +
> > > + page = pfn_to_page(pfn);
> > > + folio = page_folio(page);
> > > +
> > > + if (!folio_test_large(folio)) {
> > > + _put_pfns(page, 1, dma->prot);
> > > + } else {
> > > + step = min_t(long, npage,
> > > + folio_nr_pages(folio) -
> > > + folio_page_idx(folio, page));
> > > + _put_pfns(page, step, dma->prot);
> > > }
> > > +
> > > + unlocked += step;
> > > +next:
> >
> > Usage of @step is inconsistent, goto isn't really necessary either, how
> > about:
> >
> > while (npage) {
> > unsigned long step = 1;
> >
> > if (!is_invalid_reserved_pfn(pfn)) {
> > struct page *page = pfn_to_page(pfn);
> > struct folio *folio = page_folio(page);
> > long nr_pages = folio_nr_pages(folio);
> >
> > if (nr_pages > 1)
> > step = min_t(long, npage,
> > nr_pages -
> > folio_page_idx(folio, page));
> >
> > _put_pfns(page, step, dma->prot);
> > unlocked += step;
> > }
> >
>
> That's great. This implementation is much better.
>
> I'm a bit uncertain about the best type to use for the 'step'
> variable here. I've been trying to keep things consistent with the
> put_pfn() function, so I set the type of the second parameter in
> _put_pfns() to 'int'(we pass 'step' as the second argument to
> _put_pfns()).
>
> Using unsigned long for 'step' should definitely work here, as the
> number of pages in a large folio currently falls within the range
> that can be represented by an int. However, there is still a
> potential risk of truncation that we need to be mindful of.
>
> > > + pfn += step;
> > > + iova += PAGE_SIZE * step;
> > > + npage -= step;
> > > }
> > >
> > > if (do_accounting)
> >
> > AIUI, the idea is that we know we have npage contiguous pfns and we
> > currently test invalid/reserved, call pfn_to_page(), call
> > unpin_user_pages_dirty_lock(), and test vpfn for each individually.
> >
> > This instead wants to batch the vpfn accounted pfns using the range
> > helper added for the mapping patch,
>
> Yes. We use vpfn_pages() just to track the locked pages.
>
> > infer that continuous pfns have the
> > same invalid/reserved state, the pages are sequential, and that we can
> > use the end of the folio to mark any inflections in those assumptions
> > otherwise. Do I have that correct?
>
> Yes. I think we're definitely on the same page here.
>
> > I think this could be split into two patches, one simply batching the
> > vpfn accounting and the next introducing the folio dependency. The
> > contributions of each to the overall performance improvement would be
> > interesting.
>
> I've made an initial attempt, and here are the two patches after
> splitting them up.
>
> 1. batch-vpfn-accounting-patch:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index 28ee4b8d39ae..c8ddcee5aa68 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -805,16 +805,12 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> unsigned long pfn, unsigned long npage,
> bool do_accounting)
> {
> - long unlocked = 0, locked = 0;
> + long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> long i;
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> - if (put_pfn(pfn++, dma->prot)) {
> + for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> + if (put_pfn(pfn++, dma->prot))
> unlocked++;
> - if (vfio_find_vpfn(dma, iova))
> - locked++;
> - }
> - }
>
> if (do_accounting)
> vfio_lock_acct(dma, locked - unlocked, true);
> -----------------
>
> 2. large-folio-optimization-patch:
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index c8ddcee5aa68..48c2ba4ba4eb 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -469,17 +469,28 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> return true;
> }
>
> -static int put_pfn(unsigned long pfn, int prot)
> +static inline void _put_pfns(struct page *page, int npages, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> + unpin_user_page_range_dirty_lock(page, npages, prot & IOMMU_WRITE);
> +}
>
> - unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
> - return 1;
> +/*
> + * The caller must ensure that these npages PFNs belong to the same folio.
> + */
> +static inline int put_pfns(unsigned long pfn, int npages, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + _put_pfns(pfn_to_page(pfn), npages, prot);
> + return npages;
> }
> return 0;
> }
>
> +static inline int put_pfn(unsigned long pfn, int prot)
> +{
> + return put_pfns(pfn, 1, prot);
> +}
> +
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
> static void __vfio_batch_init(struct vfio_batch *batch, bool single)
> @@ -806,11 +817,28 @@ static long vfio_unpin_pages_remote(struct vfio_dma *dma, dma_addr_t iova,
> bool do_accounting)
> {
> long unlocked = 0, locked = vpfn_pages(dma, iova, npage);
> - long i;
>
> - for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> - if (put_pfn(pfn++, dma->prot))
> - unlocked++;
> + while (npage) {
> + long step = 1;
> +
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + struct folio *folio = page_folio(page);
> + long nr_pages = folio_nr_pages(folio);
> +
> + if (nr_pages > 1)
> + step = min_t(long, npage,
> + nr_pages -
> + folio_page_idx(folio, page));
> +
> + _put_pfns(page, step, dma->prot);
> + unlocked += step;
> + }
> +
> + pfn += step;
> + iova += PAGE_SIZE * step;
> + npage -= step;
> + }
>
> if (do_accounting)
> vfio_lock_acct(dma, locked - unlocked, true);
> -----------------
>
> Here are the results of the performance tests.
>
> Base(v6.15):
> ./vfio-pci-mem-dma-map 0000:03:00.0 16
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.048 s (333.5 GB/s)
> VFIO UNMAP DMA in 0.139 s (115.1 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.273 s (58.6 GB/s)
> VFIO UNMAP DMA in 0.302 s (52.9 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.052 s (305.3 GB/s)
> VFIO UNMAP DMA in 0.141 s (113.8 GB/s)
>
> Base + Map + batch-vpfn-accounting-patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.027 s (591.1 GB/s)
> VFIO UNMAP DMA in 0.138 s (115.7 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.292 s (54.8 GB/s)
> VFIO UNMAP DMA in 0.308 s (52.0 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.032 s (505.5 GB/s)
> VFIO UNMAP DMA in 0.140 s (114.1 GB/s)
>
> Base + Map + batch-vpfn-accounting-patch + large-folio-optimization-patch:
> ------- AVERAGE (MADV_HUGEPAGE) --------
> VFIO MAP DMA in 0.027 s (591.2 GB/s)
> VFIO UNMAP DMA in 0.049 s (327.6 GB/s)
> ------- AVERAGE (MAP_POPULATE) --------
> VFIO MAP DMA in 0.291 s (55.0 GB/s)
> VFIO UNMAP DMA in 0.306 s (52.3 GB/s)
> ------- AVERAGE (HUGETLBFS) --------
> VFIO MAP DMA in 0.032 s (498.3 GB/s)
> VFIO UNMAP DMA in 0.049 s (326.2 GB/s)
>
> Batching the vpfn accounting doesn't seem to have much of an impact
> in my environment. Perhaps this is because the rbtree for vpfn is
> empty, allowing vfio_find_vpfn() to execute quickly?
Right, the rbtree is generally empty, but I thought it might still have
some benefit. It might, but it's probably below the noise threshold of
the test. I think it still makes sense to split the patches; the first
change is logically separate and the second patch builds on it.
long seems fine for the type of step. Do note, though, that the previous
patch on the mapping side used nr_pages as the name of the increment
size; it might make sense to be consistent here: replace "step" with
"nr_pages" and use a different name for the folio's page count.
Also please add a comment explaining the use of the folio as a basis
for inferring that we won't have an inflection in invalid/reserved
state for the remaining extent of the folio and that the folio has
sequential pages and therefore reflects the contiguous pfn range.
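One possible wording for such a comment, offered only as a sketch:

	/*
	 * A contiguous pfn range within a folio maps to sequential
	 * struct pages sharing the same invalid/reserved state, so the
	 * folio boundary is the only point at which that state (or the
	 * page contiguity) can change.  Unpin up to the end of the
	 * folio, or of the requested range, in one call.
	 */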
Thanks,
Alex