kvm.vger.kernel.org archive mirror
* [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages
@ 2025-12-23 23:00 Aaron Lewis
  2025-12-23 23:00 ` [RFC PATCH 1/2] " Aaron Lewis
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Aaron Lewis @ 2025-12-23 23:00 UTC (permalink / raw)
  To: alex.williamson, jgg, dmatlack; +Cc: kvm, seanjc, Aaron Lewis

This RFC explores the current state of DMA mapping performance across
vfio and proposes an implementation that improves huge page mapping
performance for "vfio_type1_iommu".

In putting this together, performance metrics were gathered for the
IOMMU modes vfio_type1_iommu, iommufd_compat_type1, and iommufd using
the selftest "vfio_dma_mapping_perf_test" (included in this series).

These changes were developed on the branch "vfio/next" in the repo:
 - https://github.com/awilliam/linux-vfio

The optimization demonstrated in patch 1/2 shows a >300x speedup when
pinning gigantic pages with "vfio_type1_iommu".  More work will be
needed to improve iommufd's mapping performance for gigantic pages, but
a callstack showing the slow path is included in that patch to help
drive the conversation forward.

The iommu mode "iommufd_compat_type1" lags much farther behind the other
two.  If the intention is for it to perform on par (or near par) with
them, I can attach a callstack in a follow-up to see if there is any
low-hanging fruit.  As it stands, this iommu mode is an order of
magnitude slower than the others.

This is being sent as an RFC because, while there is a proposed solution
for "vfio_type1_iommu", there are no solutions yet for the other two
iommu modes.  Patch 1/2 includes a callstack showing where the latency
issues are for iommufd, but I haven't posted one for
"iommufd_compat_type1".  I'm also not clear on the intention for
"iommufd_compat_type1" w.r.t. this issue, especially given it is so much
slower than the others.

Aaron Lewis (2):
  vfio: Improve DMA mapping performance for huge pages
  vfio: selftest: Add vfio_dma_mapping_perf_test

 drivers/vfio/vfio_iommu_type1.c               |  37 ++-
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/vfio_dma_mapping_perf_test.c         | 247 ++++++++++++++++++
 3 files changed, 277 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_perf_test.c

-- 
2.52.0.351.gbe84eed79e-goog


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-23 23:00 [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Aaron Lewis
@ 2025-12-23 23:00 ` Aaron Lewis
  2025-12-24  2:10   ` Jason Gunthorpe
  2025-12-23 23:00 ` [RFC PATCH 2/2] vfio: selftest: Add vfio_dma_mapping_perf_test Aaron Lewis
  2025-12-24  2:04 ` [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Jason Gunthorpe
  2 siblings, 1 reply; 10+ messages in thread
From: Aaron Lewis @ 2025-12-23 23:00 UTC (permalink / raw)
  To: alex.williamson, jgg, dmatlack; +Cc: kvm, seanjc, Aaron Lewis

Huge pages are pinned at 4K granularity.  That means when pinning a 1G
page, 2^18 pages are pinned one at a time.  This adds needless toil and
results in high latencies.

To improve this, increase the number of pages the batch operates on to
the number of 4K pages in the huge page.  That allows huge pages to be
pinned in larger chunks, reducing the number of individual pages being
pinned one at a time.
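
For scale (illustrative userspace arithmetic only; not part of the
change):

	#include <stdio.h>

	int main(void)
	{
		/* A 1G huge page spans 2^18 4K pages. */
		const unsigned long pages_per_1g = (1UL << 30) >> 12;

		printf("4K pages pinned individually today: %lu\n", pages_per_1g);
		printf("chunks pinned with the override:    1\n");
		return 0;
	}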

This results in a major speedup vs the baseline, as demonstrated by the
selftest "vfio_dma_mapping_perf_test".

Using this selftest and a sampling profiler it is easy to see where all
the time is being spent: vfio_pin_pages_remote().  Optimizing that path
led to the gains called out below.

 Samples     Percentage  Name
 ---------------------------------------------
 30,947,392      0.209%  vfs_ioctl (ioctl.c) Inlined
 30,947,392      0.209%  vfio_fops_unl_ioctl (container.c)
 30,947,392      0.209%  vfio_iommu_type1_ioctl (vfio_iommu_type1.c)
 30,947,392      0.209%  vfio_iommu_type1_map_dma (vfio_iommu_type1.c) Inlined
 30,947,392      0.209%  vfio_dma_do_map (vfio_iommu_type1.c) Inlined
 30,947,392      0.209%  vfio_pin_map_dma (vfio_iommu_type1.c)
 30,947,392      0.209%  vfio_pin_pages_remote (vfio_iommu_type1.c)
 29,616,402      0.200%  vaddr_get_pfns (vfio_iommu_type1.c)
 29,616,402      0.200%  internal_get_user_pages_fast (gup.c)
 29,616,402      0.200%  __gup_longterm_locked (gup.c)
 28,962,361      0.195%  __get_user_pages_locked (gup.c) Inlined
 28,962,361      0.195%  __get_user_pages (gup.c)
 25,548,889      0.172%  follow_page_mask (gup.c)
 24,852,062      0.168%  follow_p4d_mask (gup.c) Inlined
 24,852,062      0.168%  follow_pud_mask (gup.c) Inlined
 17,683,506      0.119%  follow_devmap_pud (huge_memory.c)
 6,584,772       0.044%  pud_lock (mm.h) Inlined


 Results when mapping 8G of memory:
 ----------------------------------

Baseline
 - vfio_type1_iommu:      2.87ms
 - iommufd_compat_type1: 53.23ms
 - iommufd:               4.53ms

With fast huge page pinning:
 - vfio_type1_iommu:      0.01ms
 - improvements to iommufd_compat_type1 and iommufd are tbd.


 Results when mapping 256G of memory:
 ------------------------------------

Baseline
 - vfio_type1_iommu:       99.36ms
 - iommufd_compat_type1: 1576.51ms
 - iommufd:               144.65ms

With fast huge page pinning:
 - vfio_type1_iommu:        0.20ms
 - improvements to iommufd_compat_type1 and iommufd are tbd.

Based on these results, vfio_type1_iommu sees roughly a 300x speedup
when mapping 8G (2.87ms -> 0.01ms) and nearly a 500x speedup when
mapping 256G (99.36ms -> 0.20ms).

As of now there is only a proposal to speed up "vfio_type1_iommu".


 IOMMUFD:
 --------

More effort will be needed to see what kind of speedups can be achieved
by optimizing iommufd.  Sample profiler results (below) show that it
isn't the GUP calls that are slowing it down, as they were in the
"vfio_type1_iommu" case.  The majority of the slowdown comes from
batch_from_pages(), and the majority of that time is spent in
batch_add_pfn_num().

 Samples     Percentage  Name
 ---------------------------------------------
 2,710,320,547   10.22%  iommufd_fops_ioctl (main.c)
 2,710,320,547   10.22%  iommufd_ioas_map (ioas.c)
 2,710,320,547   10.22%  iopt_map_user_pages (io_pagetable.c)
 2,710,320,547   10.22%  iopt_map_common (io_pagetable.c)
 2,710,320,547   10.22%  iopt_map_pages (io_pagetable.c)
 2,710,320,547   10.22%  iopt_fill_domains_pages (io_pagetable.c) Inlined
 2,710,320,547   10.22%  iopt_area_fill_domains (pages.c)
 2,710,320,547   10.22%  pfn_reader_first (pages.c)
 2,710,320,547   10.22%  pfn_reader_next (pages.c)
 2,709,376,056   10.21%  pfn_reader_fill_span (pages.c) Inlined
   2,435,947,359   9.182%  batch_from_pages (pages.c) Inlined
     1,864,604,611  7.028%   batch_add_pfn (pages.c) Inlined
     1,864,604,611  7.028%   batch_add_pfn_num (pages.c) Inlined
   273,428,697     1.031%  pfn_reader_user_pin (pages.c)
     271,538,567     1.024%  gup_fast_fallback (gup.c)

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 drivers/vfio/vfio_iommu_type1.c | 37 ++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5167bec14e363..c1f7c2600a2d0 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -108,6 +108,7 @@ struct vfio_batch {
 	unsigned int		capacity;	/* length of pages array */
 	unsigned int		size;		/* of batch currently */
 	unsigned int		offset;		/* of next entry in pages */
+	bool			override_size;	/* has size been overridden */
 };
 
 struct vfio_iommu_group {
@@ -493,6 +494,7 @@ static int put_pfn(unsigned long pfn, int prot)
 
 static void __vfio_batch_init(struct vfio_batch *batch, bool single)
 {
+	batch->override_size = false;
 	batch->size = 0;
 	batch->offset = 0;
 
@@ -598,7 +600,18 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
 	ret = pin_user_pages_remote(mm, vaddr, pin_pages, flags | FOLL_LONGTERM,
 				    batch->pages, NULL);
 	if (ret > 0) {
+		unsigned long nr_pages = compound_nr(batch->pages[0]);
+		bool override_size = false;
+
+		if (PageHead(batch->pages[0]) && nr_pages > pin_pages &&
+		    ret == pin_pages) {
+			override_size = true;
+			ret = nr_pages;
+			page_ref_add(batch->pages[0], nr_pages - pin_pages);
+		}
+
 		*pfn = page_to_pfn(batch->pages[0]);
+		batch->override_size = override_size;
 		batch->size = ret;
 		batch->offset = 0;
 		goto done;
@@ -668,6 +681,20 @@ static long vpfn_pages(struct vfio_dma *dma,
 	return ret;
 }
 
+static long vfio_batch_num_pages_contiguous(struct vfio_batch *batch)
+{
+	if (batch->override_size) {
+		return batch->size;
+	}
+
+	/*
+	 * Using GUP with the FOLL_LONGTERM in vaddr_get_pfns() will not
+	 * return invalid or reserved pages.
+	 */
+	return num_pages_contiguous(&batch->pages[batch->offset],
+				    batch->size);
+}
+
 /*
  * Attempt to pin pages.  We really don't want to track all the pfns and
  * the iommu can only map chunks of consecutive pfns anyway, so get the
@@ -747,14 +774,8 @@ static long vfio_pin_pages_remote(struct vfio_dma *dma, unsigned long vaddr,
 			    rsvd != is_invalid_reserved_pfn(pfn))
 				goto out;
 
-			/*
-			 * Using GUP with the FOLL_LONGTERM in
-			 * vaddr_get_pfns() will not return invalid
-			 * or reserved pages.
-			 */
-			nr_pages = num_pages_contiguous(
-					&batch->pages[batch->offset],
-					batch->size);
+			nr_pages = vfio_batch_num_pages_contiguous(batch);
+
 			if (!rsvd) {
 				acct_pages = nr_pages;
 				acct_pages -= vpfn_pages(dma, iova, nr_pages);
-- 
2.52.0.351.gbe84eed79e-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RFC PATCH 2/2] vfio: selftest: Add vfio_dma_mapping_perf_test
  2025-12-23 23:00 [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Aaron Lewis
  2025-12-23 23:00 ` [RFC PATCH 1/2] " Aaron Lewis
@ 2025-12-23 23:00 ` Aaron Lewis
  2025-12-24  2:04 ` [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Jason Gunthorpe
  2 siblings, 0 replies; 10+ messages in thread
From: Aaron Lewis @ 2025-12-23 23:00 UTC (permalink / raw)
  To: alex.williamson, jgg, dmatlack; +Cc: kvm, seanjc, Aaron Lewis

Add a selftest to help investigate latencies when mapping pages in vfio.

To disambiguate the prefetch time from the pinning time, the mmap()
flag MAP_POPULATE is used.  With prefetching done in mmap(), the
pinning latencies when mapping gigantic pages stand out on their own.
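
For reference, the shape of the mapping used by the 1G HugeTLB variant
(sketch only; the flags mirror the test below and error handling is
omitted):

	void *mem = mmap(NULL, 8UL << 30, PROT_READ | PROT_WRITE,
			 MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE |
			 MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);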

For this test 8G of memory is pinned to keep the test responsive.
Pinning more memory, e.g. 256G, could result in the test timing out,
since the majority of that time is spent prefetching the memory in
mmap().

This test has 4 main phases: mmap(), iova_alloc(), iommu_map(), and
iommu_unmap().  Each of these phases is timed and reported in
milliseconds.
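
Each phase is measured with the usual monotonic-clock pattern (sketch
only; the test below wraps this in timespec_elapsed_ms()):

	struct timespec start, end;
	double ms;

	clock_gettime(CLOCK_MONOTONIC, &start);
	/* phase under test, e.g. iommu_map() */
	clock_gettime(CLOCK_MONOTONIC, &end);
	ms = (end.tv_sec - start.tv_sec) * 1000.0 +
	     (end.tv_nsec - start.tv_nsec) / 1e6;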

This test doesn't set targets for any of its phases, as targets are
error prone and lead to flaky tests.  Instead, the test simply reports
how long each phase took.

Signed-off-by: Aaron Lewis <aaronlewis@google.com>
---
 tools/testing/selftests/vfio/Makefile         |   1 +
 .../vfio/vfio_dma_mapping_perf_test.c         | 247 ++++++++++++++++++
 2 files changed, 248 insertions(+)
 create mode 100644 tools/testing/selftests/vfio/vfio_dma_mapping_perf_test.c

diff --git a/tools/testing/selftests/vfio/Makefile b/tools/testing/selftests/vfio/Makefile
index 3c796ca99a509..134ce40b81790 100644
--- a/tools/testing/selftests/vfio/Makefile
+++ b/tools/testing/selftests/vfio/Makefile
@@ -1,5 +1,6 @@
 CFLAGS = $(KHDR_INCLUDES)
 TEST_GEN_PROGS += vfio_dma_mapping_test
+TEST_GEN_PROGS += vfio_dma_mapping_perf_test
 TEST_GEN_PROGS += vfio_iommufd_setup_test
 TEST_GEN_PROGS += vfio_pci_device_test
 TEST_GEN_PROGS += vfio_pci_device_init_perf_test
diff --git a/tools/testing/selftests/vfio/vfio_dma_mapping_perf_test.c b/tools/testing/selftests/vfio/vfio_dma_mapping_perf_test.c
new file mode 100644
index 0000000000000..c70f6935e0291
--- /dev/null
+++ b/tools/testing/selftests/vfio/vfio_dma_mapping_perf_test.c
@@ -0,0 +1,247 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <stdio.h>
+#include <sys/mman.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <uapi/linux/types.h>
+#include <linux/iommufd.h>
+#include <linux/limits.h>
+#include <linux/mman.h>
+#include <linux/sizes.h>
+#include <linux/time64.h>
+#include <linux/vfio.h>
+
+#include <libvfio.h>
+
+#include "../kselftest_harness.h"
+
+static const char *device_bdf;
+
+struct iommu_mapping {
+	u64 pgd;
+	u64 p4d;
+	u64 pud;
+	u64 pmd;
+	u64 pte;
+};
+
+static s64 to_ns(struct timespec ts)
+{
+	return (s64)ts.tv_nsec + NSEC_PER_SEC * (s64)ts.tv_sec;
+}
+
+static double to_ms(struct timespec ts)
+{
+	return to_ns(ts) / 1000.0 / 1000.0;
+}
+
+static struct timespec to_timespec(s64 ns)
+{
+	struct timespec ts = {
+		.tv_nsec = ns % NSEC_PER_SEC,
+		.tv_sec = ns / NSEC_PER_SEC,
+	};
+
+	return ts;
+}
+
+static struct timespec timespec_sub(struct timespec a, struct timespec b)
+{
+	return to_timespec(to_ns(a) - to_ns(b));
+}
+
+static double timespec_elapsed_ms(struct timespec start)
+{
+	struct timespec end;
+
+	clock_gettime(CLOCK_MONOTONIC, &end);
+	return to_ms(timespec_sub(end, start));
+}
+
+
+static void parse_next_value(char **line, u64 *value)
+{
+	char *token;
+
+	token = strtok_r(*line, " \t|\n", line);
+	if (!token)
+		return;
+
+	/* Caller verifies `value`. No need to check return value. */
+	sscanf(token, "0x%lx", value);
+}
+
+static int intel_iommu_mapping_get(const char *bdf, u64 iova,
+				   struct iommu_mapping *mapping)
+{
+	char iommu_mapping_path[PATH_MAX], line[PATH_MAX];
+	u64 line_iova = -1;
+	int ret = -ENOENT;
+	FILE *file;
+	char *rest;
+
+	snprintf(iommu_mapping_path, sizeof(iommu_mapping_path),
+		 "/sys/kernel/debug/iommu/intel/%s/domain_translation_struct",
+		 bdf);
+
+	file = fopen(iommu_mapping_path, "r");
+	VFIO_ASSERT_NOT_NULL(file, "fopen(%s) failed", iommu_mapping_path);
+
+	while (fgets(line, sizeof(line), file)) {
+		rest = line;
+
+		parse_next_value(&rest, &line_iova);
+		if (line_iova != (iova / getpagesize()))
+			continue;
+
+		/*
+		 * Ensure each struct field is initialized in case of empty
+		 * page table values.
+		 */
+		memset(mapping, 0, sizeof(*mapping));
+		parse_next_value(&rest, &mapping->pgd);
+		parse_next_value(&rest, &mapping->p4d);
+		parse_next_value(&rest, &mapping->pud);
+		parse_next_value(&rest, &mapping->pmd);
+		parse_next_value(&rest, &mapping->pte);
+
+		ret = 0;
+		break;
+	}
+
+	fclose(file);
+
+	return ret;
+}
+
+static int iommu_mapping_get(const char *bdf, u64 iova,
+			     struct iommu_mapping *mapping)
+{
+	if (!access("/sys/kernel/debug/iommu/intel", F_OK))
+		return intel_iommu_mapping_get(bdf, iova, mapping);
+
+	return -EOPNOTSUPP;
+}
+
+FIXTURE(vfio_dma_mapping_perf_test) {
+	struct iommu *iommu;
+	struct vfio_pci_device *device;
+	struct iova_allocator *iova_allocator;
+};
+
+FIXTURE_VARIANT(vfio_dma_mapping_perf_test) {
+	const char *iommu_mode;
+	u64 size;
+	int mmap_flags;
+	const char *file;
+};
+
+#define FIXTURE_VARIANT_ADD_IOMMU_MODE(_iommu_mode, _name, _size, _mmap_flags) \
+FIXTURE_VARIANT_ADD(vfio_dma_mapping_perf_test, _iommu_mode ## _ ## _name) {   \
+	.iommu_mode = #_iommu_mode,					       \
+	.size = (_size),						       \
+	.mmap_flags = MAP_ANONYMOUS | MAP_PRIVATE |			       \
+		      MAP_POPULATE | (_mmap_flags), 			       \
+}
+
+FIXTURE_VARIANT_ADD_ALL_IOMMU_MODES(anonymous_hugetlb_1gb, SZ_1G, MAP_HUGETLB | MAP_HUGE_1GB);
+
+#undef FIXTURE_VARIANT_ADD_IOMMU_MODE
+
+FIXTURE_SETUP(vfio_dma_mapping_perf_test)
+{
+	self->iommu = iommu_init(variant->iommu_mode);
+	self->device = vfio_pci_device_init(device_bdf, self->iommu);
+	self->iova_allocator = iova_allocator_init(self->iommu);
+}
+
+FIXTURE_TEARDOWN(vfio_dma_mapping_perf_test)
+{
+	iova_allocator_cleanup(self->iova_allocator);
+	vfio_pci_device_cleanup(self->device);
+	iommu_cleanup(self->iommu);
+}
+
+TEST_F(vfio_dma_mapping_perf_test, dma_map_unmap)
+{
+	u64 mapping_size = variant->size ?: getpagesize();
+	const u64 size = 8ULL * /*1GB=*/(1ULL << 30);
+	const int flags = variant->mmap_flags;
+	struct dma_region region;
+	struct iommu_mapping mapping;
+	struct timespec start;
+	u64 unmapped;
+	int rc;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	region.vaddr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
+	printf("Mmap duration = %.2lfms\n", timespec_elapsed_ms(start));
+
+	/* Skip the test if there aren't enough HugeTLB pages available. */
+	if (flags & MAP_HUGETLB && region.vaddr == MAP_FAILED)
+		SKIP(return, "mmap() failed: %s (%d)\n", strerror(errno), errno);
+	else
+		ASSERT_NE(region.vaddr, MAP_FAILED);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	region.iova = iova_allocator_alloc(self->iova_allocator, size);
+	region.size = size;
+	printf("IOVA alloc duration = %.2lfms\n", timespec_elapsed_ms(start));
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	iommu_map(self->iommu, &region);
+	printf("DMA map duration = %.2lfms\n", timespec_elapsed_ms(start));
+
+	ASSERT_EQ(region.iova, to_iova(self->device, region.vaddr));
+
+	rc = iommu_mapping_get(device_bdf, region.iova, &mapping);
+	if (rc == -EOPNOTSUPP) {
+		goto unmap;
+	}
+
+
+	/*
+	 * IOMMUFD compatibility-mode does not support huge mappings when
+	 * using VFIO_TYPE1_IOMMU.
+	 */
+	if (!strcmp(variant->iommu_mode, "iommufd_compat_type1"))
+		mapping_size = SZ_4K;
+
+
+	ASSERT_EQ(0, rc);
+
+	switch (mapping_size) {
+	case SZ_4K:
+		ASSERT_NE(0, mapping.pte);
+		break;
+	case SZ_2M:
+		ASSERT_EQ(0, mapping.pte);
+		ASSERT_NE(0, mapping.pmd);
+		break;
+	case SZ_1G:
+		ASSERT_EQ(0, mapping.pte);
+		ASSERT_EQ(0, mapping.pmd);
+		ASSERT_NE(0, mapping.pud);
+		break;
+	default:
+		VFIO_FAIL("Unrecognized size: 0x%lx\n", mapping_size);
+	}
+
+unmap:
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	rc = __iommu_unmap(self->iommu, &region, &unmapped);
+	printf("DMA unmap duration = %.2lfms\n", timespec_elapsed_ms(start));
+	ASSERT_EQ(rc, 0);
+	ASSERT_EQ(unmapped, region.size);
+	ASSERT_NE(0, __to_iova(self->device, region.vaddr, NULL));
+	ASSERT_NE(0, iommu_mapping_get(device_bdf, region.iova, &mapping));
+
+	ASSERT_TRUE(!munmap(region.vaddr, size));
+}
+
+int main(int argc, char *argv[])
+{
+	device_bdf = vfio_selftests_get_bdf(&argc, argv);
+	return test_harness_run(argc, argv);
+}
-- 
2.52.0.351.gbe84eed79e-goog


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-23 23:00 [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Aaron Lewis
  2025-12-23 23:00 ` [RFC PATCH 1/2] " Aaron Lewis
  2025-12-23 23:00 ` [RFC PATCH 2/2] vfio: selftest: Add vfio_dma_mapping_perf_test Aaron Lewis
@ 2025-12-24  2:04 ` Jason Gunthorpe
  2 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2025-12-24  2:04 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, dmatlack, kvm, seanjc

On Tue, Dec 23, 2025 at 11:00:42PM +0000, Aaron Lewis wrote:
> This RFC explores the current state of DMA mapping performance across
> vfio, and proposes an implementation to improve the performance
> for "vfio_type1_iommu" for huge pages.

We are trying to sunset type1, please just use the memfd path on
iommufd which is supposed to be the fastest available option. You can
try to improve iommufd as well.
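
(For concreteness, a rough sketch of backing the region with a hugetlb
memfd that the iommufd file-mapping path can consume; the helper name
is illustrative and error handling is minimal:)

	#define _GNU_SOURCE
	#include <sys/mman.h>
	#include <unistd.h>
	#include <linux/memfd.h>

	/* Sketch: 1G-hugepage-backed file memory that IOMMU_IOAS_MAP_FILE
	 * can map without pinning through process VMAs. */
	static int make_guest_memfd(size_t size)
	{
		int fd = memfd_create("guest-mem", MFD_HUGETLB | MFD_HUGE_1GB);

		if (fd < 0 || ftruncate(fd, size))
			return -1;
		return fd;
	}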

But we definitely should not be making type 1 the best option and then
encouraging people to use it - that is the opposite of what we are
trying to do!

> This is being sent as an RFC because while there is a proposed solution
> for "vfio_type1_iommu", there are no solutions for the other two iommu
> modes.  Attached is a callstack in patch 1/2 showing where the latency
> issues are for iommufd, however, I haven't posted one
> for "iommufd_compat_type1". I'm also not clear on what the intention is
> for "iommufd_compat_type1" w.r.t. this issue.  Especially given it is so
> much slower than the others.

Probably your test is using VFIO_TYPE1_IOMMU for iommufd and comparing
it to VFIO_TYPE1v2_IOMMU for vfio. v0 forces 4k pages at the iommu
domain which is much slower.
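
(If it helps confirm which backend the test ends up with, a rough
userspace check; container_fd here stands for whatever fd the harness
opened for /dev/vfio/vfio:)

	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	/* Prefer the v2 type1 backend when the kernel offers it. */
	static int set_type1_iommu(int container_fd)
	{
		int type = VFIO_TYPE1v2_IOMMU;

		if (ioctl(container_fd, VFIO_CHECK_EXTENSION, type) <= 0)
			type = VFIO_TYPE1_IOMMU;
		return ioctl(container_fd, VFIO_SET_IOMMU, type);
	}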

It is also really weird because I have seen other people reporting a
30x speedup using iommufd and here you are reporting it is the slowest
option?

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-23 23:00 ` [RFC PATCH 1/2] " Aaron Lewis
@ 2025-12-24  2:10   ` Jason Gunthorpe
  2025-12-29 21:40     ` Aaron Lewis
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2025-12-24  2:10 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, dmatlack, kvm, seanjc

On Tue, Dec 23, 2025 at 11:00:43PM +0000, Aaron Lewis wrote:

> More effort will be needed to see what kind of speed ups can be achieved
> by optimizing iommufd.  Sample profiler results (below) show that it
> isn't the GUP calls that are slowing it down like they were in the
> "vfio_type1_iommu" case.  The majority of the slowdown is coming from
> batch_from_pages(), and the majority of that time is being spent in
> batch_add_pfn_num().

This is probably because vfio is now using num_pages_contiguous(),
which seems to be a little bit faster than batch_add_pfn().

Since that is pretty new maybe it explains the discrepancy in reports.
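
(Roughly what a helper like num_pages_contiguous() computes - an
illustrative sketch, not the kernel's implementation:)

	static size_t num_pages_contiguous_sketch(struct page **pages,
						  size_t nr_pages)
	{
		unsigned long pfn = page_to_pfn(pages[0]);
		size_t i;

		/* Count the leading run of physically consecutive PFNs. */
		for (i = 1; i < nr_pages; i++)
			if (page_to_pfn(pages[i]) != pfn + i)
				break;

		return i;
	}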

> @@ -598,7 +600,18 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
>  	ret = pin_user_pages_remote(mm, vaddr, pin_pages, flags | FOLL_LONGTERM,
>  				    batch->pages, NULL);
>  	if (ret > 0) {
> +		unsigned long nr_pages = compound_nr(batch->pages[0]);
> +		bool override_size = false;
> +
> +		if (PageHead(batch->pages[0]) && nr_pages > pin_pages &&
> +		    ret == pin_pages) {
> +			override_size = true;
> +			ret = nr_pages;
> +			page_ref_add(batch->pages[0], nr_pages - pin_pages);
> +		}

This isn't right, num_pages_contiguous() is the best we can do for
lists returned by pin_user_pages(). In a VMA context you cannot blindly
assume the whole folio was mapped contiguously. Indeed I seem to
recall this was already proposed and rejected and that is how we ended
up with num_pages_contiguous().

Again, use the memfd path which supports this optimization already.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-24  2:10   ` Jason Gunthorpe
@ 2025-12-29 21:40     ` Aaron Lewis
  2025-12-30  1:12       ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Lewis @ 2025-12-29 21:40 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: alex.williamson, dmatlack, kvm, seanjc

I tried the memfd path on iommufd and it is indeed fast.  It was in
the same ballpark as the optimization I posted for VFIO_TYPE1_IOMMU.
I also tried it with a HugeTLB fd, e.g. /mnt/huge/tmp.bin, and it was
fast too.  I haven't had a chance to try it with DevDax w/ 1G pages,
though.  I noticed DevDax was quite a bit slower than HugeTLB when I
tested it on VFIO_TYPE1_IOMMU.  Have you heard how DevDax performs on
IOMMU_IOAS_MAP_FILE?  I'm curious if it has similar slowdowns compared
to HugeTLB.

I'll post an updated vfio_dma_mapping_perf_test that includes the
memfd updates in a follow up.

On Tue, Dec 23, 2025 at 6:10 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 23, 2025 at 11:00:43PM +0000, Aaron Lewis wrote:
>
> > More effort will be needed to see what kind of speed ups can be achieved
> > by optimizing iommufd.  Sample profiler results (below) show that it
> > isn't the GUP calls that are slowing it down like they were in the
> > "vfio_type1_iommu" case.  The majority of the slowdown is coming from
> > batch_from_pages(), and the majority of that time is being spent in
> > batch_add_pfn_num().
>
> This is probably because vfio is now using num_pages_contiguous()
> which seems to be a little bit faster than batch_add_pfn()
>
> Since that is pretty new maybe it explains the discrepancy in reports.
>
> > @@ -598,7 +600,18 @@ static long vaddr_get_pfns(struct mm_struct *mm, unsigned long vaddr,
> >       ret = pin_user_pages_remote(mm, vaddr, pin_pages, flags | FOLL_LONGTERM,
> >                                   batch->pages, NULL);
> >       if (ret > 0) {
> > +             unsigned long nr_pages = compound_nr(batch->pages[0]);
> > +             bool override_size = false;
> > +
> > +             if (PageHead(batch->pages[0]) && nr_pages > pin_pages &&
> > +                 ret == pin_pages) {
> > +                     override_size = true;
> > +                     ret = nr_pages;
> > +                     page_ref_add(batch->pages[0], nr_pages - pin_pages);
> > +             }
>
> This isn't right, num_pages_contiguous() is the best we can do for
> lists returned by pin_user_pages(). In a VMA context you cannot blindly
> assume the whole folio was mapped contiguously. Indeed I seem to
> recall this was already proposed and rejected and that is how we ended
> up with num_pages_contiguous().

Can't we assume a single page will be mapped contiguously?  If we are
operating on 1GB pages and the batch only covers ~30MB of that, and we
are at a head page, can't we assume the VA and PA will be contiguous at
least until the end of the current page?  Is the issue with the VMA
context that we have to ensure the VMA includes the entire page?  If
so, would something like the following resolve that?  Instead of simply
checking:

> > +             if (PageHead(batch->pages[0]) && nr_pages > pin_pages &&
> > +                 ret == pin_pages) {

Do this instead:

> > +             if (pin_huge_page_fast(...)) {

Where pin_huge_page_fast() does this:

+/* Must hold mmap_read_lock(mm). */
+static bool pin_huge_page_fast(struct page *page, struct mm_struct *mm,
+			       unsigned long vaddr, unsigned long npages,
+			       unsigned long pin_pages, long npinned)
+{
+	unsigned long nr_pages, nr_vmas_pages, untagged_vaddr;
+	struct vm_area_struct *vma;
+
+	/* Did pin_user_pages_remote() pin as many pages as expected? */
+	if (npinned != pin_pages)
+		return false;
+
+	/*
+	 * Only consider head pages.  That will give the maximum
+	 * benefits and simplify some of the checks below.
+	 */
+	if (!PageHead(page))
+		return false;
+
+	/*
+	 * Reject small pages.  Very large pages give the best benefits.
+	 */
+	nr_pages = compound_nr(page);
+	if (nr_pages <= pin_pages)
+		return false;
+
+	/*
+	 * Don't pin more pages than requested by VFIO_IOMMU_MAP_DMA.
+	 */
+	if (nr_pages > npages)
+		return false;
+
+	untagged_vaddr = untagged_addr_remote(mm, vaddr);
+	vma = vma_lookup(mm, untagged_vaddr);
+
+	/* Does the virtual address map to a VMA? */
+	if (!vma)
+		return false;
+
+	/* Is the virtual address in the VMA? */
+	if (untagged_vaddr < vma->vm_start ||
+	    untagged_vaddr >= vma->vm_end)
+		return false;
+
+	/*
+	 * Does the VMA contain the entire very large page?  I.e. it
+	 * isn't being overrun.
+	 */
+	nr_vmas_pages = (vma->vm_end - untagged_vaddr) >> PAGE_SHIFT;
+	if (nr_pages > nr_vmas_pages)
+		return false;
+
+	/*
+	 * This is a good candidate for huge page optimizations.  The
+	 * page is larger than a batch and fits in its VMA.  Therefore,
+	 * pin the entire page all at once, rather than walking over
+	 * each struct page in this huge page.
+	 */
+	return true;
+}

>
> Again, use the memfd path which supports this optimization already.
>
> Jason

Using memfd sounds reasonable assuming DevDax w/ 1GB pages performs
well.  I guess I'm more asking about pin_huge_page_fast() as a sanity
check to make sure I'm not missing anything, though if you are not
looking to take changes for VFIO_TYPE1_IOMMU I can remove it from the
series.

Aaron

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-29 21:40     ` Aaron Lewis
@ 2025-12-30  1:12       ` Jason Gunthorpe
  2026-01-05 18:31         ` David Matlack
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2025-12-30  1:12 UTC (permalink / raw)
  To: Aaron Lewis; +Cc: alex.williamson, dmatlack, kvm, seanjc

On Mon, Dec 29, 2025 at 01:40:02PM -0800, Aaron Lewis wrote:
> I tried the memfd path on iommufd and it is indeed fast.  It was in
> the same ballpark as the optimization I posted for VFIO_TYPE1_IOMMU.

I would expect it to be better than just ball park, it does a lot less
work with 1G pages.

> I also tried it with a HugeTLB fd, e.g. /mnt/huge/tmp.bin, and it was
> fast too.  I haven't had a chance to try it with DevDax w/ 1G pages,
> though.  

I don't think any of the DAX cases are supported. I doubt anyone would
complain about adding support to memfd_pin_folios() for devdax..

> I noticed DevDax was quite a bit slower than HugeTLB when I
> tested it on VFIO_TYPE1_IOMMU.  

If you are testing the latest upstream kernels it should be the same;
Alistair fixed it up to install folios at the PUD level:

aed877c2b425 ("device/dax: properly refcount device dax pages when mapping")

So the folios can be properly 1G sized and what comes out of
pin_user_pages() should not be any different from hugetlbfs.

If your devdax has 1G folios and PUD mappings is another question..

Older kernels were all broken here and num_pages_contiguous() wouldn't
work right.

> > This isn't right, num_pages_contiguous() is the best we can do for
> > lists returned by pin_user_pages(). In a VMA context you cannot blindly
> > assume the whole folio was mapped contiguously. Indeed I seem to
> > recall this was already proposed and rejected and that is how we ended
> > up with num_pages_contiguous().
> 
> Can't we assume a single page will be mapped contiguously?  

No.

pin_user_pages() ignores VMA boundaries and the user can create a
combination of VMAs that slices a folio.

In general the output of pin_user_pages() cannot be assumed to work
like this.

> If we are operating on 1GB pages and the batch only covers ~30MB of
> that, if we are a head page can't we assume the VA and PA will be
> contiguous at least until the end of the current page?  

No, only memfd_pin_folios() can use that logic because it can assume
there is no discontiguity. This is why it returns folios, and why
num_pages_contiguous() exists.

> + untagged_vaddr = untagged_addr_remote(mm, vaddr);
> + vma = vma_lookup(mm, untagged_vaddr);

Searching the VMA like this is kind of ridiculous for a performance
path.

Even within a VMA I don't think we actually have a universal rule that
folios have to be installed contiguously in the PTEs; I believe they
are permitted to be sliced, though a memfd wouldn't do that for its own
VMAs.

> Using memfd sounds reasonable assuming DevDax w/ 1GB pages performs
> well.

IMHO you are better to improve memfd_pin_folios() for your use case
than to mess with this stuff and type1. It is simpler without unknown
corner cases.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2025-12-30  1:12       ` Jason Gunthorpe
@ 2026-01-05 18:31         ` David Matlack
  2026-01-05 19:01           ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: David Matlack @ 2026-01-05 18:31 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Aaron Lewis, alex.williamson, kvm, seanjc

On Mon, Dec 29, 2025 at 5:12 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Mon, Dec 29, 2025 at 01:40:02PM -0800, Aaron Lewis wrote:
> > Can't we assume a single page will be mapped contiguously?
>
> No.
>
> pin_user_pages() ignores VMA boundaries and the user can create a
> combination of VMAs that slices a folio.
>
> In general the output of pin_user_pages() cannot be assumed to work
> like this.

Ack on the feedback that this is not a general solution and we should
switch to iommufd with memfd pinning for our use-case. But I think
Google will need to carry an optimization locally to type1 until we
can make that switch to meet our goals.

For HugeTLB mappings specifically, can it be assumed the VMA contains
the entire folio? I'm wondering what is the safest way to achieve
performance close to what Aaron achieved in his patch in type1 for
HugeTLB and DevDAX. (Not for upstream, just internally.)

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2026-01-05 18:31         ` David Matlack
@ 2026-01-05 19:01           ` Jason Gunthorpe
  2026-01-05 19:36             ` David Matlack
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2026-01-05 19:01 UTC (permalink / raw)
  To: David Matlack; +Cc: Aaron Lewis, alex.williamson, kvm, seanjc

On Mon, Jan 05, 2026 at 10:31:14AM -0800, David Matlack wrote:

> Ack on the feedback that this is not a general solution and we should
> switch to iommufd with memfd pinning for our use-case. But I think
> Google will need to carry an optimization locally to type1 until we
> can make that switch to meet our goals.
> 
> For HugeTLB mappings specifically, can it be assumed the VMA contains
> the entire folio? I'm wondering what is the safest way to achieve
> performance close to what Aaron achieved in his patch in type1 for
> HugeTLB and DevDAX. (Not for upstream, just internally.)

If you are certain the address range in question is a single VMA and
that VMA is one of the special memfd-like types then you should be
able to do that.

The issue here is that VFIO doesn't have any idea about VMAs and
pin_user_pages_fast() doesn't check them. So you need to give up
pin_user_pages_fast() and have vfio code bound the work to actual VMAs
under a lock with the table read to make this solution work fully
properly.

Of course, for your very special use case the VMM isn't creating sliced
up VMAs and trying to attack the kernel with them, so the simple
solution here is workable with VMM co-operation (though it opens a sort
of security problem, of course).

But I wouldn't want to see such hackery in upstream.

Jason

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC PATCH 1/2] vfio: Improve DMA mapping performance for huge pages
  2026-01-05 19:01           ` Jason Gunthorpe
@ 2026-01-05 19:36             ` David Matlack
  0 siblings, 0 replies; 10+ messages in thread
From: David Matlack @ 2026-01-05 19:36 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Aaron Lewis, alex.williamson, kvm, seanjc

On Mon, Jan 5, 2026 at 11:01 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Jan 05, 2026 at 10:31:14AM -0800, David Matlack wrote:
>
> > Ack on the feedback that this is not a general solution and we should
> > switch to iommufd with memfd pinning for our use-case. But I think
> > Google will need to carry an optimization locally to type1 until we
> > can make that switch to meet our goals.
> >
> > For HugeTLB mappings specifically, can it be assumed the VMA contains
> > the entire folio? I'm wondering what is the safest way to achieve
> > performance close to what Aaron achieved in his patch in type1 for
> > HugeTLB and DevDAX. (Not for upstream, just internally.)
>
> If you are certain the address range in question is a single VMA and
> that VMA is one of the special memfd-like types then you should be
> able to do that.
>
> The issue here is that VFIO doesn't have any idea about VMAs and
> pin_user_pages_fast() doesn't check them. So you need to give up
> pin_user_pages_fast() and have vfio code bound the work to actual VMAs
> under a lock with the table read to make this solution work fully
> properly.

If the page returned by pin_user_pages_fast() is PageHuge(), then I
think VFIO can assume that the folio is within a single VMA? HugeTLB
does not support mmap() smaller than the huge page size.

Userspace could of course slice the HugeTLB page in
VFIO_IOMMU_MAP_DMA, but that's easy to check for in VFIO.
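
(A hypothetical shape of such a check, just to make the idea concrete;
this is not the posted patch:)

	/*
	 * Sketch: only trust whole-folio pinning for hugetlb pages, since
	 * hugetlb does not allow mapping at sub-hugepage granularity, and
	 * only when the DMA map request covers the whole folio.
	 */
	static bool whole_hugetlb_folio_pinnable(struct page *page,
						 unsigned long npages_requested)
	{
		return PageHuge(page) && compound_nr(page) <= npages_requested;
	}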

> Of course for your very special use case the VMM isn't creating sliced
> up VMAs and trying to attack the kernel with them so the simple
> solution here is workable with VMM co-operation. (though it opens a
> sort of security problem of course)

For DevDAX we might have to go down a path like this... or add
restrictions to mmap() similar to HugeTLB.

> But I wouldn't want to see such hackery in upstream.

Ack, we'll keep the type1 code to ourselves :). Thanks for your help though.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2026-01-05 19:37 UTC | newest]

Thread overview: 10+ messages
2025-12-23 23:00 [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Aaron Lewis
2025-12-23 23:00 ` [RFC PATCH 1/2] " Aaron Lewis
2025-12-24  2:10   ` Jason Gunthorpe
2025-12-29 21:40     ` Aaron Lewis
2025-12-30  1:12       ` Jason Gunthorpe
2026-01-05 18:31         ` David Matlack
2026-01-05 19:01           ` Jason Gunthorpe
2026-01-05 19:36             ` David Matlack
2025-12-23 23:00 ` [RFC PATCH 2/2] vfio: selftest: Add vfio_dma_mapping_perf_test Aaron Lewis
2025-12-24  2:04 ` [RFC PATCH 0/2] vfio: Improve DMA mapping performance for huge pages Jason Gunthorpe
