* [PATCH 0/3] vfio/type1: map/unmap chunking + conditional rescheduling
From: Alex Williamson @ 2015-01-15 17:35 UTC
To: iommu@lists.linux-foundation.org, kvm@vger.kernel.org
This series is inspired by the IOMMU tracing code added by Shuah Khan
and a request to report some of those traces for vfio-based device
assignment. What I saw in the trace was an absurd number of calls to
iommu_unmap(). In fact, I had to up the trace log buffer size several
times before I was actually able to capture a full guest startup and
shutdown.

The problem was that I was testing on a VT-d system without IOMMU
superpage support, which the current code relies on to produce any
sort of optimization in the unmap path. Without it, we explicitly
unmap every single page. We can do better. In fact, doing better on
Intel simply requires asking the IOMMU for the next translations to
determine the extent of a physically contiguous range to unmap. With
hugetlbfs, this can boost a synthetic VM test case by upwards of 40%,
and by 30% even without hugetlbfs. An IOMMU with 2M superpage support
does pretty well with hugetlbfs, so we break even there, but when
using non-hugetlbfs pages we can still see about a 30% improvement.
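
The core of that change, condensed from patch 1 below: before each
iommu_unmap(), walk forward through the IOVA space with
iommu_iova_to_phys() for as long as the physical addresses stay
contiguous, then unmap the whole run with a single call.

	size_t len;
	phys_addr_t phys = iommu_iova_to_phys(domain->domain, iova);

	/* Grow the range while the backing memory stays contiguous */
	for (len = PAGE_SIZE; iova + len < end; len += PAGE_SIZE) {
		if (iommu_iova_to_phys(domain->domain, iova + len) !=
		    phys + len)
			break;
	}

	unmapped = iommu_unmap(domain->domain, iova, len);
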
The trouble comes with AMD-Vi, which has crazy awesome fine-grained
superpage support. The above change hurts the existing hugetlbfs case
by about 25% and maybe marginally helps the non-hugetlbfs case.

The solution I've come up with is to attempt to detect fine-grained
superpage support and only enable the vfio-based unmap chunking when
it's not detected. This maintains AMD-Vi performance while still
helping Intel VT-d. I'm curious to see what this will do on other
IOMMUs. Please test and report.
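
For reference, the detection probe in patch 1 boils down to the
following (condensed; the full version also checks the map result and
frees the test pages): map two contiguous 4K pages, then ask to unmap
only the first one.

	/*
	 * A fine-grained superpage IOMMU (ex. AMD-Vi) merges the two
	 * pages into a single 8K superpage, so a PAGE_SIZE unmap
	 * request removes both and returns 2 * PAGE_SIZE.
	 */
	size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);

	domain->fgsp = (unmapped > PAGE_SIZE);
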
The trace also showed that we do single page mappings for invalid and
reserved memory regions, like mappings of MMIO BARs, for no good
reason. We only need to skip the accounting here, not the chunking.

Finally, the trace showed that we can run far, far too long with
need_resched set, so I've sprinkled in some cond_resched() calls after
potentially time-consuming IOMMU operations.

All said, on the troublesome Intel case, these changes result in
about 2% of the iommu_unmap() calls we had originally and only
sporadic need_resched sightings. Thanks,

Alex
---
Alex Williamson (3):
vfio/type1: Add conditional rescheduling
vfio/type1: Chunk contiguous reserved/invalid page mappings
vfio/type1: DMA unmap chunking
drivers/vfio/vfio_iommu_type1.c | 80 ++++++++++++++++++++++++++++++++++-----
1 file changed, 69 insertions(+), 11 deletions(-)
* [PATCH 1/3] vfio/type1: DMA unmap chunking
From: Alex Williamson @ 2015-01-15 17:35 UTC
To: iommu@lists.linux-foundation.org, kvm@vger.kernel.org
When unmapping DMA entries we try to rely on the IOMMU API behavior
that allows the IOMMU to unmap a larger area than requested, up to
the size of the original mapping. This works great when the IOMMU
supports superpages *and* they're in use. Otherwise, each PAGE_SIZE
increment is unmapped separately, resulting in poor performance.
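
For reference, the existing per-page loop looks like this (abridged),
with one iommu_unmap() call, and potentially one hardware flush, for
every 4K page unless the IOMMU happened to install a superpage:

	while (iova < end) {
		size_t unmapped;
		phys_addr_t phys;

		phys = iommu_iova_to_phys(domain->domain, iova);
		...
		unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
		...
		iova += unmapped; /* > PAGE_SIZE only via superpages */
	}
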
Instead we can use the IOVA-to-physical-address translation provided
by the IOMMU API and unmap using the largest contiguous physical
memory chunk available, which is also how vfio/type1 would have
mapped the region. For a synthetic 1TB guest VM mapping and shutdown
test on Intel VT-d (2M IOMMU pagesize support), this achieves about
a 30% overall improvement mapping standard 4K pages, regardless of
IOMMU superpage enabling, and about a 40% improvement mapping 2M
hugetlbfs pages when IOMMU superpages are not available. Hugetlbfs
with IOMMU superpages enabled is effectively unchanged.

Unfortunately the same algorithm does not work well on IOMMUs with
fine-grained superpages, like AMD-Vi, costing about 25% extra since
the IOMMU will automatically unmap any power-of-two contiguous
mapping we've provided it. We add a routine and a domain flag to
detect this feature, leaving AMD-Vi unaffected by this unmap
optimization.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
drivers/vfio/vfio_iommu_type1.c | 54 +++++++++++++++++++++++++++++++++++++--
1 file changed, 51 insertions(+), 3 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 4a9d666..e6e7f15 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -66,6 +66,7 @@ struct vfio_domain {
struct list_head next;
struct list_head group_list;
int prot; /* IOMMU_CACHE */
+ bool fgsp; /* Fine-grained super pages */
};
struct vfio_dma {
@@ -350,8 +351,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
iommu_unmap(d->domain, dma->iova, dma->size);
while (iova < end) {
- size_t unmapped;
- phys_addr_t phys;
+ size_t unmapped, len;
+ phys_addr_t phys, next;
phys = iommu_iova_to_phys(domain->domain, iova);
if (WARN_ON(!phys)) {
@@ -359,7 +360,19 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
continue;
}
- unmapped = iommu_unmap(domain->domain, iova, PAGE_SIZE);
+ /*
+ * To optimize for fewer iommu_unmap() calls, each of which
+ * may require hardware cache flushing, try to find the
+ * largest contiguous physical memory chunk to unmap.
+ */
+ for (len = PAGE_SIZE;
+ !domain->fgsp && iova + len < end; len += PAGE_SIZE) {
+ next = iommu_iova_to_phys(domain->domain, iova + len);
+ if (next != phys + len)
+ break;
+ }
+
+ unmapped = iommu_unmap(domain->domain, iova, len);
if (WARN_ON(!unmapped))
break;
@@ -665,6 +678,39 @@ static int vfio_iommu_replay(struct vfio_iommu *iommu,
return 0;
}
+/*
+ * We change our unmap behavior slightly depending on whether the IOMMU
+ * supports fine-grained superpages. IOMMUs like AMD-Vi will use a superpage
+ * for practically any contiguous power-of-two mapping we give it. This means
+ * we don't need to look for contiguous chunks ourselves to make unmapping
+ * more efficient. On IOMMUs with coarse-grained super pages, like Intel VT-d
+ * with discrete 2M/1G/512G/1T superpages, identifying contiguous chunks
+ * significantly boosts non-hugetlbfs mappings and doesn't seem to hurt when
+ * hugetlbfs is in use.
+ */
+static void vfio_test_domain_fgsp(struct vfio_domain *domain)
+{
+ struct page *pages;
+ int ret, order = get_order(PAGE_SIZE * 2);
+
+ pages = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
+ if (!pages)
+ return;
+
+ ret = iommu_map(domain->domain, 0, page_to_phys(pages), PAGE_SIZE * 2,
+ IOMMU_READ | IOMMU_WRITE | domain->prot);
+ if (!ret) {
+ size_t unmapped = iommu_unmap(domain->domain, 0, PAGE_SIZE);
+
+ if (unmapped == PAGE_SIZE)
+ iommu_unmap(domain->domain, PAGE_SIZE, PAGE_SIZE);
+ else
+ domain->fgsp = true;
+ }
+
+ __free_pages(pages, order);
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -758,6 +804,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
}
}
+ vfio_test_domain_fgsp(domain);
+
/* replay mappings on new domains */
ret = vfio_iommu_replay(iommu, domain);
if (ret)
* [PATCH 2/3] vfio/type1: Chunk contiguous reserved/invalid page mappings
From: Alex Williamson @ 2015-01-15 17:35 UTC
To: iommu@lists.linux-foundation.org, kvm@vger.kernel.org
We currently map invalid and reserved pages, such as often occur from
mapping MMIO regions of a VM through the IOMMU, using single pages.
There's really no reason we can't instead follow the methodology we
use for normal pages and find the largest possible physically
contiguous chunk for mapping. The only difference is that we don't
do locked memory accounting for these, since they're not backed by
RAM.

In most applications this will be a very minor improvement, but when
graphics and GPGPU devices are in play, MMIO BARs become non-trivial.
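
For context, the test that separates the two cases is
is_invalid_reserved_pfn(); simplified (the in-tree version also
handles compound page heads), it amounts to:

	/*
	 * A pfn with no struct page behind it (typical for MMIO) or
	 * one marked PageReserved is not ordinary RAM, so it gets
	 * pinned without being charged against RLIMIT_MEMLOCK.
	 */
	static bool is_invalid_reserved_pfn(unsigned long pfn)
	{
		if (pfn_valid(pfn))
			return PageReserved(pfn_to_page(pfn));

		return true;
	}
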
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
drivers/vfio/vfio_iommu_type1.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index e6e7f15..35c9008 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -265,6 +265,7 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
bool lock_cap = capable(CAP_IPC_LOCK);
long ret, i;
+ bool rsvd;
if (!current->mm)
return -ENODEV;
@@ -273,10 +274,9 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
if (ret)
return ret;
- if (is_invalid_reserved_pfn(*pfn_base))
- return 1;
+ rsvd = is_invalid_reserved_pfn(*pfn_base);
- if (!lock_cap && current->mm->locked_vm + 1 > limit) {
+ if (!rsvd && !lock_cap && current->mm->locked_vm + 1 > limit) {
put_pfn(*pfn_base, prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
limit << PAGE_SHIFT);
@@ -284,7 +284,8 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
}
if (unlikely(disable_hugepages)) {
- vfio_lock_acct(1);
+ if (!rsvd)
+ vfio_lock_acct(1);
return 1;
}
@@ -296,12 +297,14 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
if (ret)
break;
- if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
+ if (pfn != *pfn_base + i ||
+ rsvd != is_invalid_reserved_pfn(pfn)) {
put_pfn(pfn, prot);
break;
}
- if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
+ if (!rsvd && !lock_cap &&
+ current->mm->locked_vm + i + 1 > limit) {
put_pfn(pfn, prot);
pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
__func__, limit << PAGE_SHIFT);
@@ -309,7 +312,8 @@ static long vfio_pin_pages(unsigned long vaddr, long npage,
}
}
- vfio_lock_acct(i);
+ if (!rsvd)
+ vfio_lock_acct(i);
return i;
}
* [PATCH 3/3] vfio/type1: Add conditional rescheduling
From: Alex Williamson @ 2015-01-15 17:35 UTC
To: iommu, kvm; +Cc: alex.williamson
IOMMU operations can be expensive, and it's not very difficult for a
user to give us a lot of work to do for a map or unmap operation.
Killing a large VM with vfio-assigned devices can result in soft
lockups, and IOMMU tracing shows that we can easily spend 80% of our
time with need_resched set. A sprinkling of cond_resched() calls
after map and unmap calls has a very tiny effect on performance
while resulting in traces with <1% of calls overflowing into
need_resched.
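
These are all process-context ioctl paths holding at most a mutex, so
the fix is just the usual idiom of offering to yield after each
expensive call, for example:

	iommu_unmap(d->domain, dma->iova, dma->size);
	cond_resched(); /* may sleep; no spinlocks held */
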
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
drivers/vfio/vfio_iommu_type1.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 35c9008..57d8c37 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -351,8 +351,10 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
domain = d = list_first_entry(&iommu->domain_list,
struct vfio_domain, next);
- list_for_each_entry_continue(d, &iommu->domain_list, next)
+ list_for_each_entry_continue(d, &iommu->domain_list, next) {
iommu_unmap(d->domain, dma->iova, dma->size);
+ cond_resched();
+ }
while (iova < end) {
size_t unmapped, len;
@@ -384,6 +386,8 @@ static void vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma)
unmapped >> PAGE_SHIFT,
dma->prot, false);
iova += unmapped;
+
+ cond_resched();
}
vfio_lock_acct(-unlocked);
@@ -528,6 +532,8 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova,
map_try_harder(d, iova, pfn, npage, prot))
goto unwind;
}
+
+ cond_resched();
}
return 0;