* [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

This requirement originates from the VT-d vfio series:

  https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html

The goal of this series is to allow the IOMMU to notify unmap with a very
big IOTLB range, for example, with base=0 and size=2^63-1 (to unmap the
whole address space).

The first patch is good to have, for traces.

The second one is a cleanup of existing code only.

The third one moves the RAM translation and check into the map operation
logic, so that unmap operations no longer need them.

The series is marked as RFC since I am not sure whether this is a
workable way. Anyway, please review to help confirm it.

Thanks.

Peter Xu (3):
  vfio: trace map/unmap for notify as well
  vfio: introduce vfio_get_vaddr()
  vfio: allow to notify unmap for very large region

 hw/vfio/common.c     | 56 ++++++++++++++++++++++++++++++++++------------------
 hw/vfio/trace-events |  2 +-
 2 files changed, 38 insertions(+), 20 deletions(-)

-- 
2.7.4
* [Qemu-devel] [PATCH RFC 1/3] vfio: trace map/unmap for notify as well
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

We trace its range, but we don't know whether it's a MAP or an UNMAP.
Let's dump that as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c     | 3 ++-
 hw/vfio/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..174f351 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -305,7 +305,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ef81609..7ae8233 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -84,7 +84,7 @@ vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-vfio_iommu_map_notify(uint64_t iova_start, uint64_t iova_end) "iommu map @ %"PRIx64" - %"PRIx64
+vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "iommu %s @ %"PRIx64" - %"PRIx64
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] %"PRIx64" - %"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] %"PRIx64" - %"PRIx64" [%p]"
-- 
2.7.4
* [Qemu-devel] [PATCH RFC 2/3] vfio: introduce vfio_get_vaddr()
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

A cleanup for vfio_iommu_map_notify(). Should have no functional change,
just to make the function shorter and easier to understand.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 174f351..ce55dff 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
+                           bool *read_only)
 {
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
-                                iova, iova + iotlb->addr_mask);
-
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
-    }
+    bool ret = false;
+    bool writable = iotlb->perm & IOMMU_WO;
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
     mr = address_space_translate(&address_space_memory,
                                  iotlb->translated_addr,
-                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+                                 &xlat, &len, writable);
     if (!memory_region_is_ram(mr)) {
         error_report("iommu map to non memory area %"HWADDR_PRIx"",
                      xlat);
         goto out;
     }
+
     /*
      * Translation truncates length to the IOMMU page size,
      * check that it did not truncate too much.
@@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         goto out;
     }
 
+    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    *read_only = !writable || mr->readonly;
+    ret = true;
+
+out:
+    rcu_read_unlock();
+    return ret;
+}
+
+static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    bool read_only;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        return;
+    }
+
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        vaddr = memory_region_get_ram_ptr(mr) + xlat;
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
-                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+                           read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -357,8 +377,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, ret);
         }
     }
-out:
-    rcu_read_unlock();
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
-- 
2.7.4
* [Qemu-devel] [PATCH RFC 3/3] vfio: allow to notify unmap for very large region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

The Linux vfio driver supports VFIO_IOMMU_UNMAP_DMA on a very big region.
This can be leveraged by the QEMU IOMMU implementation to clean up
existing page mappings for an entire iova address space (by notifying
with an IOTLB that has an extremely large addr_mask). However, the
current vfio_iommu_map_notify() does not allow that: it makes sure that
every translated address in the IOTLB falls into a RAM range.

The check makes sense, but it is only sensible for map operations and
means little for unmap operations.

This patch moves the check into the map logic only, so that we get
faster unmap handling (no need to translate again), and we can then
better support unmapping a very big region even when it covers non-RAM
ranges or ranges that do not exist at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ce55dff..4d90844 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -354,11 +354,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         return;
     }
 
-    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
-        return;
-    }
-
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+            return;
+        }
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            read_only);
-- 
2.7.4
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-19 17:54 UTC
  To: Peter Xu; +Cc: qemu-devel

On Thu, 19 Jan 2017 17:25:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> This requirement originates from the VT-d vfio series:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
>
> The goal of this series is to allow the IOMMU to notify unmap with a
> very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> unmap the whole address space).
>
> The first patch is good to have, for traces.
>
> The second one is a cleanup of existing code only.

Sort of, it does add some overhead to the unmap path, but you remove
that and more in the third patch.

> The third one moves the RAM translation and check into the map
> operation logic, so that unmap operations no longer need them.
>
> The series is marked as RFC since I am not sure whether this is a
> workable way. Anyway, please review to help confirm it.

It seems reasonable to me, except for the example here of using 2^63-1,
which I expect is to work around the vfio {un}map API bug as we
discussed on irc. For everyone, the root of the problem is that the
ioctls use:

	__u64   iova;        /* IO virtual address */
	__u64   size;        /* Size of mapping (bytes) */

So we can only express a size of 2^64-1 and we have an off-by-one error
trying to perform a map/unmap of an entire 64-bit space. Note when
designing an API, use start/end rather than base/size to avoid this.

What I don't want to see is for this API bug to leak out into the rest
of the QEMU code such that intel_iommu code, or iommu code in general,
subtly avoids it by artificially using a smaller range. VT-d hardware
has an actual physical address space of either 2^39 or 2^48 bits, so if
you want to make the iommu address space match the device we're trying
to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
64-bit address space on the IOMMU, so to handle that case I'd expect
the simplest solution would be to track the mapped iova high water mark
per container in vfio and truncate unmaps to that high water end
address. Realistically we're probably not going to see iovas at the end
of the 64-bit address space, but we can come up with some other
workaround in the vfio code or update the kernel API if we do. Thanks,

Alex
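[As an aside on the base/size versus start/end point above, here is a
minimal C sketch of the off-by-one. The struct names and fields are made
up for illustration only; they are not the actual vfio uAPI definitions.]

#include <stdint.h>
#include <stdio.h>

/* Hypothetical base/size request, mirroring the shape of the ioctl
 * fields quoted above; a __u64 size can never express 2^64 bytes. */
struct range_by_size {
    uint64_t iova;
    uint64_t size;
};

/* Hypothetical start/end (inclusive) request that avoids the problem. */
struct range_by_end {
    uint64_t start;
    uint64_t end;    /* inclusive, so 0..UINT64_MAX covers everything */
};

int main(void)
{
    /* Covering the whole 64-bit space needs size 2^64, which wraps to 0
     * in a uint64_t; the best we can do is one byte short. */
    struct range_by_size s = { .iova = 0, .size = UINT64_MAX };
    printf("base/size reaches 0x%llx (one byte short)\n",
           (unsigned long long)(s.iova + s.size - 1));

    /* With an inclusive end there is no off-by-one. */
    struct range_by_end e = { .start = 0, .end = UINT64_MAX };
    printf("start/end reaches 0x%llx\n", (unsigned long long)e.end);
    return 0;
}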
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 3:43 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> On Thu, 19 Jan 2017 17:25:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > This requirement originates from the VT-d vfio series:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> >
> > The goal of this series is to allow the IOMMU to notify unmap with a
> > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > unmap the whole address space).
> >
> > The first patch is good to have, for traces.
> >
> > The second one is a cleanup of existing code only.
>
> Sort of, it does add some overhead to the unmap path, but you remove
> that and more in the third patch.

Hmm, yes, I tried to get the ram pointer even for unmap. I should
remove the ending "only".

> > The third one moves the RAM translation and check into the map
> > operation logic, so that unmap operations no longer need them.
> >
> > The series is marked as RFC since I am not sure whether this is a
> > workable way. Anyway, please review to help confirm it.
>
> It seems reasonable to me,

Good to know that you didn't disagree on this. :) Then let me take
these patches along with that series in the next post (since that
series will depend on this one, I'll keep them together; maintainers,
please decide how you want to merge them),

> except for the example here of using 2^63-1,
> which I expect is to work around the vfio {un}map API bug as we
> discussed on irc.

Actually not only for that one; I don't know whether we have a problem
here:

(I mentioned this on IRC; in case you missed it, I paste it here as
well)

static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
                                      dma_addr_t start, size_t size)
{
        struct rb_node *node = iommu->dma_list.rb_node;

        while (node) {
                struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);

                if (start + size <= dma->iova)
                        node = node->rb_left;
                else if (start >= dma->iova + dma->size)
                                  ^^^^^^^^^^^^^^^^^^^^^
                        Could it overflow here?  <--------------
                        node = node->rb_right;
                else
                        return dma;
        }

        return NULL;
}

I used 2^63-1 to try to avoid both cases.

> For everyone, the root of the problem is that the
> ioctls use:
>
> 	__u64   iova;        /* IO virtual address */
> 	__u64   size;        /* Size of mapping (bytes) */
>
> So we can only express a size of 2^64-1 and we have an off-by-one error
> trying to perform a map/unmap of an entire 64-bit space. Note when
> designing an API, use start/end rather than base/size to avoid this.

Lesson learned.

> What I don't want to see is for this API bug to leak out into the rest
> of the QEMU code such that intel_iommu code, or iommu code in general,
> subtly avoids it by artificially using a smaller range. VT-d hardware
> has an actual physical address space of either 2^39 or 2^48 bits, so if
> you want to make the iommu address space match the device we're trying
> to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> 64-bit address space on the IOMMU, so to handle that case I'd expect
> the simplest solution would be to track the mapped iova high water mark
> per container in vfio and truncate unmaps to that high water end
> address. Realistically we're probably not going to see iovas at the end
> of the 64-bit address space, but we can come up with some other
> workaround in the vfio code or update the kernel API if we do. Thanks,

Agree that the high watermark can be a good solution for VT-d. I'll use
that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
still need to solve the existing issues in the future, since in that
case the high watermark can be (hwaddr)-1 as well if the guest
specifies it.

(Though anyway I'm not sure when AMD vIOMMUs will be ready for device
assignment, so I don't think that's a big issue at least for now)

Thanks!

-- peterx
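[For reference, a rough sketch of the per-container high water mark idea
discussed above, under the assumption that the map path has already
rejected overflowing ranges. The struct and helper names are illustrative,
not the actual vfio code.]

#include <stdint.h>
#include <stdio.h>

typedef uint64_t hwaddr;

struct container_state {
    hwaddr iova_hwm;   /* one past the highest mapped iova, 0 if none */
};

/* Called on the map path; the map path has already checked overflow. */
static void record_map(struct container_state *c, hwaddr iova, hwaddr size)
{
    hwaddr end = iova + size;
    if (end > c->iova_hwm) {
        c->iova_hwm = end;
    }
}

/* Called on the unmap path: truncate the request to the high water mark. */
static hwaddr clamp_unmap_size(const struct container_state *c,
                               hwaddr iova, hwaddr size)
{
    if (iova >= c->iova_hwm) {
        return 0;                      /* nothing mapped at or above iova */
    }
    if (size > c->iova_hwm - iova) {
        size = c->iova_hwm - iova;
    }
    return size;
}

int main(void)
{
    struct container_state c = { .iova_hwm = 0 };
    record_map(&c, 0x1000, 0x2000);    /* map [0x1000, 0x3000) */
    /* An "unmap everything" request gets clamped to the high water mark. */
    printf("clamped size: 0x%llx\n",
           (unsigned long long)clamp_unmap_size(&c, 0, UINT64_MAX));
    return 0;
}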
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-20 4:21 UTC
  To: Peter Xu; +Cc: qemu-devel

On Fri, 20 Jan 2017 11:43:28 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > On Thu, 19 Jan 2017 17:25:29 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > This requirement originates from the VT-d vfio series:
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > >
> > > The goal of this series is to allow the IOMMU to notify unmap with a
> > > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > > unmap the whole address space).
> > >
> > > The first patch is good to have, for traces.
> > >
> > > The second one is a cleanup of existing code only.
> >
> > Sort of, it does add some overhead to the unmap path, but you remove
> > that and more in the third patch.
>
> Hmm, yes, I tried to get the ram pointer even for unmap. I should
> remove the ending "only".
>
> > > The third one moves the RAM translation and check into the map
> > > operation logic, so that unmap operations no longer need them.
> > >
> > > The series is marked as RFC since I am not sure whether this is a
> > > workable way. Anyway, please review to help confirm it.
> >
> > It seems reasonable to me,
>
> Good to know that you didn't disagree on this. :) Then let me take
> these patches along with that series in the next post (since that
> series will depend on this one, I'll keep them together; maintainers,
> please decide how you want to merge them),
>
> > except for the example here of using 2^63-1,
> > which I expect is to work around the vfio {un}map API bug as we
> > discussed on irc.
>
> Actually not only for that one; I don't know whether we have a problem
> here:
>
> (I mentioned this on IRC; in case you missed it, I paste it here as
> well)
>
> static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>                                       dma_addr_t start, size_t size)
> {
>         struct rb_node *node = iommu->dma_list.rb_node;
>
>         while (node) {
>                 struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>
>                 if (start + size <= dma->iova)
>                         node = node->rb_left;
>                 else if (start >= dma->iova + dma->size)
>                                   ^^^^^^^^^^^^^^^^^^^^^
>                         Could it overflow here?  <--------------
>                         node = node->rb_right;
>                 else
>                         return dma;
>         }
>
>         return NULL;
> }
>
> I used 2^63-1 to try to avoid both cases.

Perhaps I'm not seeing the overflow you're trying to identify,
but dma->iova + dma->size cannot overflow, these are existing DMA
mappings and the DMA mapping path does test for overflow. I think we
can consider those sanitized. I am wondering if the test above it is
suspect though, start + size doesn't seem to be checked for overflow on
the unmap path. It seems pretty easy to avoid from the user side
though, but maybe we should add a test for it in the kernel.

> > For everyone, the root of the problem is that the
> > ioctls use:
> >
> > 	__u64   iova;        /* IO virtual address */
> > 	__u64   size;        /* Size of mapping (bytes) */
> >
> > So we can only express a size of 2^64-1 and we have an off-by-one error
> > trying to perform a map/unmap of an entire 64-bit space. Note when
> > designing an API, use start/end rather than base/size to avoid this.
>
> Lesson learned.
>
> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general,
> > subtly avoids it by artificially using a smaller range. VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water mark
> > per container in vfio and truncate unmaps to that high water end
> > address. Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do. Thanks,
>
> Agree that the high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> still need to solve the existing issues in the future, since in that
> case the high watermark can be (hwaddr)-1 as well if the guest
> specifies it.
>
> (Though anyway I'm not sure when AMD vIOMMUs will be ready for device
> assignment, so I don't think that's a big issue at least for now)

Even with a true 64bit address space, it seems very unlikely we'd have
iovas at the very top of that address space. If we do, another idea
would be that iommus often have a reserved iova range for MSI mapping;
both VT-d and AMD-Vi do. We know there would be no user iovas within
those ranges, so it would make a clean point to split a full 64-bit
unmap in two. Hopefully we'll have reporting of that reserved range up
through vfio should/when we need something like that. We could also
simply track both the start and end of that mapping that sets the high
water mark, unmapping all except that, then that alone as two separate
ioctls. I'm not too concerned that we'll be able to work around it if
we need to. Thanks,

Alex
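[A small sketch of the user-side (or kernel-side) overflow guard Alex
mentions for the unmap path. The helper name is hypothetical; it only
illustrates the check, not the actual kernel change.]

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Reject (start, size) unmap requests whose end wraps around 64 bits:
 * start + size overflows exactly when size > UINT64_MAX - start. */
static bool unmap_range_is_valid(uint64_t start, uint64_t size)
{
    if (size == 0) {
        return false;               /* nothing to unmap */
    }
    return size <= UINT64_MAX - start;
}

int main(void)
{
    /* 0 + (2^64-1) does not wrap; 1 + (2^64-1) would. */
    printf("%d %d\n", unmap_range_is_valid(0, UINT64_MAX),
                      unmap_range_is_valid(1, UINT64_MAX));
    return 0;
}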
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 4:45 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Thu, Jan 19, 2017 at 09:21:10PM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 11:43:28 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > > On Thu, 19 Jan 2017 17:25:29 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > This requirement originates from the VT-d vfio series:
> > > >
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > > >
> > > > The goal of this series is to allow the IOMMU to notify unmap with a
> > > > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > > > unmap the whole address space).
> > > >
> > > > The first patch is good to have, for traces.
> > > >
> > > > The second one is a cleanup of existing code only.
> > >
> > > Sort of, it does add some overhead to the unmap path, but you remove
> > > that and more in the third patch.
> >
> > Hmm, yes, I tried to get the ram pointer even for unmap. I should
> > remove the ending "only".
> >
> > > > The third one moves the RAM translation and check into the map
> > > > operation logic, so that unmap operations no longer need them.
> > > >
> > > > The series is marked as RFC since I am not sure whether this is a
> > > > workable way. Anyway, please review to help confirm it.
> > >
> > > It seems reasonable to me,
> >
> > Good to know that you didn't disagree on this. :) Then let me take
> > these patches along with that series in the next post (since that
> > series will depend on this one, I'll keep them together; maintainers,
> > please decide how you want to merge them),
> >
> > > except for the example here of using 2^63-1,
> > > which I expect is to work around the vfio {un}map API bug as we
> > > discussed on irc.
> >
> > Actually not only for that one; I don't know whether we have a problem
> > here:
> >
> > (I mentioned this on IRC; in case you missed it, I paste it here as
> > well)
> >
> > static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> >                                       dma_addr_t start, size_t size)
> > {
> >         struct rb_node *node = iommu->dma_list.rb_node;
> >
> >         while (node) {
> >                 struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> >
> >                 if (start + size <= dma->iova)
> >                         node = node->rb_left;
> >                 else if (start >= dma->iova + dma->size)
> >                                   ^^^^^^^^^^^^^^^^^^^^^
> >                         Could it overflow here?  <--------------
> >                         node = node->rb_right;
> >                 else
> >                         return dma;
> >         }
> >
> >         return NULL;
> > }
> >
> > I used 2^63-1 to try to avoid both cases.
>
> Perhaps I'm not seeing the overflow you're trying to identify,
> but dma->iova + dma->size cannot overflow, these are existing DMA
> mappings and the DMA mapping path does test for overflow. I think we
> can consider those sanitized. I am wondering if the test above it is
> suspect though, start + size doesn't seem to be checked for overflow on
> the unmap path. It seems pretty easy to avoid from the user side
> though, but maybe we should add a test for it in the kernel.

Ah, yes. :-)

> > > For everyone, the root of the problem is that the
> > > ioctls use:
> > >
> > > 	__u64   iova;        /* IO virtual address */
> > > 	__u64   size;        /* Size of mapping (bytes) */
> > >
> > > So we can only express a size of 2^64-1 and we have an off-by-one error
> > > trying to perform a map/unmap of an entire 64-bit space. Note when
> > > designing an API, use start/end rather than base/size to avoid this.
> >
> > Lesson learned.
> >
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water mark
> > > per container in vfio and truncate unmaps to that high water end
> > > address. Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do. Thanks,
> >
> > Agree that the high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> > still need to solve the existing issues in the future, since in that
> > case the high watermark can be (hwaddr)-1 as well if the guest
> > specifies it.
> >
> > (Though anyway I'm not sure when AMD vIOMMUs will be ready for device
> > assignment, so I don't think that's a big issue at least for now)
>
> Even with a true 64bit address space, it seems very unlikely we'd have
> iovas at the very top of that address space.

Yes, especially if the IOVA is allocated inside the kernel. However,
since we have DPDK these days, that's still possible? I remember Jason
mentioned an OOM attack against such a tree: since we have a per-domain
tree in kernel vfio, it's also possible that an aggressive guest (DPDK
user) maps in a special pattern that lets the tree grow very big and
finally exhausts host memory. Is this a problem as well?

> If we do, another idea
> would be that iommus often have a reserved iova range for MSI mapping;
> both VT-d and AMD-Vi do. We know there would be no user iovas within
> those ranges, so it would make a clean point to split a full 64-bit
> unmap in two. Hopefully we'll have reporting of that reserved range up
> through vfio should/when we need something like that.

Since we are here... IIUC VT-d has reserved 0xfeeXXXXX for MSI. Do we
have protection in the Intel IOMMU driver so that the user cannot use an
IOVA inside this range (I guess not)? Though this is totally off topic:
even if this is a problem, it would be the intel-iommu driver's.

> We could also
> simply track both the start and end of that mapping that sets the high
> water mark, unmapping all except that, then that alone as two separate
> ioctls. I'm not too concerned that we'll be able to work around it if
> we need to. Thanks,

Sure.

Thanks,

-- peterx
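[A sketch of the "split the unmap around the reserved MSI window" idea,
assuming the VT-d 0xfeeXXXXX window mentioned above expands to
0xfee00000-0xfeefffff; vfio_unmap() here is a stand-in stub, not a real
API. Splitting makes each piece's size expressible in a __u64.]

#include <stdint.h>
#include <stdio.h>

#define MSI_RANGE_START 0xfee00000ULL
#define MSI_RANGE_END   0xfeefffffULL   /* inclusive */

/* Stand-in for a real unmap call; just reports the range it was given. */
static void vfio_unmap(uint64_t iova, uint64_t size)
{
    printf("unmap iova=0x%llx size=0x%llx\n",
           (unsigned long long)iova, (unsigned long long)size);
}

/* Unmap the full 64-bit space as two pieces, skipping the MSI window. */
static void unmap_full_64bit_space(void)
{
    /* Below the window: [0, MSI_RANGE_START) */
    vfio_unmap(0, MSI_RANGE_START);
    /* Above the window: [MSI_RANGE_END + 1, 2^64); the size fits in 64 bits */
    vfio_unmap(MSI_RANGE_END + 1, 0ULL - (MSI_RANGE_END + 1));
}

int main(void)
{
    unmap_full_64bit_space();
    return 0;
}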
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 12:27 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:

[...]

> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general,
> > subtly avoids it by artificially using a smaller range. VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water mark
> > per container in vfio and truncate unmaps to that high water end
> > address. Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do. Thanks,
>
> Agree that the high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1.

Okay, when I replied I didn't notice this "watermark" may need more
than several (even tens of) LOCs. :(

Considering that I see no further usage of this watermark, I'm wondering
whether it's okay to directly use (1ULL << VTD_MGAW) here as the
watermark - it's simple, efficient and secure imho.

Thanks,

-- peterx
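[A minimal sketch of what bounding the invalidation by the guest address
width could look like. VTD_MGAW's value and the simplified IOTLB-entry
type below are illustrative, not the actual intel_iommu code.]

#include <stdint.h>
#include <stdio.h>

#define VTD_MGAW 39          /* illustrative guest address width */

typedef uint64_t hwaddr;

struct iotlb_entry {         /* simplified stand-in for IOMMUTLBEntry */
    hwaddr iova;
    hwaddr addr_mask;        /* inclusive: covers [iova, iova + addr_mask] */
    int    perm;             /* 0 (IOMMU_NONE) means unmap */
};

/* Invalidate the whole guest IOVA space without needing a size that
 * cannot be expressed in 64 bits: bound it by the guest address width. */
static struct iotlb_entry unmap_whole_guest_space(void)
{
    struct iotlb_entry e = {
        .iova      = 0,
        .addr_mask = (1ULL << VTD_MGAW) - 1,
        .perm      = 0,
    };
    return e;
}

int main(void)
{
    struct iotlb_entry e = unmap_whole_guest_space();
    printf("unmap [0x%llx, 0x%llx]\n",
           (unsigned long long)e.iova,
           (unsigned long long)(e.iova + e.addr_mask));
    return 0;
}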
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-20 17:14 UTC
  To: Peter Xu; +Cc: qemu-devel

On Fri, 20 Jan 2017 20:27:18 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
>
> [...]
>
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water mark
> > > per container in vfio and truncate unmaps to that high water end
> > > address. Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do. Thanks,
> >
> > Agree that the high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1.
>
> Okay, when I replied I didn't notice this "watermark" may need more
> than several (even tens of) LOCs. :(
>
> Considering that I see no further usage of this watermark, I'm wondering
> whether it's okay to directly use (1ULL << VTD_MGAW) here as the
> watermark - it's simple, efficient and secure imho.

Avoiding the issue based on the virtual iommu hardware properties is a
fine solution; my intention was only to discourage introduction of
artificial limitations in the surrounding code to avoid this vfio
issue. Thanks,

Alex
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-22 2:59 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Fri, Jan 20, 2017 at 10:14:01AM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 20:27:18 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
> >
> > [...]
> >
> > > > What I don't want to see is for this API bug to leak out into the rest
> > > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > > you want to make the iommu address space match the device we're trying
> > > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > > the simplest solution would be to track the mapped iova high water mark
> > > > per container in vfio and truncate unmaps to that high water end
> > > > address. Realistically we're probably not going to see iovas at the end
> > > > of the 64-bit address space, but we can come up with some other
> > > > workaround in the vfio code or update the kernel API if we do. Thanks,
> > >
> > > Agree that the high watermark can be a good solution for VT-d. I'll use
> > > that instead of 2^63-1.
> >
> > Okay, when I replied I didn't notice this "watermark" may need more
> > than several (even tens of) LOCs. :(
> >
> > Considering that I see no further usage of this watermark, I'm wondering
> > whether it's okay to directly use (1ULL << VTD_MGAW) here as the
> > watermark - it's simple, efficient and secure imho.
>
> Avoiding the issue based on the virtual iommu hardware properties is a
> fine solution; my intention was only to discourage introduction of
> artificial limitations in the surrounding code to avoid this vfio
> issue. Thanks,

Yes. I have posted a new version of the vfio series. Looking forward to
your further comments (or ack, with luck :) on v4.

Thanks,

-- peterx