* [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

This requirement originates from the VT-d vfio series:

  https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html

The goal of this series is to allow the IOMMU to notify unmap with a very
big IOTLB range, for example, with base=0 and size=2^63-1 (to unmap the
whole address space).

The first patch is good to have, for traces.

The second one is a cleanup of existing code only.

The third one moves the RAM translation and check into the map operation
logic, so that unmap operations no longer need them.

The series is marked as RFC since I am not sure whether this is a
workable way. Anyway, please review to help confirm it.

Thanks.

Peter Xu (3):
  vfio: trace map/unmap for notify as well
  vfio: introduce vfio_get_vaddr()
  vfio: allow to notify unmap for very large region

 hw/vfio/common.c     | 56 ++++++++++++++++++++++++++++++++++------------------
 hw/vfio/trace-events |  2 +-
 2 files changed, 38 insertions(+), 20 deletions(-)

-- 
2.7.4
* [Qemu-devel] [PATCH RFC 1/3] vfio: trace map/unmap for notify as well
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

We trace its range, but we don't know whether it's a MAP or an UNMAP.
Let's dump that as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c     | 3 ++-
 hw/vfio/trace-events | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 801578b..174f351 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -305,7 +305,8 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     void *vaddr;
     int ret;
 
-    trace_vfio_iommu_map_notify(iova, iova + iotlb->addr_mask);
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
 
     if (iotlb->target_as != &address_space_memory) {
         error_report("Wrong target AS \"%s\", only system memory is allowed",
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index ef81609..7ae8233 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -84,7 +84,7 @@ vfio_pci_igd_lpc_bridge_enabled(const char *name) "%s"
 # hw/vfio/common.c
 vfio_region_write(const char *name, int index, uint64_t addr, uint64_t data, unsigned size) " (%s:region%d+0x%"PRIx64", 0x%"PRIx64 ", %d)"
 vfio_region_read(char *name, int index, uint64_t addr, unsigned size, uint64_t data) " (%s:region%d+0x%"PRIx64", %d) = 0x%"PRIx64
-vfio_iommu_map_notify(uint64_t iova_start, uint64_t iova_end) "iommu map @ %"PRIx64" - %"PRIx64
+vfio_iommu_map_notify(const char *op, uint64_t iova_start, uint64_t iova_end) "iommu %s @ %"PRIx64" - %"PRIx64
 vfio_listener_region_add_skip(uint64_t start, uint64_t end) "SKIPPING region_add %"PRIx64" - %"PRIx64
 vfio_listener_region_add_iommu(uint64_t start, uint64_t end) "region_add [iommu] %"PRIx64" - %"PRIx64
 vfio_listener_region_add_ram(uint64_t iova_start, uint64_t iova_end, void *vaddr) "region_add [ram] %"PRIx64" - %"PRIx64" [%p]"
-- 
2.7.4
* [Qemu-devel] [PATCH RFC 2/3] vfio: introduce vfio_get_vaddr()
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

A cleanup for vfio_iommu_map_notify(). Should have no functional change,
just to make the function shorter and easier to understand.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 58 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 174f351..ce55dff 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -294,25 +294,14 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
            section->offset_within_address_space & (1ULL << 63);
 }
 
-static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+static bool vfio_get_vaddr(IOMMUTLBEntry *iotlb, void **vaddr,
+                           bool *read_only)
 {
-    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
-    VFIOContainer *container = giommu->container;
-    hwaddr iova = iotlb->iova + giommu->iommu_offset;
     MemoryRegion *mr;
     hwaddr xlat;
     hwaddr len = iotlb->addr_mask + 1;
-    void *vaddr;
-    int ret;
-
-    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
-                                iova, iova + iotlb->addr_mask);
-
-    if (iotlb->target_as != &address_space_memory) {
-        error_report("Wrong target AS \"%s\", only system memory is allowed",
-                     iotlb->target_as->name ? iotlb->target_as->name : "none");
-        return;
-    }
+    bool ret = false;
+    bool writable = iotlb->perm & IOMMU_WO;
 
     /*
      * The IOMMU TLB entry we have just covers translation through
@@ -322,12 +311,13 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     rcu_read_lock();
     mr = address_space_translate(&address_space_memory,
                                  iotlb->translated_addr,
-                                 &xlat, &len, iotlb->perm & IOMMU_WO);
+                                 &xlat, &len, writable);
     if (!memory_region_is_ram(mr)) {
         error_report("iommu map to non memory area %"HWADDR_PRIx"",
                      xlat);
         goto out;
     }
+
     /*
      * Translation truncates length to the IOMMU page size,
      * check that it did not truncate too much.
@@ -337,11 +327,41 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         goto out;
     }
 
+    *vaddr = memory_region_get_ram_ptr(mr) + xlat;
+    *read_only = !writable || mr->readonly;
+    ret = true;
+
+out:
+    rcu_read_unlock();
+    return ret;
+}
+
+static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
+{
+    VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+    VFIOContainer *container = giommu->container;
+    hwaddr iova = iotlb->iova + giommu->iommu_offset;
+    bool read_only;
+    void *vaddr;
+    int ret;
+
+    trace_vfio_iommu_map_notify(iotlb->perm == IOMMU_NONE ? "UNMAP" : "MAP",
+                                iova, iova + iotlb->addr_mask);
+
+    if (iotlb->target_as != &address_space_memory) {
+        error_report("Wrong target AS \"%s\", only system memory is allowed",
+                     iotlb->target_as->name ? iotlb->target_as->name : "none");
+        return;
+    }
+
+    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+        return;
+    }
+
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
-        vaddr = memory_region_get_ram_ptr(mr) + xlat;
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
-                           !(iotlb->perm & IOMMU_WO) || mr->readonly);
+                           read_only);
         if (ret) {
             error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
                          "0x%"HWADDR_PRIx", %p) = %d (%m)",
@@ -357,8 +377,6 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
                          iotlb->addr_mask + 1, ret);
         }
     }
-out:
-    rcu_read_unlock();
 }
 
 static void vfio_listener_region_add(MemoryListener *listener,
-- 
2.7.4
* [Qemu-devel] [PATCH RFC 3/3] vfio: allow to notify unmap for very large region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-19 9:25 UTC
  To: qemu-devel; +Cc: Alex Williamson, peterx

The Linux vfio driver supports VFIO_IOMMU_UNMAP_DMA on a very big region.
This can be leveraged by the QEMU IOMMU implementation to clean up
existing page mappings for an entire iova address space (by notifying
with an IOTLB that has an extremely large addr_mask). However, the
current vfio_iommu_map_notify() does not allow that: it makes sure that
every translated address in the IOTLB falls into a RAM range.

The check makes sense, but it is only sensible for map operations and
means little for unmap operations.

This patch moves the check into the map logic only, so that we get
faster unmap handling (no need to translate again), and we can then
better support unmapping a very big region even when it covers non-RAM
ranges or ranges that do not exist at all.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index ce55dff..4d90844 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -354,11 +354,10 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
         return;
     }
 
-    if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
-        return;
-    }
-
     if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
+        if (!vfio_get_vaddr(iotlb, &vaddr, &read_only)) {
+            return;
+        }
         ret = vfio_dma_map(container, iova,
                            iotlb->addr_mask + 1, vaddr,
                            read_only);
-- 
2.7.4
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-19 17:54 UTC
  To: Peter Xu; +Cc: qemu-devel

On Thu, 19 Jan 2017 17:25:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> This requirement originates from the VT-d vfio series:
>
> https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
>
> The goal of this series is to allow the IOMMU to notify unmap with a
> very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> unmap the whole address space).
>
> The first patch is good to have, for traces.
>
> The second one is a cleanup of existing code only.

Sort of, it does add some overhead to the unmap path, but you remove
that and more in the third patch.

> The third one moves the RAM translation and check into the map
> operation logic, so that unmap operations no longer need them.
>
> The series is marked as RFC since I am not sure whether this is a
> workable way. Anyway, please review to help confirm it.

It seems reasonable to me, except for the example here of using 2^63-1,
which I expect is to work around the vfio {un}map API bug as we
discussed on irc. For everyone, the root of the problem is that the
ioctls use:

	__u64   iova;        /* IO virtual address */
	__u64   size;        /* Size of mapping (bytes) */

So we can only express a size of 2^64-1 and we have an off-by-one error
trying to perform a map/unmap of an entire 64-bit space. Note when
designing an API, use start/end rather than base/size to avoid this.

What I don't want to see is for this API bug to leak out into the rest
of the QEMU code such that intel_iommu code, or iommu code in general,
subtly avoids it by artificially using a smaller range. VT-d hardware
has an actual physical address space of either 2^39 or 2^48 bits, so if
you want to make the iommu address space match the device we're trying
to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
64-bit address space on the IOMMU, so to handle that case I'd expect
the simplest solution would be to track the mapped iova high water mark
per container in vfio and truncate unmaps to that high water end
address. Realistically we're probably not going to see iovas at the end
of the 64-bit address space, but we can come up with some other
workaround in the vfio code or update the kernel API if we do. Thanks,

Alex
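[As an aside on the base/size versus start/end point above, here is a
minimal C sketch of the off-by-one. The struct names and fields are made
up for illustration only; they are not the actual vfio uAPI definitions.]

#include <stdint.h>
#include <stdio.h>

/* Hypothetical base/size request, mirroring the shape of the ioctl
 * fields quoted above; a __u64 size can never express 2^64 bytes. */
struct range_by_size {
    uint64_t iova;
    uint64_t size;
};

/* Hypothetical start/end (inclusive) request that avoids the problem. */
struct range_by_end {
    uint64_t start;
    uint64_t end;    /* inclusive, so 0..UINT64_MAX covers everything */
};

int main(void)
{
    /* Covering the whole 64-bit space needs size 2^64, which wraps to 0
     * in a uint64_t; the best we can do is one byte short. */
    struct range_by_size s = { .iova = 0, .size = UINT64_MAX };
    printf("base/size reaches 0x%llx (one byte short)\n",
           (unsigned long long)(s.iova + s.size - 1));

    /* With an inclusive end there is no off-by-one. */
    struct range_by_end e = { .start = 0, .end = UINT64_MAX };
    printf("start/end reaches 0x%llx\n", (unsigned long long)e.end);
    return 0;
}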
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 3:43 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> On Thu, 19 Jan 2017 17:25:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > This requirement originates from the VT-d vfio series:
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> >
> > The goal of this series is to allow the IOMMU to notify unmap with a
> > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > unmap the whole address space).
> >
> > The first patch is good to have, for traces.
> >
> > The second one is a cleanup of existing code only.
>
> Sort of, it does add some overhead to the unmap path, but you remove
> that and more in the third patch.

Hmm, yes, I tried to get the ram pointer even for unmap. I should
remove the ending "only".

> > The third one moves the RAM translation and check into the map
> > operation logic, so that unmap operations no longer need them.
> >
> > The series is marked as RFC since I am not sure whether this is a
> > workable way. Anyway, please review to help confirm it.
>
> It seems reasonable to me,

Good to know that you didn't disagree on this. :) Then let me take
these patches along with that series in the next post (since that
series will depend on this one, I'll keep them together; maintainers,
please decide how you want to merge them),

> except for the example here of using 2^63-1,
> which I expect is to work around the vfio {un}map API bug as we
> discussed on irc.

Actually not only for that one; I don't know whether we have a problem
here:

(I mentioned this on IRC; in case you missed it, I paste it here as
well)

static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
                                      dma_addr_t start, size_t size)
{
        struct rb_node *node = iommu->dma_list.rb_node;

        while (node) {
                struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);

                if (start + size <= dma->iova)
                        node = node->rb_left;
                else if (start >= dma->iova + dma->size)
                                  ^^^^^^^^^^^^^^^^^^^^^
                        Could it overflow here?  <--------------
                        node = node->rb_right;
                else
                        return dma;
        }

        return NULL;
}

I used 2^63-1 to try to avoid both cases.

> For everyone, the root of the problem is that the
> ioctls use:
>
> 	__u64   iova;        /* IO virtual address */
> 	__u64   size;        /* Size of mapping (bytes) */
>
> So we can only express a size of 2^64-1 and we have an off-by-one error
> trying to perform a map/unmap of an entire 64-bit space. Note when
> designing an API, use start/end rather than base/size to avoid this.

Lesson learned.

> What I don't want to see is for this API bug to leak out into the rest
> of the QEMU code such that intel_iommu code, or iommu code in general,
> subtly avoids it by artificially using a smaller range. VT-d hardware
> has an actual physical address space of either 2^39 or 2^48 bits, so if
> you want to make the iommu address space match the device we're trying
> to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> 64-bit address space on the IOMMU, so to handle that case I'd expect
> the simplest solution would be to track the mapped iova high water mark
> per container in vfio and truncate unmaps to that high water end
> address. Realistically we're probably not going to see iovas at the end
> of the 64-bit address space, but we can come up with some other
> workaround in the vfio code or update the kernel API if we do. Thanks,

Agree that the high watermark can be a good solution for VT-d. I'll use
that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
still need to solve the existing issues in the future, since in that
case the high watermark can be (hwaddr)-1 as well if the guest
specifies it.

(Though anyway I'm not sure when AMD vIOMMUs will be ready for device
assignment, so I don't think that's a big issue at least for now)

Thanks!

-- peterx
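[For reference, a rough sketch of the per-container high water mark idea
discussed above, under the assumption that the map path has already
rejected overflowing ranges. The struct and helper names are illustrative,
not the actual vfio code.]

#include <stdint.h>
#include <stdio.h>

typedef uint64_t hwaddr;

struct container_state {
    hwaddr iova_hwm;   /* one past the highest mapped iova, 0 if none */
};

/* Called on the map path; the map path has already checked overflow. */
static void record_map(struct container_state *c, hwaddr iova, hwaddr size)
{
    hwaddr end = iova + size;
    if (end > c->iova_hwm) {
        c->iova_hwm = end;
    }
}

/* Called on the unmap path: truncate the request to the high water mark. */
static hwaddr clamp_unmap_size(const struct container_state *c,
                               hwaddr iova, hwaddr size)
{
    if (iova >= c->iova_hwm) {
        return 0;                      /* nothing mapped at or above iova */
    }
    if (size > c->iova_hwm - iova) {
        size = c->iova_hwm - iova;
    }
    return size;
}

int main(void)
{
    struct container_state c = { .iova_hwm = 0 };
    record_map(&c, 0x1000, 0x2000);    /* map [0x1000, 0x3000) */
    /* An "unmap everything" request gets clamped to the high water mark. */
    printf("clamped size: 0x%llx\n",
           (unsigned long long)clamp_unmap_size(&c, 0, UINT64_MAX));
    return 0;
}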
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-20 4:21 UTC
  To: Peter Xu; +Cc: qemu-devel

On Fri, 20 Jan 2017 11:43:28 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > On Thu, 19 Jan 2017 17:25:29 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > This requirement originates from the VT-d vfio series:
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > >
> > > The goal of this series is to allow the IOMMU to notify unmap with a
> > > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > > unmap the whole address space).
> > >
> > > The first patch is good to have, for traces.
> > >
> > > The second one is a cleanup of existing code only.
> >
> > Sort of, it does add some overhead to the unmap path, but you remove
> > that and more in the third patch.
>
> Hmm, yes, I tried to get the ram pointer even for unmap. I should
> remove the ending "only".
>
> > > The third one moves the RAM translation and check into the map
> > > operation logic, so that unmap operations no longer need them.
> > >
> > > The series is marked as RFC since I am not sure whether this is a
> > > workable way. Anyway, please review to help confirm it.
> >
> > It seems reasonable to me,
>
> Good to know that you didn't disagree on this. :) Then let me take
> these patches along with that series in the next post (since that
> series will depend on this one, I'll keep them together; maintainers,
> please decide how you want to merge them),
>
> > except for the example here of using 2^63-1,
> > which I expect is to work around the vfio {un}map API bug as we
> > discussed on irc.
>
> Actually not only for that one; I don't know whether we have a problem
> here:
>
> (I mentioned this on IRC; in case you missed it, I paste it here as
> well)
>
> static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
>                                       dma_addr_t start, size_t size)
> {
>         struct rb_node *node = iommu->dma_list.rb_node;
>
>         while (node) {
>                 struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
>
>                 if (start + size <= dma->iova)
>                         node = node->rb_left;
>                 else if (start >= dma->iova + dma->size)
>                                   ^^^^^^^^^^^^^^^^^^^^^
>                         Could it overflow here?  <--------------
>                         node = node->rb_right;
>                 else
>                         return dma;
>         }
>
>         return NULL;
> }
>
> I used 2^63-1 to try to avoid both cases.

Perhaps I'm not seeing the overflow you're trying to identify,
but dma->iova + dma->size cannot overflow, these are existing DMA
mappings and the DMA mapping path does test for overflow. I think we
can consider those sanitized. I am wondering if the test above it is
suspect though, start + size doesn't seem to be checked for overflow on
the unmap path. It seems pretty easy to avoid from the user side
though, but maybe we should add a test for it in the kernel.

> > For everyone, the root of the problem is that the
> > ioctls use:
> >
> > 	__u64   iova;        /* IO virtual address */
> > 	__u64   size;        /* Size of mapping (bytes) */
> >
> > So we can only express a size of 2^64-1 and we have an off-by-one error
> > trying to perform a map/unmap of an entire 64-bit space. Note when
> > designing an API, use start/end rather than base/size to avoid this.
>
> Lesson learned.
>
> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general,
> > subtly avoids it by artificially using a smaller range. VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water mark
> > per container in vfio and truncate unmaps to that high water end
> > address. Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do. Thanks,
>
> Agree that the high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> still need to solve the existing issues in the future, since in that
> case the high watermark can be (hwaddr)-1 as well if the guest
> specifies it.
>
> (Though anyway I'm not sure when AMD vIOMMUs will be ready for device
> assignment, so I don't think that's a big issue at least for now)

Even with a true 64bit address space, it seems very unlikely we'd have
iovas at the very top of that address space. If we do, another idea
would be that iommus often have a reserved iova range for MSI mapping;
both VT-d and AMD-Vi do. We know there would be no user iovas within
those ranges, so it would make a clean point to split a full 64-bit
unmap in two. Hopefully we'll have reporting of that reserved range up
through vfio should/when we need something like that. We could also
simply track both the start and end of that mapping that sets the high
water mark, unmapping all except that, then that alone as two separate
ioctls. I'm not too concerned that we'll be able to work around it if
we need to. Thanks,

Alex
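[A small sketch of the user-side (or kernel-side) overflow guard Alex
mentions for the unmap path. The helper name is hypothetical; it only
illustrates the check, not the actual kernel change.]

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

/* Reject (start, size) unmap requests whose end wraps around 64 bits:
 * start + size overflows exactly when size > UINT64_MAX - start. */
static bool unmap_range_is_valid(uint64_t start, uint64_t size)
{
    if (size == 0) {
        return false;               /* nothing to unmap */
    }
    return size <= UINT64_MAX - start;
}

int main(void)
{
    /* 0 + (2^64-1) does not wrap; 1 + (2^64-1) would. */
    printf("%d %d\n", unmap_range_is_valid(0, UINT64_MAX),
                      unmap_range_is_valid(1, UINT64_MAX));
    return 0;
}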
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 4:45 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Thu, Jan 19, 2017 at 09:21:10PM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 11:43:28 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Thu, Jan 19, 2017 at 10:54:37AM -0700, Alex Williamson wrote:
> > > On Thu, 19 Jan 2017 17:25:29 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >
> > > > This requirement originates from the VT-d vfio series:
> > > >
> > > > https://lists.nongnu.org/archive/html/qemu-devel/2017-01/msg03495.html
> > > >
> > > > The goal of this series is to allow the IOMMU to notify unmap with a
> > > > very big IOTLB range, for example, with base=0 and size=2^63-1 (to
> > > > unmap the whole address space).
> > > >
> > > > The first patch is good to have, for traces.
> > > >
> > > > The second one is a cleanup of existing code only.
> > >
> > > Sort of, it does add some overhead to the unmap path, but you remove
> > > that and more in the third patch.
> >
> > Hmm, yes, I tried to get the ram pointer even for unmap. I should
> > remove the ending "only".
> >
> > > > The third one moves the RAM translation and check into the map
> > > > operation logic, so that unmap operations no longer need them.
> > > >
> > > > The series is marked as RFC since I am not sure whether this is a
> > > > workable way. Anyway, please review to help confirm it.
> > >
> > > It seems reasonable to me,
> >
> > Good to know that you didn't disagree on this. :) Then let me take
> > these patches along with that series in the next post (since that
> > series will depend on this one, I'll keep them together; maintainers,
> > please decide how you want to merge them),
> >
> > > except for the example here of using 2^63-1,
> > > which I expect is to work around the vfio {un}map API bug as we
> > > discussed on irc.
> >
> > Actually not only for that one; I don't know whether we have a problem
> > here:
> >
> > (I mentioned this on IRC; in case you missed it, I paste it here as
> > well)
> >
> > static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> >                                       dma_addr_t start, size_t size)
> > {
> >         struct rb_node *node = iommu->dma_list.rb_node;
> >
> >         while (node) {
> >                 struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> >
> >                 if (start + size <= dma->iova)
> >                         node = node->rb_left;
> >                 else if (start >= dma->iova + dma->size)
> >                                   ^^^^^^^^^^^^^^^^^^^^^
> >                         Could it overflow here?  <--------------
> >                         node = node->rb_right;
> >                 else
> >                         return dma;
> >         }
> >
> >         return NULL;
> > }
> >
> > I used 2^63-1 to try to avoid both cases.
>
> Perhaps I'm not seeing the overflow you're trying to identify,
> but dma->iova + dma->size cannot overflow, these are existing DMA
> mappings and the DMA mapping path does test for overflow. I think we
> can consider those sanitized. I am wondering if the test above it is
> suspect though, start + size doesn't seem to be checked for overflow on
> the unmap path. It seems pretty easy to avoid from the user side
> though, but maybe we should add a test for it in the kernel.

Ah, yes. :-)

> > > For everyone, the root of the problem is that the
> > > ioctls use:
> > >
> > > 	__u64   iova;        /* IO virtual address */
> > > 	__u64   size;        /* Size of mapping (bytes) */
> > >
> > > So we can only express a size of 2^64-1 and we have an off-by-one error
> > > trying to perform a map/unmap of an entire 64-bit space. Note when
> > > designing an API, use start/end rather than base/size to avoid this.
> >
> > Lesson learned.
> >
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water mark
> > > per container in vfio and truncate unmaps to that high water end
> > > address. Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do. Thanks,
> >
> > Agree that the high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1. Though for AMD, if it supports 64 bits, we may
> > still need to solve the existing issues in the future, since in that
> > case the high watermark can be (hwaddr)-1 as well if the guest
> > specifies it.
> >
> > (Though anyway I'm not sure when AMD vIOMMUs will be ready for device
> > assignment, so I don't think that's a big issue at least for now)
>
> Even with a true 64bit address space, it seems very unlikely we'd have
> iovas at the very top of that address space.

Yes, especially if the IOVA is allocated inside the kernel. However,
since we have DPDK these days, that's still possible? I remember Jason
mentioned an OOM attack against such a tree: since we have a per-domain
tree in kernel vfio, it's also possible that an aggressive guest (DPDK
user) maps in a special pattern that lets the tree grow very big and
finally exhausts host memory. Is this a problem as well?

> If we do, another idea
> would be that iommus often have a reserved iova range for MSI mapping;
> both VT-d and AMD-Vi do. We know there would be no user iovas within
> those ranges, so it would make a clean point to split a full 64-bit
> unmap in two. Hopefully we'll have reporting of that reserved range up
> through vfio should/when we need something like that.

Since we are here... IIUC VT-d has reserved 0xfeeXXXXX for MSI. Do we
have protection in the Intel IOMMU driver so that the user cannot use an
IOVA inside this range (I guess not)? Though this is totally off topic:
even if this is a problem, it would be the intel-iommu driver's.

> We could also
> simply track both the start and end of that mapping that sets the high
> water mark, unmapping all except that, then that alone as two separate
> ioctls. I'm not too concerned that we'll be able to work around it if
> we need to. Thanks,

Sure.

Thanks,

-- peterx
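[A sketch of the "split the unmap around the reserved MSI window" idea,
assuming the VT-d 0xfeeXXXXX window mentioned above expands to
0xfee00000-0xfeefffff; vfio_unmap() here is a stand-in stub, not a real
API. Splitting makes each piece's size expressible in a __u64.]

#include <stdint.h>
#include <stdio.h>

#define MSI_RANGE_START 0xfee00000ULL
#define MSI_RANGE_END   0xfeefffffULL   /* inclusive */

/* Stand-in for a real unmap call; just reports the range it was given. */
static void vfio_unmap(uint64_t iova, uint64_t size)
{
    printf("unmap iova=0x%llx size=0x%llx\n",
           (unsigned long long)iova, (unsigned long long)size);
}

/* Unmap the full 64-bit space as two pieces, skipping the MSI window. */
static void unmap_full_64bit_space(void)
{
    /* Below the window: [0, MSI_RANGE_START) */
    vfio_unmap(0, MSI_RANGE_START);
    /* Above the window: [MSI_RANGE_END + 1, 2^64); the size fits in 64 bits */
    vfio_unmap(MSI_RANGE_END + 1, 0ULL - (MSI_RANGE_END + 1));
}

int main(void)
{
    unmap_full_64bit_space();
    return 0;
}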
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-20 12:27 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:

[...]

> > What I don't want to see is for this API bug to leak out into the rest
> > of the QEMU code such that intel_iommu code, or iommu code in general,
> > subtly avoids it by artificially using a smaller range. VT-d hardware
> > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > you want to make the iommu address space match the device we're trying
> > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > the simplest solution would be to track the mapped iova high water mark
> > per container in vfio and truncate unmaps to that high water end
> > address. Realistically we're probably not going to see iovas at the end
> > of the 64-bit address space, but we can come up with some other
> > workaround in the vfio code or update the kernel API if we do. Thanks,
>
> Agree that the high watermark can be a good solution for VT-d. I'll use
> that instead of 2^63-1.

Okay, when I replied I didn't notice this "watermark" may need more
than several (even tens of) LOCs. :(

Considering that I see no further usage of this watermark, I'm wondering
whether it's okay to directly use (1ULL << VTD_MGAW) here as the
watermark - it's simple, efficient and secure imho.

Thanks,

-- peterx
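[A minimal sketch of what bounding the invalidation by the guest address
width could look like. VTD_MGAW's value and the simplified IOTLB-entry
type below are illustrative, not the actual intel_iommu code.]

#include <stdint.h>
#include <stdio.h>

#define VTD_MGAW 39          /* illustrative guest address width */

typedef uint64_t hwaddr;

struct iotlb_entry {         /* simplified stand-in for IOMMUTLBEntry */
    hwaddr iova;
    hwaddr addr_mask;        /* inclusive: covers [iova, iova + addr_mask] */
    int    perm;             /* 0 (IOMMU_NONE) means unmap */
};

/* Invalidate the whole guest IOVA space without needing a size that
 * cannot be expressed in 64 bits: bound it by the guest address width. */
static struct iotlb_entry unmap_whole_guest_space(void)
{
    struct iotlb_entry e = {
        .iova      = 0,
        .addr_mask = (1ULL << VTD_MGAW) - 1,
        .perm      = 0,
    };
    return e;
}

int main(void)
{
    struct iotlb_entry e = unmap_whole_guest_space();
    printf("unmap [0x%llx, 0x%llx]\n",
           (unsigned long long)e.iova,
           (unsigned long long)(e.iova + e.addr_mask));
    return 0;
}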
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Alex Williamson @ 2017-01-20 17:14 UTC
  To: Peter Xu; +Cc: qemu-devel

On Fri, 20 Jan 2017 20:27:18 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
>
> [...]
>
> > > What I don't want to see is for this API bug to leak out into the rest
> > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > you want to make the iommu address space match the device we're trying
> > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > the simplest solution would be to track the mapped iova high water mark
> > > per container in vfio and truncate unmaps to that high water end
> > > address. Realistically we're probably not going to see iovas at the end
> > > of the 64-bit address space, but we can come up with some other
> > > workaround in the vfio code or update the kernel API if we do. Thanks,
> >
> > Agree that the high watermark can be a good solution for VT-d. I'll use
> > that instead of 2^63-1.
>
> Okay, when I replied I didn't notice this "watermark" may need more
> than several (even tens of) LOCs. :(
>
> Considering that I see no further usage of this watermark, I'm wondering
> whether it's okay to directly use (1ULL << VTD_MGAW) here as the
> watermark - it's simple, efficient and secure imho.

Avoiding the issue based on the virtual iommu hardware properties is a
fine solution; my intention was only to discourage introduction of
artificial limitations in the surrounding code to avoid this vfio
issue. Thanks,

Alex
* Re: [Qemu-devel] [PATCH RFC 0/3] vfio: allow to notify unmap for very big region
  From: Peter Xu <peterx@redhat.com> @ 2017-01-22 2:59 UTC
  To: Alex Williamson; +Cc: qemu-devel

On Fri, Jan 20, 2017 at 10:14:01AM -0700, Alex Williamson wrote:
> On Fri, 20 Jan 2017 20:27:18 +0800
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Fri, Jan 20, 2017 at 11:43:28AM +0800, Peter Xu wrote:
> >
> > [...]
> >
> > > > What I don't want to see is for this API bug to leak out into the rest
> > > > of the QEMU code such that intel_iommu code, or iommu code in general,
> > > > subtly avoids it by artificially using a smaller range. VT-d hardware
> > > > has an actual physical address space of either 2^39 or 2^48 bits, so if
> > > > you want to make the iommu address space match the device we're trying
> > > > to emulate, that's perfectly fine. AIUI, AMD-Vi does actually have a
> > > > 64-bit address space on the IOMMU, so to handle that case I'd expect
> > > > the simplest solution would be to track the mapped iova high water mark
> > > > per container in vfio and truncate unmaps to that high water end
> > > > address. Realistically we're probably not going to see iovas at the end
> > > > of the 64-bit address space, but we can come up with some other
> > > > workaround in the vfio code or update the kernel API if we do. Thanks,
> > >
> > > Agree that the high watermark can be a good solution for VT-d. I'll use
> > > that instead of 2^63-1.
> >
> > Okay, when I replied I didn't notice this "watermark" may need more
> > than several (even tens of) LOCs. :(
> >
> > Considering that I see no further usage of this watermark, I'm wondering
> > whether it's okay to directly use (1ULL << VTD_MGAW) here as the
> > watermark - it's simple, efficient and secure imho.
>
> Avoiding the issue based on the virtual iommu hardware properties is a
> fine solution; my intention was only to discourage introduction of
> artificial limitations in the surrounding code to avoid this vfio
> issue. Thanks,

Yes. I have posted a new version of the vfio series. Looking forward to
your further comments (or ack, with luck :) on v4.

Thanks,

-- peterx