From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:35248) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aOlMY-0001sT-0D for qemu-devel@nongnu.org; Thu, 28 Jan 2016 07:07:49 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aOlMS-0007a2-D6 for qemu-devel@nongnu.org; Thu, 28 Jan 2016 07:07:45 -0500 Received: from e06smtp10.uk.ibm.com ([195.75.94.106]:55909) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aOlMR-0007Yz-W8 for qemu-devel@nongnu.org; Thu, 28 Jan 2016 07:07:40 -0500 Received: from localhost by e06smtp10.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Thu, 28 Jan 2016 12:07:37 -0000 Received: from b06cxnps3075.portsmouth.uk.ibm.com (d06relay10.portsmouth.uk.ibm.com [9.149.109.195]) by d06dlp01.portsmouth.uk.ibm.com (Postfix) with ESMTP id 7096917D8056 for ; Thu, 28 Jan 2016 12:07:44 +0000 (GMT) Received: from d06av03.portsmouth.uk.ibm.com (d06av03.portsmouth.uk.ibm.com [9.149.37.213]) by b06cxnps3075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id u0SC7ZqM3080682 for ; Thu, 28 Jan 2016 12:07:35 GMT Received: from d06av03.portsmouth.uk.ibm.com (localhost [127.0.0.1]) by d06av03.portsmouth.uk.ibm.com (8.14.4/8.14.4/NCO v10.0 AVout) with ESMTP id u0SC7YVP021850 for ; Thu, 28 Jan 2016 05:07:35 -0700 References: <1452611505-25478-1-git-send-email-pmorel@linux.vnet.ibm.com> <1452622595.9674.19.camel@redhat.com> <569FA454.6050409@linux.vnet.ibm.com> <1453304819.32741.277.camel@redhat.com> <56A0D9F4.1060708@linux.vnet.ibm.com> <1453500876.32741.465.camel@redhat.com> <1453501156.32741.468.camel@redhat.com> <56A787DC.6060905@linux.vnet.ibm.com> <1453827630.26652.71.camel@redhat.com> <56A88DD9.3040004@linux.vnet.ibm.com> <1453916634.6261.7.camel@redhat.com> From: Pierre Morel Message-ID: <56AA0485.1090204@linux.vnet.ibm.com> Date: Thu, 28 Jan 2016 13:07:33 +0100 MIME-Version: 1.0 In-Reply-To: <1453916634.6261.7.camel@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH v3] vfio/common: Check iova with limit not with size List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: pbonzini@redhat.com, qemu-devel@nongnu.org, peter.maydell@linaro.org On 01/27/2016 06:43 PM, Alex Williamson wrote: > On Wed, 2016-01-27 at 10:28 +0100, Pierre Morel wrote: >> >> On 01/26/2016 06:00 PM, Alex Williamson wrote: >>> On Tue, 2016-01-26 at 15:51 +0100, Pierre Morel wrote: >>>> On 01/22/2016 11:19 PM, Alex Williamson wrote: >>>>> On Fri, 2016-01-22 at 15:14 -0700, Alex Williamson wrote: >>>>>> On Thu, 2016-01-21 at 14:15 +0100, Pierre Morel wrote: >>>>>>> On 01/20/2016 04:46 PM, Alex Williamson wrote: >>>>>>>> On Wed, 2016-01-20 at 16:14 +0100, Pierre Morel wrote: >>>>>>>>> On 01/12/2016 07:16 PM, Alex Williamson wrote: >>>>>>>>>> On Tue, 2016-01-12 at 16:11 +0100, Pierre Morel wrote: >>>>>>>>>>> In vfio_listener_region_add(), we try to validate that the region >>>>>>>>>>> is >>>>>>>>>>> not >>>>>>>>>>> zero sized and hasn't overflowed the addresses space. >>>>>>>>>>> >>>>>>>>>>> But the calculation uses the size of the region instead of >>>>>>>>>>> using the region's limit (size - 1). >>>>>>>>>>> >>>>>>>>>>> This leads to Int128 overflow when the region has >>>>>>>>>>> been initialized to UINT64_MAX because in this case >>>>>>>>>>> memory_region_init() transform the size from UINT64_MAX >>>>>>>>>>> to int128_2_64(). >>>>>>>>>>> >>>>>>>>>>> Let's really use the limit by sustracting one to the size >>>>>>>>>>> and take care to use the limit for functions using limit >>>>>>>>>>> and size to call functions which need size. >>>>>>>>>>> >>>>>>>>>>> Signed-off-by: Pierre Morel >>>>>>>>>>> --- >>>>>>>>>>> >>>>>>>>>>> Changes from v2: >>>>>>>>>>> - all, just ignore v2, sorry about this, >>>>>>>>>>> this is build after v1 >>>>>>>>>>> >>>>>>>>>>> Changes from v1: >>>>>>>>>>> - adjust the tests by knowing we already substracted one to >>>>>>>>>>> end. >>>>>>>>>>> >>>>>>>>>>> hw/vfio/common.c | 14 +++++++------- >>>>>>>>>>> 1 files changed, 7 insertions(+), 7 deletions(-) >>>>>>>>>>> >>>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>>>>>> index 6797208..a5f6643 100644 >>>>>>>>>>> --- a/hw/vfio/common.c >>>>>>>>>>> +++ b/hw/vfio/common.c >>>>>>>>>>> @@ -348,12 +348,12 @@ static void >>>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>>> if (int128_ge(int128_make64(iova), llend)) { >>>>>>>>>>> return; >>>>>>>>>>> } >>>>>>>>>>> - end = int128_get64(llend); >>>>>>>>>>> + end = int128_get64(int128_sub(llend, int128_one())); >>>>>>>>>>> >>>>>>>>>>> - if ((iova < container->min_iova) || ((end - 1) > container- >>>>>>>>>>>> max_iova)) { >>>>>>>>>>> + if ((iova < container->min_iova) || (end > container- >>>>>>>>>>>> max_iova)) { >>>>>>>>>>> error_report("vfio: IOMMU container %p can't map guest >>>>>>>>>>> IOVA >>>>>>>>>>> region" >>>>>>>>>>> " 0x%"HWADDR_PRIx"..0x%"HWADDR_PRIx, >>>>>>>>>>> - container, iova, end - 1); >>>>>>>>>>> + container, iova, end); >>>>>>>>>>> ret = -EFAULT; >>>>>>>>>>> goto fail; >>>>>>>>>>> } >>>>>>>>>>> @@ -363,7 +363,7 @@ static void >>>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>>> if (memory_region_is_iommu(section->mr)) { >>>>>>>>>>> VFIOGuestIOMMU *giommu; >>>>>>>>>>> >>>>>>>>>>> - trace_vfio_listener_region_add_iommu(iova, end - 1); >>>>>>>>>>> + trace_vfio_listener_region_add_iommu(iova, end); >>>>>>>>>>> /* >>>>>>>>>>> * FIXME: We should do some checking to see if the >>>>>>>>>>> * capabilities of the host VFIO IOMMU are adequate to >>>>>>>>>>> model >>>>>>>>>>> @@ -394,13 +394,13 @@ static void >>>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>>> section->offset_within_region + >>>>>>>>>>> (iova - section->offset_within_address_space); >>>>>>>>>>> >>>>>>>>>>> - trace_vfio_listener_region_add_ram(iova, end - 1, vaddr); >>>>>>>>>>> + trace_vfio_listener_region_add_ram(iova, end, vaddr); >>>>>>>>>>> >>>>>>>>>>> - ret = vfio_dma_map(container, iova, end - iova, vaddr, >>>>>>>>>>> section- >>>>>>>>>>>> readonly); >>>>>>>>>>> + ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, >>>>>>>>>>> section->readonly); >>>>>>>>>>> if (ret) { >>>>>>>>>>> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>>>>>>>>>> "0x%"HWADDR_PRIx", %p) = %d (%m)", >>>>>>>>>>> - container, iova, end - iova, vaddr, ret); >>>>>>>>>>> + container, iova, end - iova + 1, vaddr, >>>>>>>>>>> ret); >>>>>>>>>>> goto fail; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>> Hmm, did we just push the overflow from one place to another? If >>>>>>>>>> we're >>>>>>>>>> mapping a full region of size int128_2_64() starting at iova zero, >>>>>>>>>> then >>>>>>>>>> this becomes (0xffff_ffff_ffff_ffff - 0 + 1) = 0. So I think we >>>>>>>>>> need >>>>>>>>>> to calculate size with 128bit arithmetic too and let it assert if >>>>>>>>>> we >>>>>>>>>> overflow, ie: >>>>>>>>>> >>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>>>>> index a5f6643..13ad90b 100644 >>>>>>>>>> --- a/hw/vfio/common.c >>>>>>>>>> +++ b/hw/vfio/common.c >>>>>>>>>> @@ -321,7 +321,7 @@ static void >>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>> MemoryRegionSection >>>>>>>>>> *section) >>>>>>>>>> { >>>>>>>>>> VFIOContainer *container = container_of(listener, >>>>>>>>>> VFIOContainer, listener); >>>>>>>>>> - hwaddr iova, end; >>>>>>>>>> + hwaddr iova, end, size; >>>>>>>>>> Int128 llend; >>>>>>>>>> void *vaddr; >>>>>>>>>> int ret; >>>>>>>>>> @@ -348,7 +348,9 @@ static void >>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>> if (int128_ge(int128_make64(iova), llend)) { >>>>>>>>>> return; >>>>>>>>>> } >>>>>>>>>> + >>>>>>>>>> end = int128_get64(int128_sub(llend, int128_one())); >>>>>>>>>> + size = int128_get64(int128_sub(llend, int128_make64(iova))); >>>>>>>>> here again, if iova is null, since llend is section->size (2^64) ... >>>>>>>>> >>>>>>>>>> >>>>>>>>>> if ((iova < container->min_iova) || (end > container- >>>>>>>>>>> max_iova)) { >>>>>>>>>> error_report("vfio: IOMMU container %p can't map guest >>>>>>>>>> IOVA region" >>>>>>>>>> @@ -396,11 +398,11 @@ static void >>>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>>> >>>>>>>>>> trace_vfio_listener_region_add_ram(iova, end, vaddr); >>>>>>>>>> >>>>>>>>>> - ret = vfio_dma_map(container, iova, end - iova + 1, vaddr, >>>>>>>>>> section->readonly); >>>>>>>>>> + ret = vfio_dma_map(container, iova, size, vaddr, section- >>>>>>>>>>> readonly); >>>>>>>>>> if (ret) { >>>>>>>>>> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>>>>>>>>> "0x%"HWADDR_PRIx", %p) = %d (%m)", >>>>>>>>>> - container, iova, end - iova + 1, vaddr, ret); >>>>>>>>>> + container, iova, size, vaddr, ret); >>>>>>>>>> goto fail; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Does that still solve your scenario? Perhaps vfio-iommu-type1 >>>>>>>>>> should >>>>>>>>>> have used first/last rather than start/size for mapping since we >>>>>>>>>> seem >>>>>>>>>> to have an off-by-one for mapping a full 64bit space. Seems like >>>>>>>>>> we >>>>>>>>>> could do it with two calls to vfio_dma_map if we really wanted to. >>>>>>>>>> Thanks, >>>>>>>>>> >>>>>>>>>> Alex >>>>>>>>>> >>>>>>>>> You are right, every try to solve this will push the overflow >>>>>>>>> somewhere >>>>>>>>> else. >>>>>>>>> >>>>>>>>> There is just no way to express 2^64 with 64 bits, we have the >>>>>>>>> int128() >>>>>>>>> solution, >>>>>>>>> but if we solve it here, we fall in the linux ioctl call anyway. >>>>>>>>> >>>>>>>>> Intuitively, making two calls do not seem right to me. >>>>>>>>> >>>>>>>>> But, what do you think of something like: >>>>>>>>> >>>>>>>>> - creating a new VFIO extention >>>>>>>>> >>>>>>>>> - and in ioctl(), since we have a flag entry in the >>>>>>>>> vfio_iommu_type1_dma_map, >>>>>>>>> may be adding a new flag meaning "map all virtual memory" ? >>>>>>>>> or meaning "use first/last" ? >>>>>>>>> I think this would break existing code unless we add a new VFIO >>>>>>>>> extension. >>>>>>>> Backup, is there ever a case where we actually need to map the entire >>>>>>>> 64bit address space? This is fairly well impossible on x86. I'm >>>>>>>> pointing out an issue, but I don't know that we need to solve it with >>>>>>>> more than an assert since it's never likely to happen. Thanks, >>>>>>>> >>>>>>>> Alex >>>>>>>> >>>>>>> If I understood right, IOVA is the IO virtual address, >>>>>>> it is then possible to map the virtual address page 0xffff_ffff_ffff_f000 >>>>>>> to something reasonable inside the real memory. >>>>>> It is. >>>>>> >>>>>>> Eventual we do not need to map the last virtual page but >>>>>>> I think that in a general case the all virtual memory, as viewed by the >>>>>>> device through the IOMMU should be mapped to avoid any uninitialized >>>>>>> virtual memory access. >>>>>> When using vfio, a device only has access to the IOVA space which has >>>>>> been explicitly mapped. This would be a security issue otherwise since >>>>>> kernel vfio can't rely on userspace to wipe the device IOVA space. >>>> yes. >>>>>>> It is the same reason that make us map the all virtual memory for the >>>>>>> CPU MMU. >>>>>> We don't really do that either, CPU mapping works based on page tables >>>>>> and non-existent entries simply don't exist. We don't fully populate >>>>>> the page tables in advance, this would be a horrible waste of memory. >>>> >>>> Alex, >>>> >>>> I am not sure of that, when preparing DMA from the device, the >>>> guest will provide the destination address and these destination addresses >>>> will be translated by the IOMMU when the device start the DMA. >>>> >>>> The guest can make any decision by preparing the DMA and if I have well >>>> understood, >>>> this is transparent to QEMU. >>>> What is not transparent is the IOMMU translation. >>>> >>>> Then, when the device starts the DMA the destination address can be >>>> anything inside >>>> the virtual memory and the IOMMU will translate this. >>>> To be able to translate, a table entry for this virtual address must >>>> exist in the IOMMU >>>> page table. >>>> >>>> If you have several level of page table you may only fill the first level >>>> for all entries, and may be, have only one first level entry initialized >>>> and the belonging >>>> second level entries filled. >>>> Which greatly reduces the size of the tables. >>>> >>>> But if you do not fill one of the first level entry the behavior of the >>>> IOMMU >>>> and then of the DMA is done according to what ever has been left in this >>>> entry. >>> It seems like you're arguing that the guest is going to have a 2^64 bit >>> address space for DMA targets. Sure, the guest driver can program the >>> device to DMA anywhere in that address space, but what should the IOMMU >>> actually consider an valid DMA? It has to be things that have been >>> mapped, like guest physical RAM or peer-to-peer DMA targets. How would >>> the IOMMU handle a stray DMA for anything else? By default the DMA >>> target would not be mapped and the DMA would generate a fault. How >>> that fault is handled is specific to the architecture and platform. If >>> you want to provide the device with a full 2^64 bit address space where >>> nothing will fault, you're going to need to do it via a lot of mappings >>> pointing to the same host physical page. I'm really not sure where >>> we're going with this though. >> >> Alex, >> >> I think we misunderstood us. >> As I do not know if I do not understand you or if you do not understand me, >> I try to explain me better. >> You may find the description obvious, the all is ok. >> >> In the architectures I know, DMA and IOMMU are two separate things. >> IOMMU must initialize table entries for the all space DMA is able to access. >> It must do it, no exception allowed. >> But some, and may be quite all the entries in the table are invalid >> entries, >> existing entries with the invalid bit (or no valid bit) set. >> >> Most, if not all IOMMU have several level of tables. >> Suppose a two level table mapping a 2^32 memory space. >> You have first level with 4k entries with an indirection to the second >> page level. >> Second page level mapping 256 entries of 4k pages. >> >> In this case, to have a valid page table accessing 1 byte somewhere in >> memory you have: >> 4k - 1 entries of first level with invalid bit set >> 1 first level entry which is valid and point to a second level table >> >> in second level table you have 255 entries with invalid bit set >> one entry valid pointing to a page where the byte is. >> >> This just to illustrate what I want to say when I say all entries are >> initialized. >> If one let garbage inside one entry it happens random things... we do >> not want that. >> >> Are we OK with this? > Yes, basic page tables. > >>>>>>> May be I missed something, or may be I worry too much, >>>>>>> but I see this as a restriction on the supported hardware >>>>>>> if we compare host and guest hardware support compatibility. >>>>>> I don't see the issue, there's arguably a bug in the API that doesn't >>>>>> allow us to map the full 64bit IOVA space of a device in a single >>>>>> mapping, but we can do it in two. Besides, there's really no case >>>>>> where a device needs a fully populated IOTLB unless you're actually >>>>>> giving the device access to 16 EMB of memory. >>>>> s/EMB/EB/ Or I suppose technically EiB >>>> yes, I agree with this, we do not need to access so much memory. >>>> >>>>>>> We can live with it, because in fact you are right and today >>>>>>> I am not aware of a hardware wanting to access this page but a >>>>>>> hardware designers knowing having a IOMMU may want to access exactly >>>>>>> this kind of strange virtual page for special features and this would work >>>>>>> on the host but not inside of the guest. >>>>>> The API issue is not that we can't map 0xffff_ffff_ffff_f000, it's that >>>>>> we can't map 0x0 through 0xffff_ffff_ffff_ffff in a single mapping >>>>>> because we pass the size instead of the end address (where size here >>>>>> would be 2^64). We can map 0x0 through 0xffff_ffff_ffff_efff, followed >>>>>> by 0xffff_ffff_ffff_f000 through 0xffff_ffff_ffff_ffff, but again, why >>>>>> would you ever need to do this? Thanks, >>>>>> >>>>>> Alex >>>> The thing is that It could be useful to say we map all the virtual memory. >>> Why? And again, you can do it, just not in a single mapping, which >>> really seems like a theoretical problem since you're mapping the IOVA >>> space of a device, which lives within and consumes a small portion of >>> this address space, so you're pretty much always looking at doing >>> multiple mappings. >> >> My fault, I use the word map indifferently to refer to a valid mapping >> or invalid mapping. >> May be the source of the misunderstood. >> >>> >>>> Having a size of 2^64 was a possibility. >>>> On the other hands, with the actual implementation the >>>> "memory_region_iommu_replay" would take on long long long time. >>>> >>>> In fact, depending on the IOMMU capabilities and usage we do not need to >>>> call >>>> the "memory_region_iommu_replay" at that time or even not at all. >>> The mapping itself would take a long time, even with GiB pages we're >>> talking about populating 2^34 page table entries. >> >> as seen earlier, it depends on the IOMMU architecture, some tables >> can be ignored if the entry of the parent table is invalid. >> And anyway, the host must do it. >> >>> Of course many of >>> those entries would need to point to the same physical page to cover >>> empty space since otherwise you'd need to run this on a processor that >>> supports more than 2^64 bits of address space, but you probably don't >>> want to waste lots of memory covering empty space, so that means a >>> smaller page size, which means orders of magnitude more mappings and >>> more space wasted in the page tables... All of this should have some >>> useful value, which is escaping me. Thanks, >>> >>> Alex >>> >> >> Another source of misunderstanding can be that I may not understand >> the QEMU code as good as I think. >> >> And may be the last (I hope so) source of misunderstanding is that I did >> not explain well >> the goal of this which is the optimization of IOMMU for some architectures: >> >> Depending on the IOMMU architecture you may not be forced to map the memory >> with a valid entry from the beginning. >> In fact you have the following possibility: >> - map all memory with invalid entry, may be the host do it per default >> (he better do) >> - only update entries that the guest updated. >> >> To do this you must intercept guest updates. >> There are several possibilities to do the interception depending on the >> host architecture, >> the IOMMU and the IOMMU-TLB architecture. >> >> In this case, you would map all the memory the DMA may theoretically access >> with invalid entries, you have no need of replaying the IOMMU entries on >> start, and you >> will only updates entries on each DMA_MAP/UNMAP calls. >> >> The migration case would need some kind of IOMMU replay but it is >> another problem. > You've still lost me sorry, when I do not understand I ask until I understand. and since I did not understand it follows stupid questions. Qemu code is sometime hard to understand for me and there is no room for approximations. > on the actual problem you're trying to solve. You > want to invalidate all of the IOTLB entries for a device, but that's > exactly the starting state of an IOMMU domain and you already have the > ability to get back to that state by unmapping the entire address space, > with the caveat that it takes two calls to flush the full 2^64 bit > address space. Thanks, I think that it is the point I did not full understand. I think I understand better, thanks to your patience. > So don't you just need to fix the overflow issue that > you've identified and add a workaround for the API issue? OK, I try to do this. > I don't > understand why you'd ever need to _map_, ie. create valid IOTLB entries, me neither:-) . I misunderstood, I do not need it. > for the entire address space, but if you don't even want to track the > highest DMA address in use to make unmapping more efficient, the tools > are still there for you. Thanks, > > Alex > > thanks you very much for your patience and explanations, I revisit the way to fix the overflow issue and contact you again when I find something proper. Pierre