Date: Tue, 13 Mar 2018 10:56:06 -0600
From: Alex Williamson
Subject: Re: [Qemu-devel] [PATCH qemu v7 2/4] vfio/pci: Relax DMA map errors for MMIO regions
Message-ID: <20180313105606.18d0d8af@w520.home>
In-Reply-To: <8e909bd8-2c24-abde-aa42-fd3f46ff649e@ozlabs.ru>
To: Alexey Kardashevskiy
Cc: David Gibson, qemu-devel@nongnu.org, qemu-ppc@nongnu.org, Auger Eric

[Cc +Eric]

On Tue, 13 Mar 2018 15:53:19 +1100
Alexey Kardashevskiy wrote:

> On 7/3/18 1:17 pm, Alexey Kardashevskiy wrote:
> > On 26/02/18 19:36, Alexey Kardashevskiy wrote:
> >> On 19/02/18 13:46, Alexey Kardashevskiy wrote:
> >>> On 16/02/18 16:28, David Gibson wrote:
> >>>> On Wed, Feb 14, 2018 at 08:55:41AM -0700, Alex Williamson wrote:
> >>>>> On Wed, 14 Feb 2018 19:09:16 +1100
> >>>>> Alexey Kardashevskiy wrote:
> >>>>>
> >>>>>> On 14/02/18 12:33, David Gibson wrote:
> >>>>>>> On Tue, Feb 13, 2018 at 07:20:56PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>> On 13/02/18 16:41, David Gibson wrote:
> >>>>>>>>> On Tue, Feb 13, 2018 at 04:36:30PM +1100, David Gibson wrote:
> >>>>>>>>>> On Tue, Feb 13, 2018 at 12:15:52PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>> On 13/02/18 03:06, Alex Williamson wrote:
> >>>>>>>>>>>> On Mon, 12 Feb 2018 18:05:54 +1100
> >>>>>>>>>>>> Alexey Kardashevskiy wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 12/02/18 16:19, David Gibson wrote:
> >>>>>>>>>>>>>> On Fri, Feb 09, 2018 at 06:55:01PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>>>>>>>>>> At the moment if vfio_memory_listener is registered in the system memory
> >>>>>>>>>>>>>>> address space, it maps/unmaps every RAM memory region for DMA.
> >>>>>>>>>>>>>>> It expects system page size aligned memory sections so vfio_dma_map
> >>>>>>>>>>>>>>> would not fail and so far this has been the case. A mapping failure
> >>>>>>>>>>>>>>> would be fatal. A side effect of such behavior is that some MMIO pages
> >>>>>>>>>>>>>>> would not be mapped silently.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> However we are going to change MSIX BAR handling so we will end up having
> >>>>>>>>>>>>>>> non-aligned sections in vfio_memory_listener (more details are in
> >>>>>>>>>>>>>>> the next patch) and vfio_dma_map will exit QEMU.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> In order to avoid fatal failures on what previously was not a failure and
> >>>>>>>>>>>>>>> was just silently ignored, this checks the section alignment to
> >>>>>>>>>>>>>>> the smallest supported IOMMU page size and prints an error if not aligned;
> >>>>>>>>>>>>>>> it also prints an error if vfio_dma_map failed despite the page size check.
> >>>>>>>>>>>>>>> Both errors are not fatal; only MMIO RAM regions are checked
> >>>>>>>>>>>>>>> (aka "RAM device" regions).
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> If the amount of errors printed is overwhelming, the MSIX relocation
> >>>>>>>>>>>>>>> could be used to avoid excessive error output.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> This is unlikely to cause any behavioral change.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> There are some relatively superficial problems noted below.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> But more fundamentally, this feels like it's extending an existing
> >>>>>>>>>>>>>> hack past the point of usefulness.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The explicit check for is_ram_device() here has always bothered me -
> >>>>>>>>>>>>>> it's not like a real bus bridge magically knows whether a target
> >>>>>>>>>>>>>> address maps to RAM or not.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> What I think is really going on is that even for systems without an
> >>>>>>>>>>>>>> IOMMU, it's not really true to say that the PCI address space maps
> >>>>>>>>>>>>>> directly onto address_space_memory. Instead, there's a large, but
> >>>>>>>>>>>>>> much less than 2^64 sized, "upstream window" at address 0 on the PCI
> >>>>>>>>>>>>>> bus, which is identity mapped to the system bus. Details will vary
> >>>>>>>>>>>>>> with the system, but in practice we expect nothing but RAM to be in
> >>>>>>>>>>>>>> that window. Addresses not within that window won't be mapped to the
> >>>>>>>>>>>>>> system bus but will just be broadcast on the PCI bus and might be
> >>>>>>>>>>>>>> picked up as a p2p transaction.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Currently this p2p works only via the IOMMU, direct p2p is not possible as
> >>>>>>>>>>>>> the guest needs to know physical MMIO addresses to make p2p work and it
> >>>>>>>>>>>>> does not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> /me points to the Direct Translated P2P section of the ACS spec, though
> >>>>>>>>>>>> it's as prone to spoofing by the device as ATS. In any case, p2p
> >>>>>>>>>>>> reflected from the IOMMU is still p2p and offloads the CPU even if
> >>>>>>>>>>>> bandwidth suffers vs bare metal depending on if the data doubles back
> >>>>>>>>>>>> over any links. Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Sure, I was just saying that p2p via IOMMU won't be as simple as broadcast
> >>>>>>>>>>> on the PCI bus, the IOMMU needs to be programmed in advance to make this work,
> >>>>>>>>>>> and currently that broadcast won't work for the passed through devices.
> >>>>>>>>>>
> >>>>>>>>>> Well, sure, p2p in a guest with passthrough devices clearly needs to
> >>>>>>>>>> be translated through the IOMMU (and p2p from a passthrough to an
> >>>>>>>>>> emulated device is essentially impossible).
> >>>>>>>>>>
> >>>>>>>>>> But.. what does that have to do with this code. This is the memory
> >>>>>>>>>> area watcher, looking for memory regions being mapped directly into
> >>>>>>>>>> the PCI space. NOT IOMMU regions, since those are handled separately
> >>>>>>>>>> by wiring up the IOMMU notifier. This will only trigger if RAM-like,
> >>>>>>>>>> non-RAM regions are put into PCI space *not* behind an IOMMU.
> >>>>>>>>>
> >>>>>>>>> Duh, sorry, realised I was mixing up host and guest IOMMU. I guess
> >>>>>>>>> the point here is that this will map RAM-like devices into the host
> >>>>>>>>> IOMMU when there is no guest IOMMU, allowing p2p transactions between
> >>>>>>>>> passthrough devices (though not from passthrough to emulated devices).
> >>>>>>>>
> >>>>>>>> Correct.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> The conditions still seem kind of awkward to me, but I guess it makes
> >>>>>>>>> sense.
> >>>>>>>>
> >>>>>>>> Is it time to split this listener into a RAM listener and a PCI bus listener?
> >>>>>>>
> >>>>>>> I'm not really sure what you mean by that.
> >>>>>>>
> >>>>>>>> On x86 it listens on the "memory" AS, on spapr - on the
> >>>>>>>> "pci@800000020000000" AS, this will just create more confusion over time...
> >>>>>>>
> >>>>>>> Hm, it's still logically the same AS in each case - the PCI address
> >>>>>>> space. It's just that in the x86 case that happens to be the same as
> >>>>>>> the memory AS (at least when there isn't a guest side IOMMU).
> >>>>>>
> >>>>>> Hm. Ok.
> >>>>>>
> >>>>>>> I do wonder if that's really right even for x86 without IOMMU. The
> >>>>>>> PCI address space is identity mapped to the "main" memory address
> >>>>>>> space there, but I'm not sure it's mapped th *all* of the main memory
> >>>>>>
> >>>>>> What should have been in the place of "th" above? :)
> >>>>>>
> >>>>>>> address space, probably just certain ranges of it. That's what I was
> >>>>>>> suggesting earlier in the thread.
> >>>>>>
> >>>>>> afaict vfio is trying to mmap every RAM MR == KVM memory slot, no ranges or
> >>>>>> anything like that. I am trying to figure out what is not correct with the
> >>>>>> patch. Thanks,
> >>>>>
> >>>>> It is possible for x86 systems to have translation applied to MMIO vs
> >>>>> RAM such that the processor view and device view of MMIO are different,
> >>>>> putting them into separate address spaces, but this is not typical and
> >>>>> not available on the class of chipsets that QEMU emulates for PC.
> >>>>> Therefore I think it's correct that MMIO and RAM all live in one big
> >>>>> happy, flat address space as they do now (assuming the guest IOMMU is
> >>>>> not present or disabled). Thanks,
> >>>>
> >>>> Ah.. I think the thing I was missing is that on PC (at least with
> >>>> typical chipsets) *all* MMIO essentially comes from PCI space. Even
> >>>> the legacy devices are essentially ISA which is treated as being on a
> >>>> bridge under the PCI space. On non-x86 there are often at least a
> >>>> handful of MMIO devices that aren't within the PCI space - say, the
> >>>> PCI host bridge itself at least. x86 avoids that by using the
> >>>> separate IO space, which is also a bit weird in that it's
> >>>> simultaneously direct attached to the cpu (and so doesn't need bridge
> >>>> configuration), but also identified with the legacy IO space treated as
> >>>> being under two bridges (host->PCI, PCI->ISA) for other purposes.
> >>>>
> >>>> Maybe it's just me, but I find it makes more sense to me if I think of
> >>>> it as the two ASes being equal because on PC the system address map
> >>>> (address_space_memory) is made equal to the PCI address map, rather
> >>>> than the PCI address map being made equal to the system one.
> >>>
> >>> It makes more sense to me too, we just do not exploit/expose this on SPAPR
> >>> - the PCI address space only has 2xIOMMU and 1xMSI windows and that's it,
> >>> no MMIO which is mapped to the system AS only (adding those to the PCI AS
> >>> is a little tricky as mentioned elsewhere).
> >>>
> >>> Anyway, I still wonder - what needs to be addressed in the 2/4 patch in
> >>> order to proceed?
> >>
> >> Ping?
> >>
> >
> >
> > Ping?
> >
> >
>
>
> Could anyone please enlighten me what needs to be done with this patchset
> to proceed, or confirm that it is ok but not going anywhere as everybody is
> busy with other things? :) Thanks,

Actually making sure it compiles would have been nice:

qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_add’:
qemu.git/hw/vfio/common.c:550:40: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
     if ((iova & pgmask) || (llsize & pgmask)) {
                                        ^
qemu.git/hw/vfio/common.c: In function ‘vfio_listener_region_del’:
qemu.git/hw/vfio/common.c:669:50: error: invalid operands to binary & (have ‘Int128 {aka struct Int128}’ and ‘hwaddr {aka long long unsigned int}’)
     try_unmap = !((iova & pgmask) || (llsize & pgmask));
                                              ^

Clearly llsize needs to be wrapped in int128_get64() here.  Should I
presume that testing of this patch is equally lacking?

Eric, have you done any testing of this?  I think this was fixing a
mapping issue you reported.  Thanks,

Alex