From: Alexey Kardashevskiy
To: David Gibson
Cc: Paul Mackerras, Alex Williamson, qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU
Date: Tue, 29 Mar 2016 16:44:04 +1100
Message-ID: <56FA1624.2060307@ozlabs.ru>
In-Reply-To: <20160329053042.GG15156@voom.fritz.box>

On 03/29/2016 04:30 PM, David Gibson wrote:
> On Thu, Mar 24, 2016 at 08:10:44PM +1100, Alexey Kardashevskiy wrote:
>> On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote:
>>> On 03/23/2016 05:03 PM, David Gibson wrote:
>>>> On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote:
>>>>> On 03/23/2016 01:53 PM, David Gibson wrote:
>>>>>> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote:
>>>>>>> On 03/23/2016 12:08 PM, David Gibson wrote:
>>>>>>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>> On 03/22/2016 04:14 PM, David Gibson wrote:
>>>>>>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote:
>>>>>>>>>>> The new VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window
>>>>>>>>>>> management.
>>>>>>>>>>> This adds the ability for VFIO common code to dynamically allocate/remove
>>>>>>>>>>> DMA windows in the host kernel when a new VFIO container is
>>>>>>>>>>> added/removed.
>>>>>>>>>>>
>>>>>>>>>>> This adds a VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add
>>>>>>>>>>> and adds the just-created IOMMU into the host IOMMU list; the opposite
>>>>>>>>>>> action is taken in vfio_listener_region_del.
>>>>>>>>>>>
>>>>>>>>>>> When creating a new window, this uses a heuristic to decide on the
>>>>>>>>>>> number of TCE table levels.
>>>>>>>>>>>
>>>>>>>>>>> This should cause no guest-visible change in behavior.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy
>>>>>>>>>>> ---
>>>>>>>>>>> Changes:
>>>>>>>>>>> v14:
>>>>>>>>>>> * new to the series
>>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> TODO:
>>>>>>>>>>> * export levels to PHB
>>>>>>>>>>> ---
>>>>>>>>>>>  hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>>>>>>>>>  trace-events     |   2 ++
>>>>>>>>>>>  2 files changed, 105 insertions(+), 5 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>>>>>>>>>>> index 4e873b7..421d6eb 100644
>>>>>>>>>>> --- a/hw/vfio/common.c
>>>>>>>>>>> +++ b/hw/vfio/common.c
>>>>>>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container,
>>>>>>>>>>>      return 0;
>>>>>>>>>>>  }
>>>>>>>>>>>
>>>>>>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova)
>>>>>>>>>>> +{
>>>>>>>>>>> +    VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000);
>>>>>>>>>>
>>>>>>>>>> The hard-coded 0x1000 looks dubious..
>>>>>>>>>
>>>>>>>>> Well, that's the minimal page size...
>>>>>>>>
>>>>>>>> Really? Some BookE CPUs support 1KiB page size..
>>>>>>>
>>>>>>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :)
>>>>>>
>>>>>> Uh.. actually I don't think those CPUs generally had an IOMMU. But if
>>>>>> it's been done for CPU MMU I wouldn't count on it not being done for
>>>>>> IOMMU.
>>>>>>
>>>>>> 1 is a safer choice.
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>>> +    g_assert(hiommu);
>>>>>>>>>>> +    QLIST_REMOVE(hiommu, hiommu_next);
>>>>>>>>>>> +}
>>>>>>>>>>> +
>>>>>>>>>>>  static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>>>>>>>>>>>  {
>>>>>>>>>>>      return (!memory_region_is_ram(section->mr) &&
>>>>>>>>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener,
>>>>>>>>>>>      }
>>>>>>>>>>>      end = int128_get64(llend);
>>>>>>>>>>>
>>>>>>>>>>> +    if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) {
>>>>>>>>>>
>>>>>>>>>> I think this would be clearer split out into a helper function,
>>>>>>>>>> vfio_create_host_window() or something.
>>>>>>>>>
>>>>>>>>> It is rather vfio_spapr_create_host_window() and we were avoiding
>>>>>>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a
>>>>>>>>> separate file but this usually triggers more discussion and never
>>>>>>>>> ends well.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>> +        unsigned entries, pages;
>>>>>>>>>>> +        struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };
>>>>>>>>>>> +
>>>>>>>>>>> +        g_assert(section->mr->iommu_ops);
>>>>>>>>>>> +        g_assert(memory_region_is_iommu(section->mr));
>>>>>>>>>>
>>>>>>>>>> I don't think you need these asserts. AFAICT the same logic should
>>>>>>>>>> work if a RAM MR was added directly to PCI address space - this would
>>>>>>>>>> create the new host window, then the existing code for adding a RAM MR
>>>>>>>>>> would map that block of RAM statically into the new window.
>>>>>>>>>
>>>>>>>>> In what configuration/machine can we do that on SPAPR?
>>>>>>>>
>>>>>>>> spapr guests won't ever do that. But you can run an x86 guest on a
>>>>>>>> powernv host and this situation could come up.
>>>>>>>
>>>>>>> I am pretty sure VFIO won't work in this case anyway.
>>>>>>
>>>>>> I'm not. There's no fundamental reason VFIO shouldn't work with TCG.
>>>>>
>>>>> This is not about TCG (a pseries TCG guest works with VFIO on a powernv host),
>>>>> this is about things like VFIO_IOMMU_GET_INFO vs.
>>>>> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls, but yes, fundamentally, it can work.
>>>>>
>>>>> Should I add such support in this patchset?
>>>>
>>>> Unless adding the generality is really complex, and so far I haven't
>>>> seen a reason for it to be.
>>>
>>> Seriously? :(
>>
>>
>> So, I tried.
>>
>> With the q35 machine (pc-i440fx-2.6 has an even worse memory tree), there are
>> several RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, then
>> a gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after 4GB
>> the rest of the RAM:
>>
>> memory-region: system
>>  0000000000000000-ffffffffffffffff (prio 0, RW): system
>>   0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff
>>   0000000000000000-ffffffffffffffff (prio -1, RW): pci
>>    00000000000c0000-00000000000dffff (prio 1, RW): pc.rom
>>    00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios @pc.bios 0000000000020000-000000000003ffff
>>    00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0
>>     00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0 mmaps[0]
>>    00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2
>>     00000000febf0000-00000000febf007f (prio 0, RW): msix-table
>>     00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled]
>>    00000000febf2000-00000000febf2fff (prio 1, RW): ahci
>>    00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios
>> [...]
>>  0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g @pc.ram 0000000080000000-00000001ffffffff
>>
>>
>> Ok, let's say we filter out "pc.bios", "pc.rom" and BARs (aka "skip dump"
>> regions) and we end up with RAM below 2GB and above 4GB - 2 windows.
>> The problem is that a window needs to be aligned to a power of two.
>
> Window size? That should be ok - you can just round up the requested
> size to a power of two. We don't have to populate the whole host window.
>
>> Ok, I change my test from -m8G to -m10G to have 2 aligned regions (2GB and
>> 8GB); the next problem is that the second window needs to start at 1<<59,
>> not at 4G or any other random offset.
>
> Yeah, that's harder. With the IODA is it always that the 32-bit
> window is at 0 and the 64-bit window is at 1<<59, or can you have a
> 64-bit window at 0?

I can have a single window at 0. This is why my (hacked) version works at
all. IODA2 allows having 2 windows; each window can be backed by a TCE
table with a variable page size or be a bypass, and the windows' capabilities
are exactly the same.

> If it's the second, then this is theoretically possible, but the host
> kernel - on seeing the second window request - would need to see that,
> instead of adding a new host window, it can extend the first host window
> so that it covers both guest windows.
> That does sound reasonably tricky and I don't think we should try to
> implement it any time soon. BUT - we should design our *interfaces*
> so that it's reasonable to add that in future.
>
>> Ok, I change my test to start with -m1G to have a single window.
>> Now it fails because here is what happens:
>> region_add: pc.ram: 0 40000000
>> region_del: pc.ram: 0 40000000
>> region_add: pc.ram: 0 c0000
>> The second "add" is not a power of two -> fail. And I cannot avoid this -
>> "pc.rom" is still there, it is a separate region, so RAM gets split into
>> smaller chunks. I do not know how to fix this properly.

Still, how do I fix this? Do I need to fix it now? Practically, when do I
create this single huge window?
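
(For illustration only - a rough, untested sketch of the rounding you
suggest, not the actual patch; it assumes QEMU's pow2ceil()/ctz64() helpers
and that "iova", "end" and "pagesize" describe the section being added:)

    hwaddr win_size = pow2ceil(end - iova);
    struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) };

    create.window_size = win_size;        /* rounded up, not the raw section size */
    create.page_shift = ctz64(pagesize);  /* effective IOMMU page size */
    create.levels = 1;                    /* the levels heuristic would go here */
    if (ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create)) {
        error_report("Failed to create a window, errno = %d", errno);
    }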
>>
>>
>> So, in order to make it (x86/tcg on a powernv host) work at all, I had to
>> create a single huge window (as it is a single window, it starts from @0)
>> unconditionally in vfio_connect_container(), not in
>> vfio_listener_region_add(); and I added filtering (skip "pc.bios", "pc.rom",
>> BARs - these things start above 1GB) and then I could boot an x86_64 guest
>> and even pass through a Mellanox ConnectX-3; it would bring 2 interfaces up
>> and dhclient assigned IPs to them (it is quite amusing that it even works).
>>
>>
>> So, I think I will replace the assert() with:
>>
>> unsigned pagesize = qemu_real_host_page_size;
>> if (section->mr->iommu_ops) {
>>     pagesize = section->mr->iommu_ops->get_page_sizes(section->mr);
>> }
>>
>> but there is no practical use for this anyway.
>
> So, I think what you need here is to compute the effective IOMMU page
> size (see the other email for details). That will be getrampagesize() for
> RAM regions, and the minimum of that and the guest IOMMU page size for
> IOMMU regions.


-- 
Alexey
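
P.S. To make sure I understand - a rough, untested sketch of that "effective
IOMMU page size" computation (getrampagesize() is the existing helper from
target-ppc/kvm.c, which would need exporting; get_page_sizes() is the callback
from my snippet above; the name vfio_effective_page_size() is made up):

static uint64_t vfio_effective_page_size(MemoryRegionSection *section)
{
    uint64_t pagesize = getrampagesize();   /* host RAM (huge)page size */

    if (memory_region_is_iommu(section->mr) && section->mr->iommu_ops) {
        /* for guest IOMMU regions, do not exceed the guest IOMMU page size */
        pagesize = MIN(pagesize,
                       section->mr->iommu_ops->get_page_sizes(section->mr));
    }
    return pagesize;
}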