From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:59648) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aj1I8-0004hj-IC for qemu-devel@nongnu.org; Thu, 24 Mar 2016 05:10:58 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aj1I3-0002TM-Rv for qemu-devel@nongnu.org; Thu, 24 Mar 2016 05:10:56 -0400 Received: from mail-pa0-x244.google.com ([2607:f8b0:400e:c03::244]:35064) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aj1I3-0002T3-HL for qemu-devel@nongnu.org; Thu, 24 Mar 2016 05:10:51 -0400 Received: by mail-pa0-x244.google.com with SMTP id fl4so2227950pad.2 for ; Thu, 24 Mar 2016 02:10:50 -0700 (PDT) References: <1458546426-26222-1-git-send-email-aik@ozlabs.ru> <1458546426-26222-18-git-send-email-aik@ozlabs.ru> <20160322051449.GG23586@voom.redhat.com> <56F0DDFF.7050308@ozlabs.ru> <20160323010844.GO23586@voom.redhat.com> <56F1FBAB.6090308@ozlabs.ru> <20160323025315.GS23586@voom.redhat.com> <56F2083C.9050200@ozlabs.ru> <20160323060338.GV23586@voom.redhat.com> <56F32ED3.8030700@ozlabs.ru> From: Alexey Kardashevskiy Message-ID: <56F3AF14.50907@ozlabs.ru> Date: Thu, 24 Mar 2016 20:10:44 +1100 MIME-Version: 1.0 In-Reply-To: <56F32ED3.8030700@ozlabs.ru> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: Paul Mackerras , Alex Williamson , qemu-ppc@nongnu.org, qemu-devel@nongnu.org On 03/24/2016 11:03 AM, Alexey Kardashevskiy wrote: > On 03/23/2016 05:03 PM, David Gibson wrote: >> On Wed, Mar 23, 2016 at 02:06:36PM +1100, Alexey Kardashevskiy wrote: >>> On 03/23/2016 01:53 PM, David Gibson wrote: >>>> On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote: >>>>> On 03/23/2016 12:08 PM, David Gibson wrote: >>>>>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote: >>>>>>> On 03/22/2016 04:14 PM, David Gibson wrote: >>>>>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote: >>>>>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window >>>>>>>>> management. >>>>>>>>> This adds ability to VFIO common code to dynamically allocate/remove >>>>>>>>> DMA windows in the host kernel when new VFIO container is >>>>>>>>> added/removed. >>>>>>>>> >>>>>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to >>>>>>>>> vfio_listener_region_add >>>>>>>>> and adds just created IOMMU into the host IOMMU list; the opposite >>>>>>>>> action is taken in vfio_listener_region_del. >>>>>>>>> >>>>>>>>> When creating a new window, this uses euristic to decide on the >>>>>>>>> TCE table >>>>>>>>> levels number. >>>>>>>>> >>>>>>>>> This should cause no guest visible change in behavior. >>>>>>>>> >>>>>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>>>>> --- >>>>>>>>> Changes: >>>>>>>>> v14: >>>>>>>>> * new to the series >>>>>>>>> >>>>>>>>> --- >>>>>>>>> TODO: >>>>>>>>> * export levels to PHB >>>>>>>>> --- >>>>>>>>> hw/vfio/common.c | 108 >>>>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++--- >>>>>>>>> trace-events | 2 ++ >>>>>>>>> 2 files changed, 105 insertions(+), 5 deletions(-) >>>>>>>>> >>>>>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>>>>> index 4e873b7..421d6eb 100644 >>>>>>>>> --- a/hw/vfio/common.c >>>>>>>>> +++ b/hw/vfio/common.c >>>>>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer >>>>>>>>> *container, >>>>>>>>> return 0; >>>>>>>>> } >>>>>>>>> >>>>>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr >>>>>>>>> min_iova) >>>>>>>>> +{ >>>>>>>>> + VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, >>>>>>>>> min_iova, 0x1000); >>>>>>>> >>>>>>>> The hard-coded 0x1000 looks dubious.. >>>>>>> >>>>>>> Well, that's the minimal page size... >>>>>> >>>>>> Really? Some BookE CPUs support 1KiB page size.. >>>>> >>>>> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :) >>>> >>>> Uh.. actually I don't think those CPUs generally had an IOMMU. But if >>>> it's been done for CPU MMU I wouldn't count on it not being done for >>>> IOMMU. >>>> >>>> 1 is a safer choice. >>>> >>>>> >>>>> >>>>>> >>>>>>>>> + g_assert(hiommu); >>>>>>>>> + QLIST_REMOVE(hiommu, hiommu_next); >>>>>>>>> +} >>>>>>>>> + >>>>>>>>> static bool vfio_listener_skipped_section(MemoryRegionSection >>>>>>>>> *section) >>>>>>>>> { >>>>>>>>> return (!memory_region_is_ram(section->mr) && >>>>>>>>> @@ -392,6 +400,61 @@ static void >>>>>>>>> vfio_listener_region_add(MemoryListener *listener, >>>>>>>>> } >>>>>>>>> end = int128_get64(llend); >>>>>>>>> >>>>>>>>> + if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { >>>>>>>> >>>>>>>> I think this would be clearer split out into a helper function, >>>>>>>> vfio_create_host_window() or something. >>>>>>> >>>>>>> >>>>>>> It is rather vfio_spapr_create_host_window() and we were avoiding >>>>>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a >>>>>>> separate file but this usually triggers more discussion and never >>>>>>> ends well. >>>>>>> >>>>>>> >>>>>>> >>>>>>>>> + unsigned entries, pages; >>>>>>>>> + struct vfio_iommu_spapr_tce_create create = { .argsz = >>>>>>>>> sizeof(create) }; >>>>>>>>> + >>>>>>>>> + g_assert(section->mr->iommu_ops); >>>>>>>>> + g_assert(memory_region_is_iommu(section->mr)); >>>>>>>> >>>>>>>> I don't think you need these asserts. AFAICT the same logic should >>>>>>>> work if a RAM MR was added directly to PCI address space - this would >>>>>>>> create the new host window, then the existing code for adding a RAM MR >>>>>>>> would map that block of RAM statically into the new window. >>>>>>> >>>>>>> In what configuration/machine can we do that on SPAPR? >>>>>> >>>>>> spapr guests won't ever do that. But you can run an x86 guest on a >>>>>> powernv host and this situation could come up. >>>>> >>>>> >>>>> I am pretty sure VFIO won't work in this case anyway. >>>> >>>> I'm not. There's no fundamental reason VFIO shouldn't work with TCG. >>> >>> This is not about TCG (pseries TCG guest works with VFIO on powernv host), >>> this is about things like VFIO_IOMMU_GET_INFO vs. >>> VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work. >>> >>> Should I add such support in this patchset? >> >> Unless adding the generality is really complex, and so far I haven't >> seen a reason for it to be. > > Seriously? :( So, I tried. With q35 machine (pc-i440fx-2.6 have even worse memory tree), there are several RAM blocks - 0..0xc0000, then pc.rom, then RAM again till 2GB, then a gap (PCI MMIO?), then PC BIOS at 0xfffc0000 (which is RAM), then after 4GB the rest of the RAM: memory-region: system 0000000000000000-ffffffffffffffff (prio 0, RW): system 0000000000000000-000000007fffffff (prio 0, RW): alias ram-below-4g @pc.ram 0000000000000000-000000007fffffff 0000000000000000-ffffffffffffffff (prio -1, RW): pci 00000000000c0000-00000000000dffff (prio 1, RW): pc.rom 00000000000e0000-00000000000fffff (prio 1, R-): alias isa-bios @pc.bios 0000000000020000-000000000003ffff 00000000febe0000-00000000febeffff (prio 1, RW): 0003:09:00.0 BAR 0 00000000febe0000-00000000febeffff (prio 0, RW): 0003:09:00.0 BAR 0 mmaps[0] 00000000febf0000-00000000febf1fff (prio 1, RW): 0003:09:00.0 BAR 2 00000000febf0000-00000000febf007f (prio 0, RW): msix-table 00000000febf1000-00000000febf1007 (prio 0, RW): msix-pba [disabled] 00000000febf2000-00000000febf2fff (prio 1, RW): ahci 00000000fffc0000-00000000ffffffff (prio 0, R-): pc.bios [...] 0000000100000000-000000027fffffff (prio 0, RW): alias ram-above-4g @pc.ram 0000000080000000-00000001ffffffff Ok, let's say we filter out "pc.bios", "pc.rom", BARs (aka "skip dump" regions) and we end up having RAM below 2GB and above 4GB, 2 windows. Problem is a window needs to be aligned to power of two. Ok, I change my test from -m8G to -m10G to have 2 aligned regions - 2GB and 8GB), next problem - the second window needs to start from 1<<<59, not 4G or any other random offset. Ok, I change my test to start with -m1G to have a single window. Now it fails because here is what happens: region_add: pc.ram: 0 40000000 region_del: pc.ram: 0 40000000 region_add: pc.ram: 0 c0000 The second "add" is not power of two -> fail. And I cannot avoid this - "pc.rom" is still there, it is a separate region so RAM gets split into smaller chunks. I do not know to to fix this properly. So, in order to make it (x86/tcg on powernv host) work anyhow, I had to create a single huge window (as it is a single window - it starts from @0) unconditionally in vfio_connect_container(), not in vfio_listener_region_add(); and I added filtering (skip "pc.bios", "pc.rom", BARs - these things start above 1GB) and then I could boot a x86_64 guest and even pass Mellanox ConnextX3, it would bring 2 interfaces up and dhclient assigned IPs to them (which is quite amusing that it even works). So, I think I will replace assert() with: unsigned pagesize = qemu_real_host_page_size; if (section->mr->iommu_ops) { pagesize = section->mr->iommu_ops->get_page_sizes(section->mr); } but there is no practical use for this anyway. What do I miss and what do I need to try more to proceed with this patch? Thanks. -- Alexey