From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:45341) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aiZ8B-0000YP-Fd for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:48 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aiZ86-0005me-M4 for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:47 -0400 Received: from mail-pf0-x243.google.com ([2607:f8b0:400e:c00::243]:34232) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aiZ86-0005mX-9W for qemu-devel@nongnu.org; Tue, 22 Mar 2016 23:06:42 -0400 Received: by mail-pf0-x243.google.com with SMTP id n5so529861pfn.1 for ; Tue, 22 Mar 2016 20:06:42 -0700 (PDT) References: <1458546426-26222-1-git-send-email-aik@ozlabs.ru> <1458546426-26222-18-git-send-email-aik@ozlabs.ru> <20160322051449.GG23586@voom.redhat.com> <56F0DDFF.7050308@ozlabs.ru> <20160323010844.GO23586@voom.redhat.com> <56F1FBAB.6090308@ozlabs.ru> <20160323025315.GS23586@voom.redhat.com> From: Alexey Kardashevskiy Message-ID: <56F2083C.9050200@ozlabs.ru> Date: Wed, 23 Mar 2016 14:06:36 +1100 MIME-Version: 1.0 In-Reply-To: <20160323025315.GS23586@voom.redhat.com> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v14 17/18] vfio/spapr: Use VFIO_SPAPR_TCE_v2_IOMMU List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: Alex Williamson , qemu-ppc@nongnu.org, qemu-devel@nongnu.org On 03/23/2016 01:53 PM, David Gibson wrote: > On Wed, Mar 23, 2016 at 01:12:59PM +1100, Alexey Kardashevskiy wrote: >> On 03/23/2016 12:08 PM, David Gibson wrote: >>> On Tue, Mar 22, 2016 at 04:54:07PM +1100, Alexey Kardashevskiy wrote: >>>> On 03/22/2016 04:14 PM, David Gibson wrote: >>>>> On Mon, Mar 21, 2016 at 06:47:05PM +1100, Alexey Kardashevskiy wrote: >>>>>> New VFIO_SPAPR_TCE_v2_IOMMU type supports dynamic DMA window management. >>>>>> This adds ability to VFIO common code to dynamically allocate/remove >>>>>> DMA windows in the host kernel when new VFIO container is added/removed. >>>>>> >>>>>> This adds VFIO_IOMMU_SPAPR_TCE_CREATE ioctl to vfio_listener_region_add >>>>>> and adds just created IOMMU into the host IOMMU list; the opposite >>>>>> action is taken in vfio_listener_region_del. >>>>>> >>>>>> When creating a new window, this uses euristic to decide on the TCE table >>>>>> levels number. >>>>>> >>>>>> This should cause no guest visible change in behavior. >>>>>> >>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>> --- >>>>>> Changes: >>>>>> v14: >>>>>> * new to the series >>>>>> >>>>>> --- >>>>>> TODO: >>>>>> * export levels to PHB >>>>>> --- >>>>>> hw/vfio/common.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++++++--- >>>>>> trace-events | 2 ++ >>>>>> 2 files changed, 105 insertions(+), 5 deletions(-) >>>>>> >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>> index 4e873b7..421d6eb 100644 >>>>>> --- a/hw/vfio/common.c >>>>>> +++ b/hw/vfio/common.c >>>>>> @@ -279,6 +279,14 @@ static int vfio_host_iommu_add(VFIOContainer *container, >>>>>> return 0; >>>>>> } >>>>>> >>>>>> +static void vfio_host_iommu_del(VFIOContainer *container, hwaddr min_iova) >>>>>> +{ >>>>>> + VFIOHostIOMMU *hiommu = vfio_host_iommu_lookup(container, min_iova, 0x1000); >>>>> >>>>> The hard-coded 0x1000 looks dubious.. >>>> >>>> Well, that's the minimal page size... >>> >>> Really? Some BookE CPUs support 1KiB page size.. >> >> Hm. For IOMMU? Ok. s/0x1000/1/ should do then :) > > Uh.. actually I don't think those CPUs generally had an IOMMU. But if > it's been done for CPU MMU I wouldn't count on it not being done for > IOMMU. > > 1 is a safer choice. > >> >> >>> >>>>>> + g_assert(hiommu); >>>>>> + QLIST_REMOVE(hiommu, hiommu_next); >>>>>> +} >>>>>> + >>>>>> static bool vfio_listener_skipped_section(MemoryRegionSection *section) >>>>>> { >>>>>> return (!memory_region_is_ram(section->mr) && >>>>>> @@ -392,6 +400,61 @@ static void vfio_listener_region_add(MemoryListener *listener, >>>>>> } >>>>>> end = int128_get64(llend); >>>>>> >>>>>> + if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { >>>>> >>>>> I think this would be clearer split out into a helper function, >>>>> vfio_create_host_window() or something. >>>> >>>> >>>> It is rather vfio_spapr_create_host_window() and we were avoiding >>>> xxx_spapr_xxx so far. I'd cut-n-paste the SPAPR PCI AS listener to a >>>> separate file but this usually triggers more discussion and never ends well. >>>> >>>> >>>> >>>>>> + unsigned entries, pages; >>>>>> + struct vfio_iommu_spapr_tce_create create = { .argsz = sizeof(create) }; >>>>>> + >>>>>> + g_assert(section->mr->iommu_ops); >>>>>> + g_assert(memory_region_is_iommu(section->mr)); >>>>> >>>>> I don't think you need these asserts. AFAICT the same logic should >>>>> work if a RAM MR was added directly to PCI address space - this would >>>>> create the new host window, then the existing code for adding a RAM MR >>>>> would map that block of RAM statically into the new window. >>>> >>>> In what configuration/machine can we do that on SPAPR? >>> >>> spapr guests won't ever do that. But you can run an x86 guest on a >>> powernv host and this situation could come up. >> >> >> I am pretty sure VFIO won't work in this case anyway. > > I'm not. There's no fundamental reason VFIO shouldn't work with TCG. This is not about TCG (pseries TCG guest works with VFIO on powernv host), this is about things like VFIO_IOMMU_GET_INFO vs. VFIO_IOMMU_SPAPR_TCE_GET_INFO ioctls but yes, fundamentally, it can work. Should I add such support in this patchset? > >>> In any case there's no point asserting if the code is correct anyway. >> >> Assert here says (at least) "not tested" or "not expected to >> happen". > > Hmmm.. > >> >> >>> >>>>>> + trace_vfio_listener_region_add_iommu(iova, end - 1); >>>>>> + /* >>>>>> + * FIXME: For VFIO iommu types which have KVM acceleration to >>>>>> + * avoid bouncing all map/unmaps through qemu this way, this >>>>>> + * would be the right place to wire that up (tell the KVM >>>>>> + * device emulation the VFIO iommu handles to use). >>>>>> + */ >>>>>> + create.window_size = memory_region_size(section->mr); >>>>>> + create.page_shift = >>>>>> + ctz64(section->mr->iommu_ops->get_page_sizes(section->mr)); >>>>> >>>>> Ah.. except that I guess you'd need to fall back to host page size >>>>> here to handle a RAM MR. >>>> >>>> Can you give an example of such RAM MR being added to PCI AS on >>>> SPAPR? >>> >>> On spapr, no. But you can run other machine types as guests (at least >>> with TCG) on a host with the spapr IOMMU. >>> >>>>>> + /* >>>>>> + * SPAPR host supports multilevel TCE tables, there is some >>>>>> + * euristic to decide how many levels we want for our table: >>>>>> + * 0..64 = 1; 65..4096 = 2; 4097..262144 = 3; 262145.. = 4 >>>>>> + */ >>>>>> + entries = create.window_size >> create.page_shift; >>>>>> + pages = (entries * sizeof(uint64_t)) / getpagesize(); >>>>>> + create.levels = ctz64(pow2ceil(pages) - 1) / 6 + 1; >>>>>> + >>>>>> + ret = ioctl(container->fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create); >>>>>> + if (ret) { >>>>>> + error_report("Failed to create a window, ret = %d (%m)", ret); >>>>>> + goto fail; >>>>>> + } >>>>>> + >>>>>> + if (create.start_addr != section->offset_within_address_space || >>>>>> + vfio_host_iommu_lookup(container, create.start_addr, >>>>>> + create.start_addr + create.window_size - 1)) { >>>>> >>>>> Under what circumstances can this trigger? Is the kernel ioctl >>>>> allowed to return a different window start address than the one >>>>> requested? >>>> >>>> You already asked this some time ago :) The userspace cannot request >>>> address, the host kernel returns one. >>> >>> Ok. For generality it would be nice if you could succeed here as long >>> as the new host window covers the requested guest window, even if it >>> doesn't match exactly. And for that matter to not request the new >>> window if the host already has a window covering the guest region. >> >> >> That would be dead code - when would it possibly work? I mean I could >> instrument an artificial test but the actual user which might appear later >> will likely be soooo different so this won't help anyway. > > Hmm, I suppose. It actually shouldn't be that hard to trigger a case > like this, if you just bumped the bridge's dma64 base address property > up a little bit - above the host kernel's base address, but small > enough that you can still easily fit the guest memory in. I can test it today for sure but once committed, we will have to support it. Which I am trying to avoid until we get clear picture what we are supporting here. -- Alexey