From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48421) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agU9g-0008P0-9D for qemu-devel@nongnu.org; Thu, 17 Mar 2016 05:23:46 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1agU9d-0007xF-Dn for qemu-devel@nongnu.org; Thu, 17 Mar 2016 05:23:44 -0400 Received: from mail-pf0-x243.google.com ([2607:f8b0:400e:c00::243]:35826) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agU9c-0007x9-Uk for qemu-devel@nongnu.org; Thu, 17 Mar 2016 05:23:41 -0400 Received: by mail-pf0-x243.google.com with SMTP id u190so10171761pfb.2 for ; Thu, 17 Mar 2016 02:23:40 -0700 (PDT) References: <1456823441-46757-1-git-send-email-aik@ozlabs.ru> <1456823441-46757-12-git-send-email-aik@ozlabs.ru> <20160303063013.GL1620@voom.redhat.com> <56E7793C.3000209@ozlabs.ru> <20160315054230.GH15272@voom.fritz.box> <56EA3ADD.8010704@ozlabs.ru> <20160317061034.GN9032@voom> From: Alexey Kardashevskiy Message-ID: <56EA7797.5060706@ozlabs.ru> Date: Thu, 17 Mar 2016 20:23:35 +1100 MIME-Version: 1.0 In-Reply-To: <20160317061034.GN9032@voom> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [Qemu-ppc] [PATCH qemu v13 11/16] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson Cc: Alex Williamson , qemu-ppc@nongnu.org, qemu-devel@nongnu.org On 03/17/2016 05:10 PM, David Gibson wrote: > On Thu, Mar 17, 2016 at 04:04:29PM +1100, Alexey Kardashevskiy wrote: >> On 03/15/2016 04:42 PM, David Gibson wrote: >>> On Tue, Mar 15, 2016 at 01:53:48PM +1100, Alexey Kardashevskiy wrote: >>>> On 03/03/2016 05:30 PM, David Gibson wrote: >>>>> On Tue, Mar 01, 2016 at 08:10:36PM +1100, Alexey Kardashevskiy wrote: >>>>>> This makes use of the new "memory registering" feature. The idea is >>>>>> to provide the userspace ability to notify the host kernel about pages >>>>>> which are going to be used for DMA. Having this information, the host >>>>>> kernel can pin them all once per user process, do locked pages >>>>>> accounting (once) and not spent time on doing that in real time with >>>>>> possible failures which cannot be handled nicely in some cases. >>>>>> >>>>>> This adds a prereg memory listener which listens on address_space_memory >>>>>> and notifies a VFIO container about memory which needs to be >>>>>> pinned/unpinned. VFIO MMIO regions (i.e. "skip dump" regions) are skipped. >>>>>> >>>>>> As there is no per-IOMMU-type release() callback anymore, this stores >>>>>> the IOMMU type in the container so vfio_listener_release() can device >>>>>> if it needs to unregister @prereg_listener. >>>>>> >>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes >>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does >>>>>> not call it when v2 is detected and enabled. >>>>>> >>>>>> This does not change the guest visible interface. >>>>>> >>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>> --- >>>>>> hw/vfio/Makefile.objs | 1 + >>>>>> hw/vfio/common.c | 39 +++++++++--- >>>>>> hw/vfio/prereg.c | 138 ++++++++++++++++++++++++++++++++++++++++++ >>>>>> include/hw/vfio/vfio-common.h | 4 ++ >>>>>> trace-events | 2 + >>>>>> 5 files changed, 175 insertions(+), 9 deletions(-) >>>>>> create mode 100644 hw/vfio/prereg.c >>>>>> >>>>>> diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs >>>>>> index ceddbb8..5800e0e 100644 >>>>>> --- a/hw/vfio/Makefile.objs >>>>>> +++ b/hw/vfio/Makefile.objs >>>>>> @@ -4,4 +4,5 @@ obj-$(CONFIG_PCI) += pci.o pci-quirks.o >>>>>> obj-$(CONFIG_SOFTMMU) += platform.o >>>>>> obj-$(CONFIG_SOFTMMU) += calxeda-xgmac.o >>>>>> obj-$(CONFIG_SOFTMMU) += amd-xgbe.o >>>>>> +obj-$(CONFIG_SOFTMMU) += prereg.o >>>>>> endif >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>> index 3aaa6b5..f2a03e0 100644 >>>>>> --- a/hw/vfio/common.c >>>>>> +++ b/hw/vfio/common.c >>>>>> @@ -531,6 +531,9 @@ static const MemoryListener vfio_iommu_listener = { >>>>>> static void vfio_listener_release(VFIOContainer *container) >>>>>> { >>>>>> memory_listener_unregister(&container->iommu_listener.listener); >>>>>> + if (container->iommu_type == VFIO_SPAPR_TCE_v2_IOMMU) { >>>>>> + memory_listener_unregister(&container->prereg_listener.listener); >>>>>> + } >>>>>> } >>>>>> >>>>>> int vfio_mmap_region(Object *obj, VFIORegion *region, >>>>>> @@ -722,8 +725,8 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as) >>>>>> goto free_container_exit; >>>>>> } >>>>>> >>>>>> - ret = ioctl(fd, VFIO_SET_IOMMU, >>>>>> - v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU); >>>>>> + container->iommu_type = v2 ? VFIO_TYPE1v2_IOMMU : VFIO_TYPE1_IOMMU; >>>>>> + ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type); >>>>>> if (ret) { >>>>>> error_report("vfio: failed to set iommu for container: %m"); >>>>>> ret = -errno; >>>>>> @@ -748,8 +751,10 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as) >>>>>> if ((ret == 0) && (info.flags & VFIO_IOMMU_INFO_PGSIZES)) { >>>>>> container->iova_pgsizes = info.iova_pgsizes; >>>>>> } >>>>>> - } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) { >>>>>> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU) || >>>>>> + ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU)) { >>>>>> struct vfio_iommu_spapr_tce_info info; >>>>>> + bool v2 = !!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_v2_IOMMU); >>>>>> >>>>>> ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd); >>>>>> if (ret) { >>>>>> @@ -757,7 +762,9 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as) >>>>>> ret = -errno; >>>>>> goto free_container_exit; >>>>>> } >>>>>> - ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU); >>>>>> + container->iommu_type = >>>>>> + v2 ? VFIO_SPAPR_TCE_v2_IOMMU : VFIO_SPAPR_TCE_IOMMU; >>>>>> + ret = ioctl(fd, VFIO_SET_IOMMU, container->iommu_type); >>>>> >>>>> It'd be nice to consolidate the setting of container->iommu_type and >>>>> then the SET_IOMMU ioctl() rather than having more or less duplicated >>>>> logic for Type1 and SPAPR, but it's not a big deal. >>>> >>>> >>>> May be but I cannot think of any nice way of doing this though. >>>> >>>> >>>>> >>>>>> if (ret) { >>>>>> error_report("vfio: failed to set iommu for container: %m"); >>>>>> ret = -errno; >>>>>> @@ -769,11 +776,25 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as) >>>>>> * when container fd is closed so we do not call it explicitly >>>>>> * in this file. >>>>>> */ >>>>>> - ret = ioctl(fd, VFIO_IOMMU_ENABLE); >>>>>> - if (ret) { >>>>>> - error_report("vfio: failed to enable container: %m"); >>>>>> - ret = -errno; >>>>>> - goto free_container_exit; >>>>>> + if (!v2) { >>>>>> + ret = ioctl(fd, VFIO_IOMMU_ENABLE); >>>>>> + if (ret) { >>>>>> + error_report("vfio: failed to enable container: %m"); >>>>>> + ret = -errno; >>>>>> + goto free_container_exit; >>>>>> + } >>>>>> + } else { >>>>>> + container->prereg_listener.container = container; >>>>>> + container->prereg_listener.listener = vfio_prereg_listener; >>>>>> + >>>>>> + memory_listener_register(&container->prereg_listener.listener, >>>>>> + &address_space_memory); >>>>> >>>>> This assumes that the target address space of the (guest) IOMMU is >>>>> address_space_memory. Which is fine - vfio already assumes that - but >>>>> it reminds me that it'd be nice to have an explicit check for that (I >>>>> guess it would have to go in vfio_iommu_map_notify()). So that if >>>>> someone constructs a machine where that's not the case, it'll at least >>>>> be obvious why VFIO isn't working. >>>> >>>> Ok, I'll add a small patch for this in the next respin. >>> >>> Ok. >>> >>>>>> + if (container->error) { >>>>>> + error_report("vfio: RAM memory listener initialization failed for container"); >>>>>> + memory_listener_unregister( >>>>>> + &container->prereg_listener.listener); >>>>>> + goto free_container_exit; >>>>>> + } >>>>>> } >>>>> >>>>> Looks like you don't have an error path which will handle the case >>>>> where the prereg listener is registered, but registering the normal >>>>> PCI AS listener fails - I believe you will fail to unregister the >>>>> prereg listener in that case. >>>> >>>> >>>> In this case, the control goes to listener_release_exit: which calls >>>> vfio_listener_release() which unregisters both listeners (it is a few chunks >>>> above). >>> >>> Ah.. yes. In which case this could also jump to listener_release_exit >>> and avoid the explicit unreg(), yes? >> >> >> Sorry, I do not follow you here. It does jump to >> listener_release_exit already. > > I mean you can use the listener_release_exit label on failure of the > prereg listener as well as failure of the regular listener. When vfio_prereg_listener fails, vfio_memory_listener is not registered - it just looks cleaner to jump further to free_container_exit than to listener_release_exit, does not it? >>>>>> /* >>>>>> diff --git a/hw/vfio/prereg.c b/hw/vfio/prereg.c >>>>>> new file mode 100644 >>>>>> index 0000000..66cd696 >>>>>> --- /dev/null >>>>>> +++ b/hw/vfio/prereg.c >>>>>> @@ -0,0 +1,138 @@ >>>>>> +/* >>>>>> + * DMA memory preregistration >>>>>> + * >>>>>> + * Authors: >>>>>> + * Alexey Kardashevskiy >>>>>> + * >>>>>> + * This work is licensed under the terms of the GNU GPL, version 2. See >>>>>> + * the COPYING file in the top-level directory. >>>>>> + */ >>>>>> + >>>>>> +#include "qemu/osdep.h" >>>>>> +#include >>>>>> +#include >>>>>> + >>>>>> +#include "hw/vfio/vfio-common.h" >>>>>> +#include "hw/vfio/vfio.h" >>>>>> +#include "qemu/error-report.h" >>>>>> +#include "trace.h" >>>>>> + >>>>>> +static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section) >>>>>> +{ >>>>>> + return (!memory_region_is_ram(section->mr) && >>>>>> + !memory_region_is_iommu(section->mr)) || >>>>>> + memory_region_is_skip_dump(section->mr); >>>>>> +} >>>>>> + >>>>>> +static void vfio_prereg_listener_region_add(MemoryListener *listener, >>>>>> + MemoryRegionSection *section) >>>>>> +{ >>>>>> + VFIOMemoryListener *vlistener = container_of(listener, VFIOMemoryListener, >>>>>> + listener); >>>>>> + VFIOContainer *container = vlistener->container; >>>>>> + hwaddr iova; >>>>>> + Int128 llend; >>>>>> + int ret; >>>>>> + hwaddr page_mask = qemu_real_host_page_mask; >>>>>> + struct vfio_iommu_spapr_register_memory reg = { >>>>>> + .argsz = sizeof(reg), >>>>>> + .flags = 0, >>>>>> + }; >>>>>> + >>>>>> + if (vfio_prereg_listener_skipped_section(section)) { >>>>>> + trace_vfio_listener_region_add_skip( >>>>>> + section->offset_within_address_space, >>>>>> + section->offset_within_address_space + >>>>>> + int128_get64(int128_sub(section->size, int128_one()))); >>>>>> + return; >>>>>> + } >>>>> >>>>> You should probably explicitly check for IOMMU regions and abort if >>>>> you find one. An IOMMU AS appearing within address_space_memory would >>>>> be bad news. >>>> >>>> >>>> Oh, vfio_prereg_listener_skipped_section() allows memory_region_is_iommu(), >>>> I'll remove it. >>> >>> Well, that's part. >>> >>> But IOMMU regions appearing here shouldn't just be ignored - they >>> should be treated as a fatal error. >> >> >> Is this always an error when they appear in &address_space_memory? Because >> if it is so, we should do this check somewhere in >> memory_region_transaction_commit() (memory_region_add_subregion() cannot >> check this, there is no AS). > > Hmm.. not necessarily, although it would certainly be strange. I can > imagine some sort of off-board memory device which incorporates an > "IO"MMU which will remap requests both from the cpu and from DMA > devices. > > But, that wouldn't work with VFIO, since we no longer know where the > host memory is which is backing address_space_memory to preregister > it. Well.. it might be possible by looking at the *target* AS of an > iommu registered in address_space_memory, setting up a listener on > that, and keeping on going until you reach a real memory region. > > The case is so unlikely that it's not worth actually implementing the > code for. But I think it is worth a sanity check. Ok then. g_assert() or hwerror()? -- Alexey