From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44916) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCip8-00040W-Ll for qemu-devel@nongnu.org; Wed, 08 Jul 2015 02:27:16 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZCip0-0002v6-Ap for qemu-devel@nongnu.org; Wed, 08 Jul 2015 02:27:14 -0400 Received: from mail-pd0-f182.google.com ([209.85.192.182]:33249) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCip0-0002uv-3k for qemu-devel@nongnu.org; Wed, 08 Jul 2015 02:27:06 -0400 Received: by pdbdz6 with SMTP id dz6so45260423pdb.0 for ; Tue, 07 Jul 2015 23:27:05 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-14-git-send-email-aik@ozlabs.ru> <1436190148.3909.55.camel@redhat.com> <559A9FEA.1040609@ozlabs.ru> <1436199187.3909.90.camel@redhat.com> <559BC202.9030003@ozlabs.ru> <1436286263.1391.61.camel@redhat.com> From: Alexey Kardashevskiy Message-ID: <559CC2A7.5080105@ozlabs.ru> Date: Wed, 8 Jul 2015 16:26:47 +1000 MIME-Version: 1.0 In-Reply-To: <1436286263.1391.61.camel@redhat.com> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: Michael Roth , qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan , David Gibson On 07/08/2015 02:24 AM, Alex Williamson wrote: > On Tue, 2015-07-07 at 22:11 +1000, Alexey Kardashevskiy wrote: >> On 07/07/2015 02:13 AM, Alex Williamson wrote: >>> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote: >>>> On 07/06/2015 11:42 PM, Alex Williamson wrote: >>>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote: >>>>>> This makes use of the new "memory registering" feature. The idea is >>>>>> to provide the userspace ability to notify the host kernel about pages >>>>>> which are going to be used for DMA. Having this information, the host >>>>>> kernel can pin them all once per user process, do locked pages >>>>>> accounting (once) and not spent time on doing that in real time with >>>>>> possible failures which cannot be handled nicely in some cases. >>>>>> >>>>>> This adds a guest RAM memory listener which notifies a VFIO container >>>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions >>>>>> (i.e. "skip dump" regions) are skipped. >>>>>> >>>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes >>>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does >>>>>> not call it when v2 is detected and enabled. >>>>>> >>>>>> This does not change the guest visible interface. >>>>>> >>>>>> Signed-off-by: Alexey Kardashevskiy >>>>>> Reviewed-by: David Gibson >>>>>> --- >>>>>> Changes: >>>>>> v9: >>>>>> * since there is no more SPAPR-specific data in container::iommu_data, >>>>>> the memory preregistration fields are common and potentially can be used >>>>>> by other architectures >>>>>> >>>>>> v7: >>>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl() >>>>>> * s'ramlistener'register_listener' >>>>>> >>>>>> v6: >>>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible >>>>>> change >>>>>> * fixed error checking if ram registration failed >>>>>> * added alignment check for section->offset_within_region >>>>>> >>>>>> v5: >>>>>> * simplified the patch >>>>>> * added trace points >>>>>> * added round_up() for the size >>>>>> * SPAPR IOMMU v2 used >>>>>> --- >>>>>> hw/vfio/common.c | 109 ++++++++++++++++++++++++++++++++++++++---- >>>>>> include/hw/vfio/vfio-common.h | 3 ++ >>>>>> trace-events | 1 + >>>>>> 3 files changed, 104 insertions(+), 9 deletions(-) >>>>>> >>>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>>> index 8eacfd7..0c7ba8c 100644 >>>>>> --- a/hw/vfio/common.c >>>>>> +++ b/hw/vfio/common.c >>>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container) >>>>>> memory_listener_unregister(&container->iommu_data.type1.listener); >>>>>> } >>>>>> >>>>>> +static void vfio_ram_do_region(VFIOContainer *container, >>>>>> + MemoryRegionSection *section, unsigned long req) >>>>>> +{ >>>>>> + int ret; >>>>>> + struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) }; >>>>> >>>>> This function is not as general as the name would imply, it's spapr >>>>> specific due to this. How about vfio_spapr_register_memory() with a >>>>> bool parameter toggling register vs unregister so we're not passing an >>>>> arbitrary ioctl number? >>>> >>>> Ok. Although I am quite often asked not to do such a thing and rather add 2 >>>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits. >>> >>> I'm not a fan of functions that do the reverse process based on a bool >>> arg either, but I dislike them less than passing an arbitrary ioctl >>> number for a parameter. The former is ugly, but the latter is difficult >>> to use and difficult to maintain because it would be subtle later to >>> spot an unsupported ioctl being passed to the function. >>> >>>>>> + >>>>>> + if (!memory_region_is_ram(section->mr) || >>>>>> + memory_region_is_skip_dump(section->mr)) { >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + if (unlikely((section->offset_within_region & (getpagesize() - 1)))) { >>>>> >>>>> s/getpagesize()/qemu_real_host_page_size/? >>>> >>>> >>>> Oh, right, I guess it reached upstream now. >>>> >>>> >>>>>> + error_report("%s received unaligned region", __func__); >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) + >>>>>> + section->offset_within_region; >>>>>> + reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE); >>>>>> + >>>>>> + ret = ioctl(container->fd, req, ®); >>>>>> + trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size, >>>>>> + ret ? -errno : 0); >>>>>> + if (!ret) { >>>>>> + return; >>>>>> + } >>>>>> + >>>>>> + /* >>>>>> + * On the initfn path, store the first error in the container so we >>>>>> + * can gracefully fail. Runtime, there's not much we can do other >>>>>> + * than throw a hardware error. >>>>>> + */ >>>>>> + if (!container->iommu_data.ram_reg_initialized) { >>>>>> + if (!container->iommu_data.ram_reg_error) { >>>>>> + container->iommu_data.ram_reg_error = -errno; >>>>>> + } >>>>>> + } else { >>>>>> + hw_error("vfio: RAM registering failed, unable to continue"); >>>>>> + } >>>>> >>>>> I'd rather see: >>>>> >>>>> if (ret) { >>>>> if (!container...) { >>>>> ... >>>>> } else { >>>>> ... >>>>> } >>>>> } >>>>> >>>>> Exiting early on success and otherwise falling into error handling is a >>>>> strange code flow. >>>> >>>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that >>>> strict :) >>> >>> It would be nice to clean it up there too. >>> >>>>>> +} >>>>>> + >>>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener, >>>>>> + MemoryRegionSection *section) >>>>>> +{ >>>>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>>>> + iommu_data.register_listener); >>>>>> + memory_region_ref(section->mr); >>>>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY); >>>>> >>>>> vfio_spapr_register_memory(container, section, true); >>>>> >>>>>> +} >>>>>> + >>>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener, >>>>>> + MemoryRegionSection *section) >>>>>> +{ >>>>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>>>> + iommu_data.register_listener); >>>>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY); >>>>> >>>>> vfio_spapr_register_memory(container, section, false); >>>>> >>>>>> + memory_region_unref(section->mr); >>>>>> +} >>>>>> + >>>>>> +static const MemoryListener vfio_ram_memory_listener = { >>>>>> + .region_add = vfio_ram_listener_region_add, >>>>>> + .region_del = vfio_ram_listener_region_del, >>>>>> +}; >>>>> >>>>> These are all spapr specific, please reflect that in the name; >>>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del. >>>> >>>> ok. >>>> >>>> >>>>> Actually, can't we determine what type of IOMMU we have and make the >>>>> existing MemoryListener handle either type1 or spapr or spapr-v2? >>>> >>>> >>>> Sorry, I do not follow you here. How? The existing listener listens on PCI >>>> address space (at least, on pseries), new one listens on RAM address space >>>> (address_space_memory). What do I miss? >>> >>> Isn't that simply a difference of the address space the listener is >>> attached to? Type1 maps RAM, spapr-v1 maps guest IOMMU space and these >>> are already both handled by the same listener. >> >> >> Ok, I tried merging 2 listeners and realized that the PCI listener works >> with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be >> using an IOMMU page size which is not easily available there but this is a >> different story) and RAM listener with the qemu_real_host_page_size >> granularity (64K for my case) so depending on the address space type, >> vfio_listener_region_add() will have to use different page sizes. I like >> the idea of merging less now... > > Sounds like you're already solving something that needs to be fixed for > both. The type1 VFIO_IOMMU_GET_INFO ioctl does actually give us a > bitmap of supported iommu page sizes. It's really all but useless for > anything except determining the minimum page size. btw what sizes can really come from there? > For the most part we > just assume that it's the same as the host page size, so those existing > checks could actually change to host page alignment pretty safely. I > think we both actually want pages that are both host and target aligned, > don't we? What would you do on a 64k host if the guest tried to map a > region that only had 4k alignment? I will get_user_pages_fast(va & PAGE_MASK) and then write (gpa_to_hpa(va&PAGE_MASK)|(va & ~PAGE_MASK)) to the table, this is what we do now as our typical host uses 64k pages and default 32bit window always uses 4K (irrelevant to the guest page size). > Anyway, if that's the only problem, > it looks more like an opportunity than a barrier. Oh. Ok :) -- Alexey