From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57097) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCRjD-0002sc-VP for qemu-devel@nongnu.org; Tue, 07 Jul 2015 08:12:01 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZCRj8-0000H8-LB for qemu-devel@nongnu.org; Tue, 07 Jul 2015 08:11:59 -0400 Received: from mail-pd0-f180.google.com ([209.85.192.180]:34220) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCRj8-0000Gi-E0 for qemu-devel@nongnu.org; Tue, 07 Jul 2015 08:11:54 -0400 Received: by pdbep18 with SMTP id ep18so124963433pdb.1 for ; Tue, 07 Jul 2015 05:11:54 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-14-git-send-email-aik@ozlabs.ru> <1436190148.3909.55.camel@redhat.com> <559A9FEA.1040609@ozlabs.ru> <1436199187.3909.90.camel@redhat.com> From: Alexey Kardashevskiy Message-ID: <559BC202.9030003@ozlabs.ru> Date: Tue, 7 Jul 2015 22:11:46 +1000 MIME-Version: 1.0 In-Reply-To: <1436199187.3909.90.camel@redhat.com> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Alex Williamson Cc: Michael Roth , qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan , David Gibson On 07/07/2015 02:13 AM, Alex Williamson wrote: > On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote: >> On 07/06/2015 11:42 PM, Alex Williamson wrote: >>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote: >>>> This makes use of the new "memory registering" feature. The idea is >>>> to provide the userspace ability to notify the host kernel about pages >>>> which are going to be used for DMA. Having this information, the host >>>> kernel can pin them all once per user process, do locked pages >>>> accounting (once) and not spent time on doing that in real time with >>>> possible failures which cannot be handled nicely in some cases. >>>> >>>> This adds a guest RAM memory listener which notifies a VFIO container >>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions >>>> (i.e. "skip dump" regions) are skipped. >>>> >>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes >>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does >>>> not call it when v2 is detected and enabled. >>>> >>>> This does not change the guest visible interface. >>>> >>>> Signed-off-by: Alexey Kardashevskiy >>>> Reviewed-by: David Gibson >>>> --- >>>> Changes: >>>> v9: >>>> * since there is no more SPAPR-specific data in container::iommu_data, >>>> the memory preregistration fields are common and potentially can be used >>>> by other architectures >>>> >>>> v7: >>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl() >>>> * s'ramlistener'register_listener' >>>> >>>> v6: >>>> * fixed commit log (s/guest/userspace/), added note about no guest visible >>>> change >>>> * fixed error checking if ram registration failed >>>> * added alignment check for section->offset_within_region >>>> >>>> v5: >>>> * simplified the patch >>>> * added trace points >>>> * added round_up() for the size >>>> * SPAPR IOMMU v2 used >>>> --- >>>> hw/vfio/common.c | 109 ++++++++++++++++++++++++++++++++++++++---- >>>> include/hw/vfio/vfio-common.h | 3 ++ >>>> trace-events | 1 + >>>> 3 files changed, 104 insertions(+), 9 deletions(-) >>>> >>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>> index 8eacfd7..0c7ba8c 100644 >>>> --- a/hw/vfio/common.c >>>> +++ b/hw/vfio/common.c >>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container) >>>> memory_listener_unregister(&container->iommu_data.type1.listener); >>>> } >>>> >>>> +static void vfio_ram_do_region(VFIOContainer *container, >>>> + MemoryRegionSection *section, unsigned long req) >>>> +{ >>>> + int ret; >>>> + struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) }; >>> >>> This function is not as general as the name would imply, it's spapr >>> specific due to this. How about vfio_spapr_register_memory() with a >>> bool parameter toggling register vs unregister so we're not passing an >>> arbitrary ioctl number? >> >> Ok. Although I am quite often asked not to do such a thing and rather add 2 >> helpers (reg/unreg, do/undo, etc) instead and reuse common bits. > > I'm not a fan of functions that do the reverse process based on a bool > arg either, but I dislike them less than passing an arbitrary ioctl > number for a parameter. The former is ugly, but the latter is difficult > to use and difficult to maintain because it would be subtle later to > spot an unsupported ioctl being passed to the function. > >>>> + >>>> + if (!memory_region_is_ram(section->mr) || >>>> + memory_region_is_skip_dump(section->mr)) { >>>> + return; >>>> + } >>>> + >>>> + if (unlikely((section->offset_within_region & (getpagesize() - 1)))) { >>> >>> s/getpagesize()/qemu_real_host_page_size/? >> >> >> Oh, right, I guess it reached upstream now. >> >> >>>> + error_report("%s received unaligned region", __func__); >>>> + return; >>>> + } >>>> + >>>> + reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) + >>>> + section->offset_within_region; >>>> + reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE); >>>> + >>>> + ret = ioctl(container->fd, req, ®); >>>> + trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size, >>>> + ret ? -errno : 0); >>>> + if (!ret) { >>>> + return; >>>> + } >>>> + >>>> + /* >>>> + * On the initfn path, store the first error in the container so we >>>> + * can gracefully fail. Runtime, there's not much we can do other >>>> + * than throw a hardware error. >>>> + */ >>>> + if (!container->iommu_data.ram_reg_initialized) { >>>> + if (!container->iommu_data.ram_reg_error) { >>>> + container->iommu_data.ram_reg_error = -errno; >>>> + } >>>> + } else { >>>> + hw_error("vfio: RAM registering failed, unable to continue"); >>>> + } >>> >>> I'd rather see: >>> >>> if (ret) { >>> if (!container...) { >>> ... >>> } else { >>> ... >>> } >>> } >>> >>> Exiting early on success and otherwise falling into error handling is a >>> strange code flow. >> >> Ok... vfio_dma_map() does not follow this rule so I thought it is not that >> strict :) > > It would be nice to clean it up there too. > >>>> +} >>>> + >>>> +static void vfio_ram_listener_region_add(MemoryListener *listener, >>>> + MemoryRegionSection *section) >>>> +{ >>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>> + iommu_data.register_listener); >>>> + memory_region_ref(section->mr); >>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY); >>> >>> vfio_spapr_register_memory(container, section, true); >>> >>>> +} >>>> + >>>> +static void vfio_ram_listener_region_del(MemoryListener *listener, >>>> + MemoryRegionSection *section) >>>> +{ >>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>> + iommu_data.register_listener); >>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY); >>> >>> vfio_spapr_register_memory(container, section, false); >>> >>>> + memory_region_unref(section->mr); >>>> +} >>>> + >>>> +static const MemoryListener vfio_ram_memory_listener = { >>>> + .region_add = vfio_ram_listener_region_add, >>>> + .region_del = vfio_ram_listener_region_del, >>>> +}; >>> >>> These are all spapr specific, please reflect that in the name; >>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del. >> >> ok. >> >> >>> Actually, can't we determine what type of IOMMU we have and make the >>> existing MemoryListener handle either type1 or spapr or spapr-v2? >> >> >> Sorry, I do not follow you here. How? The existing listener listens on PCI >> address space (at least, on pseries), new one listens on RAM address space >> (address_space_memory). What do I miss? > > Isn't that simply a difference of the address space the listener is > attached to? Type1 maps RAM, spapr-v1 maps guest IOMMU space and these > are already both handled by the same listener. Ok, I tried merging 2 listeners and realized that the PCI listener works with TARGET_PAGE_SIZE granularity (which is 4K and actually it should be using an IOMMU page size which is not easily available there but this is a different story) and RAM listener with the qemu_real_host_page_size granularity (64K for my case) so depending on the address space type, vfio_listener_region_add() will have to use different page sizes. I like the idea of merging less now... -- Alexey