From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49870) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCGsT-0000Ub-IX for qemu-devel@nongnu.org; Mon, 06 Jul 2015 20:36:51 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZCGsP-0003RW-7k for qemu-devel@nongnu.org; Mon, 06 Jul 2015 20:36:49 -0400 Received: from mail-pd0-f177.google.com ([209.85.192.177]:36259) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZCGsO-0003R0-W2 for qemu-devel@nongnu.org; Mon, 06 Jul 2015 20:36:45 -0400 Received: by pddu5 with SMTP id u5so27172875pdd.3 for ; Mon, 06 Jul 2015 17:36:43 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-14-git-send-email-aik@ozlabs.ru> <1436190148.3909.55.camel@redhat.com> <559A9FEA.1040609@ozlabs.ru> <1436199187.3909.90.camel@redhat.com> <20150707002907.GH17857@voom.redhat.com> From: Alexey Kardashevskiy Message-ID: <559B1F14.5000809@ozlabs.ru> Date: Tue, 7 Jul 2015 10:36:36 +1000 MIME-Version: 1.0 In-Reply-To: <20150707002907.GH17857@voom.redhat.com> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: David Gibson , Alex Williamson Cc: Michael Roth , qemu-ppc@nongnu.org, qemu-devel@nongnu.org, Gavin Shan On 07/07/2015 10:29 AM, David Gibson wrote: > On Mon, Jul 06, 2015 at 10:13:07AM -0600, Alex Williamson wrote: >> On Tue, 2015-07-07 at 01:34 +1000, Alexey Kardashevskiy wrote: >>> On 07/06/2015 11:42 PM, Alex Williamson wrote: >>>> On Mon, 2015-07-06 at 12:11 +1000, Alexey Kardashevskiy wrote: >>>>> This makes use of the new "memory registering" feature. The idea is >>>>> to provide the userspace ability to notify the host kernel about pages >>>>> which are going to be used for DMA. Having this information, the host >>>>> kernel can pin them all once per user process, do locked pages >>>>> accounting (once) and not spent time on doing that in real time with >>>>> possible failures which cannot be handled nicely in some cases. >>>>> >>>>> This adds a guest RAM memory listener which notifies a VFIO container >>>>> about memory which needs to be pinned/unpinned. VFIO MMIO regions >>>>> (i.e. "skip dump" regions) are skipped. >>>>> >>>>> The feature is only enabled for SPAPR IOMMU v2. The host kernel changes >>>>> are required. Since v2 does not need/support VFIO_IOMMU_ENABLE, this does >>>>> not call it when v2 is detected and enabled. >>>>> >>>>> This does not change the guest visible interface. >>>>> >>>>> Signed-off-by: Alexey Kardashevskiy >>>>> Reviewed-by: David Gibson >>>>> --- >>>>> Changes: >>>>> v9: >>>>> * since there is no more SPAPR-specific data in container::iommu_data, >>>>> the memory preregistration fields are common and potentially can be used >>>>> by other architectures >>>>> >>>>> v7: >>>>> * in vfio_spapr_ram_listener_region_del(), do unref() after ioctl() >>>>> * s'ramlistener'register_listener' >>>>> >>>>> v6: >>>>> * fixed commit log (s/guest/userspace/), added note about no guest visible >>>>> change >>>>> * fixed error checking if ram registration failed >>>>> * added alignment check for section->offset_within_region >>>>> >>>>> v5: >>>>> * simplified the patch >>>>> * added trace points >>>>> * added round_up() for the size >>>>> * SPAPR IOMMU v2 used >>>>> --- >>>>> hw/vfio/common.c | 109 ++++++++++++++++++++++++++++++++++++++---- >>>>> include/hw/vfio/vfio-common.h | 3 ++ >>>>> trace-events | 1 + >>>>> 3 files changed, 104 insertions(+), 9 deletions(-) >>>>> >>>>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >>>>> index 8eacfd7..0c7ba8c 100644 >>>>> --- a/hw/vfio/common.c >>>>> +++ b/hw/vfio/common.c >>>>> @@ -488,6 +488,76 @@ static void vfio_listener_release(VFIOContainer *container) >>>>> memory_listener_unregister(&container->iommu_data.type1.listener); >>>>> } >>>>> >>>>> +static void vfio_ram_do_region(VFIOContainer *container, >>>>> + MemoryRegionSection *section, unsigned long req) >>>>> +{ >>>>> + int ret; >>>>> + struct vfio_iommu_spapr_register_memory reg = { .argsz = sizeof(reg) }; >>>> >>>> This function is not as general as the name would imply, it's spapr >>>> specific due to this. How about vfio_spapr_register_memory() with a >>>> bool parameter toggling register vs unregister so we're not passing an >>>> arbitrary ioctl number? >>> >>> Ok. Although I am quite often asked not to do such a thing and rather add 2 >>> helpers (reg/unreg, do/undo, etc) instead and reuse common bits. >> >> I'm not a fan of functions that do the reverse process based on a bool >> arg either, but I dislike them less than passing an arbitrary ioctl >> number for a parameter. The former is ugly, but the latter is difficult >> to use and difficult to maintain because it would be subtle later to >> spot an unsupported ioctl being passed to the function. >> >>>>> + >>>>> + if (!memory_region_is_ram(section->mr) || >>>>> + memory_region_is_skip_dump(section->mr)) { >>>>> + return; >>>>> + } >>>>> + >>>>> + if (unlikely((section->offset_within_region & (getpagesize() - 1)))) { >>>> >>>> s/getpagesize()/qemu_real_host_page_size/? >>> >>> >>> Oh, right, I guess it reached upstream now. >>> >>> >>>>> + error_report("%s received unaligned region", __func__); >>>>> + return; >>>>> + } >>>>> + >>>>> + reg.vaddr = (__u64) memory_region_get_ram_ptr(section->mr) + >>>>> + section->offset_within_region; >>>>> + reg.size = ROUND_UP(int128_get64(section->size), TARGET_PAGE_SIZE); >>>>> + >>>>> + ret = ioctl(container->fd, req, ®); >>>>> + trace_vfio_ram_register(_IOC_NR(req) - VFIO_BASE, reg.vaddr, reg.size, >>>>> + ret ? -errno : 0); >>>>> + if (!ret) { >>>>> + return; >>>>> + } >>>>> + >>>>> + /* >>>>> + * On the initfn path, store the first error in the container so we >>>>> + * can gracefully fail. Runtime, there's not much we can do other >>>>> + * than throw a hardware error. >>>>> + */ >>>>> + if (!container->iommu_data.ram_reg_initialized) { >>>>> + if (!container->iommu_data.ram_reg_error) { >>>>> + container->iommu_data.ram_reg_error = -errno; >>>>> + } >>>>> + } else { >>>>> + hw_error("vfio: RAM registering failed, unable to continue"); >>>>> + } >>>> >>>> I'd rather see: >>>> >>>> if (ret) { >>>> if (!container...) { >>>> ... >>>> } else { >>>> ... >>>> } >>>> } >>>> >>>> Exiting early on success and otherwise falling into error handling is a >>>> strange code flow. >>> >>> Ok... vfio_dma_map() does not follow this rule so I thought it is not that >>> strict :) >> >> It would be nice to clean it up there too. >> >>>>> +} >>>>> + >>>>> +static void vfio_ram_listener_region_add(MemoryListener *listener, >>>>> + MemoryRegionSection *section) >>>>> +{ >>>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>>> + iommu_data.register_listener); >>>>> + memory_region_ref(section->mr); >>>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_REGISTER_MEMORY); >>>> >>>> vfio_spapr_register_memory(container, section, true); >>>> >>>>> +} >>>>> + >>>>> +static void vfio_ram_listener_region_del(MemoryListener *listener, >>>>> + MemoryRegionSection *section) >>>>> +{ >>>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>>> + iommu_data.register_listener); >>>>> + vfio_ram_do_region(container, section, VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY); >>>> >>>> vfio_spapr_register_memory(container, section, false); >>>> >>>>> + memory_region_unref(section->mr); >>>>> +} >>>>> + >>>>> +static const MemoryListener vfio_ram_memory_listener = { >>>>> + .region_add = vfio_ram_listener_region_add, >>>>> + .region_del = vfio_ram_listener_region_del, >>>>> +}; >>>> >>>> These are all spapr specific, please reflect that in the name; >>>> vfio_spapr_v2_memory_listener, vfio_spapr_v2_listener_add/del. >>> >>> ok. >>> >>> >>>> Actually, can't we determine what type of IOMMU we have and make the >>>> existing MemoryListener handle either type1 or spapr or spapr-v2? >>> >>> >>> Sorry, I do not follow you here. How? The existing listener listens on PCI >>> address space (at least, on pseries), new one listens on RAM address space >>> (address_space_memory). What do I miss? >> >> Isn't that simply a difference of the address space the listener is >> attached to? Type1 maps RAM, spapr-v1 maps guest IOMMU space and these >> are already both handled by the same listener. > > I think what you're missing is that the spapr code now needs to listen > on *both* the RAM and PCI address spaces. On RAM so it can do the > preregistration, and on PCI so it can do the actual IOMMU mappings. We had a chat with Alex. On x86 this listener listens on RAM already (as a fallback result in pci_device_iommu_address_space), it is SPAPR who does not. So now the plan is to keep using the same listener for both RAM and PCI but without the filter and do filtering in the callbacks. > What might make sense, although it might be better as a later cleanup > is to bake into the common code the idea of two listeners - one for > new RAM regions, one for new PCI mappings, with the actual actions for > each case dependent on the IOMMU type. May be one day. Do we have any other arch with guest visible IOMMU coming? Preferably x86? -- Alexey