From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:56716) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZEXpY-0000b1-4u for qemu-devel@nongnu.org; Mon, 13 Jul 2015 03:07:14 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZEXpT-0000SO-9Q for qemu-devel@nongnu.org; Mon, 13 Jul 2015 03:07:12 -0400 Received: from mail-pd0-f173.google.com ([209.85.192.173]:32934) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZEXpT-0000S3-11 for qemu-devel@nongnu.org; Mon, 13 Jul 2015 03:07:07 -0400 Received: by pdbqm3 with SMTP id qm3so75330211pdb.0 for ; Mon, 13 Jul 2015 00:07:04 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-11-git-send-email-aik@ozlabs.ru> <20150710213301.22411.30046@loki> <55A1F441.9050705@ozlabs.ru> <20150712144127.4709.81283@loki> From: Alexey Kardashevskiy Message-ID: <55A3638F.1070504@ozlabs.ru> Date: Mon, 13 Jul 2015 17:06:55 +1000 MIME-Version: 1.0 In-Reply-To: <20150712144127.4709.81283@loki> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Michael Roth , qemu-devel@nongnu.org Cc: Alex Williamson , qemu-ppc@nongnu.org, Gavin Shan , David Gibson On 07/13/2015 12:41 AM, Michael Roth wrote: > Quoting Alexey Kardashevskiy (2015-07-11 23:59:45) >> On 07/11/2015 07:33 AM, Michael Roth wrote: >>> Quoting Alexey Kardashevskiy (2015-07-05 21:11:06) >>>> sPAPR IOMMU is managing two copies of an TCE table: >>>> 1) a guest view of the table - this is what emulated devices use and >>>> this is where H_GET_TCE reads from; >>>> 2) a hardware TCE table - only present if there is at least one vfio-pci >>>> device on a PHB; it is updated via a memory listener on a PHB address >>>> space which forwards map/unmap requests to vfio-pci IOMMU host driver. >>>> >>>> At the moment presence of vfio-pci devices on a bus affect the way >>>> the guest view table is allocated. If there is no vfio-pci on a PHB >>>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table >>>> is allocated in KVM. However, if there is vfio-pci and we do yet not >>>> support KVM acceleration for these, the table has to be allocated >>>> by the userspace. >>>> >>>> When vfio-pci device is hotplugged and there were no vfio-pci devices >>>> already, the guest view table could have been allocated by KVM which >>>> means that H_PUT_TCE is handled by the host kernel and since we >>>> do not support vfio-pci in KVM, the hardware table will not be updated. >>>> >>>> This reallocates the guest view table in QEMU if the first vfio-pci >>>> device has just been plugged. spapr_tce_realloc_userspace() handles this. >>>> >>>> This replays all the mappings to make sure that the tables are in sync. >>>> This will not have a visible effect though as for a new device >>>> the guest kernel will allocate-and-map new addresses and therefore >>>> existing mappings from emulated devices will not be used by vfio-pci >>>> devices. >>>> >>>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug >>>> hooks. >>>> >>>> Signed-off-by: Alexey Kardashevskiy >>>> --- >>>> Changes: >>>> v10: >>>> * removed unnecessary memory_region_del_subregion() and >>>> memory_region_add_subregion() as >>>> "vfio: Unregister IOMMU notifiers when container is destroyed" removes >>>> notifiers in a more correct way >>>> >>>> v9: >>>> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than >>>> via object_child_foreach() >>>> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() + >>>> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not >>>> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!) >>>> if we do hotplug+hotunplug+hotplug of the same device. >>>> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling >>>> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise >>>> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device. >>>> Actual VFIO PCI device release happens from rcu and since we add ours later, >>>> it gets executed later and we are good. >>>> --- >>>> hw/ppc/spapr_iommu.c | 51 ++++++++++++++++++++++++++++++++++++++++++--- >>>> hw/ppc/spapr_pci.c | 47 +++++++++++++++++++++++++++++++++++++++++ >>>> include/hw/pci-host/spapr.h | 1 + >>>> include/hw/ppc/spapr.h | 2 ++ >>>> trace-events | 2 ++ >>>> 5 files changed, 100 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c >>>> index 45c00d8..2d99c3b 100644 >>>> --- a/hw/ppc/spapr_iommu.c >>>> +++ b/hw/ppc/spapr_iommu.c >>>> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn, >>>> uint32_t nb_table, >>>> uint32_t page_shift, >>>> int *fd, >>>> - bool vfio_accel) >>>> + bool vfio_accel, >>>> + bool force_userspace) >>>> { >>>> uint64_t *table = NULL; >>>> uint64_t window_size = (uint64_t)nb_table << page_shift; >>>> >>>> - if (kvm_enabled() && !(window_size >> 32)) { >>>> + if (kvm_enabled() && !force_userspace && !(window_size >> 32)) { >>>> table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel); >>>> } >>>> >>>> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel) >>>> tcet->nb_table, >>>> tcet->page_shift, >>>> &tcet->fd, >>>> - vfio_accel); >>>> + vfio_accel, >>>> + false); >>>> >>>> memory_region_set_size(&tcet->iommu, >>>> (uint64_t)tcet->nb_table << tcet->page_shift); >>>> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname, >>>> return 0; >>>> } >>>> >>>> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table) >>>> +{ >>>> + target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift); >>>> + long i, ret = 0; >>>> + >>>> + for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) { >>>> + ret = put_tce_emu(tcet, ioba, table[i]); >>>> + if (ret) { >>>> + break; >>>> + } >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +int spapr_tce_replay(sPAPRTCETable *tcet) >>>> +{ >>>> + return spapr_tce_do_replay(tcet, tcet->table); >>>> +} >>>> + >>>> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay) >>>> +{ >>>> + int ret = 0, oldfd; >>>> + uint64_t *oldtable; >>>> + >>>> + oldtable = tcet->table; >>>> + oldfd = tcet->fd; >>>> + tcet->table = spapr_tce_alloc_table(tcet->liobn, >>>> + tcet->nb_table, >>>> + tcet->page_shift, >>>> + &tcet->fd, >>>> + false, >>>> + true); /* force_userspace */ >>>> + >>>> + if (replay) { >>>> + ret = spapr_tce_do_replay(tcet, oldtable); >>>> + } >>>> + >>>> + spapr_tce_free_table(oldtable, oldfd, tcet->nb_table); >>>> + >>>> + return ret; >>>> +} >>>> + >>>> int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname, >>>> sPAPRTCETable *tcet) >>>> { >>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c >>>> index 76c988f..d1fa157 100644 >>>> --- a/hw/ppc/spapr_pci.c >>>> +++ b/hw/ppc/spapr_pci.c >>>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb) >>>> return 0; >>>> } >>>> >>>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb) >>>> +{ >>>> + int ret = 0, i; >>>> + bool had_vfio = sphb->has_vfio; >>>> + sPAPRTCETable *tcet; >>>> + >>>> + spapr_phb_dma_capabilities_update(sphb); >>> >>> So, in the unplug case, we update caps, but has_vfio = false so we don't do >>> anything else below. >> >> Yes. >> >> >>> Does that mean our KVM-accelerated TCE table won't get restored until reboot? >>> Would it make sense to re-enable it here? >> >> No, it shold be reenabled as DMA config is completely reset during the >> machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of >> DMA config when resetting PHB" > > We don't get a PHB-level reset for PCI hotplug though, so it wouldn't > get re-enabled till guest system reset. Yes, this is what I said :) > I'm not sure how big a deal that > is performance-wise, but it seems a little unexpected. True... >>>> + >>>> + if (!had_vfio && sphb->has_vfio) { >>>> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) { >>>> + tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i)); >>>> + if (!tcet || !tcet->enabled) { >>>> + continue; >>>> + } >>>> + if (tcet->fd >= 0) { >>>> + /* >>>> + * We got first vfio-pci device on accelerated table. >>>> + * VFIO acceleration is not possible. >>>> + * Reallocate table in userspace and replay mappings. >>>> + */ >>>> + ret = spapr_tce_realloc_userspace(tcet, true); >>>> + trace_spapr_pci_dma_realloc_update(tcet->liobn, ret); >>>> + } else { >>>> + /* There was no acceleration, so just replay mappings. */ >>>> + ret = spapr_tce_replay(tcet); >>>> + trace_spapr_pci_dma_update(tcet->liobn, ret); >>>> + } >>>> + if (ret) { >>>> + break; >>>> + } >>>> + } >>>> + return ret; >>>> + } >>>> + >>>> + return 0; >>>> +} >>>> + >>>> /* Macros to operate with address in OF binding to PCI */ >>>> #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p)) >>>> #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */ >>>> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc, >>>> error_setg(errp, "Failed to create pci child device tree node"); >>>> goto out; >>>> } >>>> + spapr_phb_hotplug_dma_sync(phb); >>>> } >>>> >>>> drck->attach(drc, DEVICE(pdev), >>>> @@ -1116,6 +1154,12 @@ out: >>>> } >>>> } >>>> >>>> +static void spapr_phb_remove_sync_dma(struct rcu_head *head) >>>> +{ >>>> + sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu); >>>> + spapr_phb_hotplug_dma_sync(sphb); >>>> +} >>>> + >>>> static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque) >>>> { >>>> /* some version guests do not wait for completion of a device >>>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque) >>>> */ >>>> pci_device_reset(PCI_DEVICE(dev)); >>>> object_unparent(OBJECT(dev)); >>>> + >>>> + /* Actual VFIO device release happens from RCU so postpone DMA update */ >>>> + call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma); >>> >>> Hmm... can't think of any reason this wouldn't work, but would be nice >>> if there was something a bit more straightforward... >>> >>> When the device is actually finalized, it does: >> >> The problem is with "when". I looked at gdb, this vfio_instance_finalize() >> is called from an RCU handler because the last reference is dropped because >> of some memory region was removed and this was postponed to RCU. >> >> If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same >> stack, I would not need this call_rcu1. > > Right, object_unparent() has no guaruntee of immediately finalizing it, > but you *do* have the guaruntee that > vfio_instance_finalize()->vfio_put_group() will only be called once the > device is actually finalized, regardless of whether or not it's kicked > off by the RCU thread. So it seems more straightforward to hook into > that rather than needing to employ internal knowledge of > object_unparent(). Right... I'll try adding a hook and see what review I'll receive. >> >>> static void vfio_instance_finalize(Object *obj) >>> { >>> PCIDevice *pci_dev = PCI_DEVICE(obj); >>> VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev); >>> VFIOGroup *group = vdev->vbasedev.group; >>> >>> ... >>> >>> vfio_put_device(vdev); >>> vfio_put_group(group); >>> } >>> >>> When all the groups are removed from a VFIO container, there's a >>> call to container->iommu_data.release(container). This is the >>> event we really care about, not so much the fact that a device >>> got released. >>> >>> Right now all it does it remove the memory listener, but maybe it >>> makes sense to allow an additional callback/opaque to register for >>> the event. Not sure what the best way to do that is though... >> >> In this context I rather care about container's fd being closed so >> VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know >> that there is no more VFIO devices. > > VFIO container getting closed also corresponds to the last group being > removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO > would fail unless at least one iommu group was attached to the > container? So knowing when the first/last group is removed seems to > be the real main event. Right. This is still VFIO knowledge which PHB does not have access to. We will need these new hooks. >>> And, kind of a separate topic, but if we could do something >>> similar for the initial group attach, we could drop *all* the >>> plug/unplug hooks, and the hooks themselves could drop all >>> the !had_vfio / has_vfio logic/probing, since that would then >>> be clear from the context. >> >> Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they >> what HMP calls on "device_add"? > > I mean all the places we call into code that ends up doing: > spapr_phb_dma_capabilities_update(sphb); > /* do something special if has_vfio changed */ > > We currently have one in PCI plug, PCI unplug, and PHB reset. If we > hooked into vfio_put_group(), we could drop each of those hooks. PHB > reset would still have the special case of restoring default 32-bit > window config, but it wouldn't need to care about has_vfio status > anymore, all that code could be handled by > vfio_put_group/vfio_get_group callbacks. > > I wouldn't hold up the series for it, but I think it would greatly > simplify tracking has_vfio changes. It is now 2 series and I can try what you suggest. > But I do think spapr_phb_hotplug_dma_sync() should have some logic > squashed in for re-enabling TCE acceleration on > (had_vfio && !has_vfio). -- Alexey