From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55888) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZE9Mt-0005fg-P5 for qemu-devel@nongnu.org; Sun, 12 Jul 2015 01:00:01 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZE9Mq-0007wM-HW for qemu-devel@nongnu.org; Sun, 12 Jul 2015 00:59:59 -0400 Received: from mail-pd0-f174.google.com ([209.85.192.174]:33817) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZE9Mq-0007u8-7y for qemu-devel@nongnu.org; Sun, 12 Jul 2015 00:59:56 -0400 Received: by pdbep18 with SMTP id ep18so206410779pdb.1 for ; Sat, 11 Jul 2015 21:59:53 -0700 (PDT) References: <1436148670-6592-1-git-send-email-aik@ozlabs.ru> <1436148670-6592-11-git-send-email-aik@ozlabs.ru> <20150710213301.22411.30046@loki> From: Alexey Kardashevskiy Message-ID: <55A1F441.9050705@ozlabs.ru> Date: Sun, 12 Jul 2015 14:59:45 +1000 MIME-Version: 1.0 In-Reply-To: <20150710213301.22411.30046@loki> Content-Type: text/plain; charset=koi8-r; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Michael Roth , qemu-devel@nongnu.org Cc: Alex Williamson , qemu-ppc@nongnu.org, Gavin Shan , David Gibson On 07/11/2015 07:33 AM, Michael Roth wrote: > Quoting Alexey Kardashevskiy (2015-07-05 21:11:06) >> sPAPR IOMMU is managing two copies of an TCE table: >> 1) a guest view of the table - this is what emulated devices use and >> this is where H_GET_TCE reads from; >> 2) a hardware TCE table - only present if there is at least one vfio-pci >> device on a PHB; it is updated via a memory listener on a PHB address >> space which forwards map/unmap requests to vfio-pci IOMMU host driver. >> >> At the moment presence of vfio-pci devices on a bus affect the way >> the guest view table is allocated. If there is no vfio-pci on a PHB >> and the host kernel supports KVM acceleration of H_PUT_TCE, a table >> is allocated in KVM. However, if there is vfio-pci and we do yet not >> support KVM acceleration for these, the table has to be allocated >> by the userspace. >> >> When vfio-pci device is hotplugged and there were no vfio-pci devices >> already, the guest view table could have been allocated by KVM which >> means that H_PUT_TCE is handled by the host kernel and since we >> do not support vfio-pci in KVM, the hardware table will not be updated. >> >> This reallocates the guest view table in QEMU if the first vfio-pci >> device has just been plugged. spapr_tce_realloc_userspace() handles this. >> >> This replays all the mappings to make sure that the tables are in sync. >> This will not have a visible effect though as for a new device >> the guest kernel will allocate-and-map new addresses and therefore >> existing mappings from emulated devices will not be used by vfio-pci >> devices. >> >> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug >> hooks. >> >> Signed-off-by: Alexey Kardashevskiy >> --- >> Changes: >> v10: >> * removed unnecessary memory_region_del_subregion() and >> memory_region_add_subregion() as >> "vfio: Unregister IOMMU notifiers when container is destroyed" removes >> notifiers in a more correct way >> >> v9: >> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than >> via object_child_foreach() >> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() + >> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not >> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!) >> if we do hotplug+hotunplug+hotplug of the same device. >> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling >> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise >> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device. >> Actual VFIO PCI device release happens from rcu and since we add ours later, >> it gets executed later and we are good. >> --- >> hw/ppc/spapr_iommu.c | 51 ++++++++++++++++++++++++++++++++++++++++++--- >> hw/ppc/spapr_pci.c | 47 +++++++++++++++++++++++++++++++++++++++++ >> include/hw/pci-host/spapr.h | 1 + >> include/hw/ppc/spapr.h | 2 ++ >> trace-events | 2 ++ >> 5 files changed, 100 insertions(+), 3 deletions(-) >> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c >> index 45c00d8..2d99c3b 100644 >> --- a/hw/ppc/spapr_iommu.c >> +++ b/hw/ppc/spapr_iommu.c >> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn, >> uint32_t nb_table, >> uint32_t page_shift, >> int *fd, >> - bool vfio_accel) >> + bool vfio_accel, >> + bool force_userspace) >> { >> uint64_t *table = NULL; >> uint64_t window_size = (uint64_t)nb_table << page_shift; >> >> - if (kvm_enabled() && !(window_size >> 32)) { >> + if (kvm_enabled() && !force_userspace && !(window_size >> 32)) { >> table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel); >> } >> >> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel) >> tcet->nb_table, >> tcet->page_shift, >> &tcet->fd, >> - vfio_accel); >> + vfio_accel, >> + false); >> >> memory_region_set_size(&tcet->iommu, >> (uint64_t)tcet->nb_table << tcet->page_shift); >> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname, >> return 0; >> } >> >> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table) >> +{ >> + target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift); >> + long i, ret = 0; >> + >> + for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) { >> + ret = put_tce_emu(tcet, ioba, table[i]); >> + if (ret) { >> + break; >> + } >> + } >> + >> + return ret; >> +} >> + >> +int spapr_tce_replay(sPAPRTCETable *tcet) >> +{ >> + return spapr_tce_do_replay(tcet, tcet->table); >> +} >> + >> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay) >> +{ >> + int ret = 0, oldfd; >> + uint64_t *oldtable; >> + >> + oldtable = tcet->table; >> + oldfd = tcet->fd; >> + tcet->table = spapr_tce_alloc_table(tcet->liobn, >> + tcet->nb_table, >> + tcet->page_shift, >> + &tcet->fd, >> + false, >> + true); /* force_userspace */ >> + >> + if (replay) { >> + ret = spapr_tce_do_replay(tcet, oldtable); >> + } >> + >> + spapr_tce_free_table(oldtable, oldfd, tcet->nb_table); >> + >> + return ret; >> +} >> + >> int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname, >> sPAPRTCETable *tcet) >> { >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c >> index 76c988f..d1fa157 100644 >> --- a/hw/ppc/spapr_pci.c >> +++ b/hw/ppc/spapr_pci.c >> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb) >> return 0; >> } >> >> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb) >> +{ >> + int ret = 0, i; >> + bool had_vfio = sphb->has_vfio; >> + sPAPRTCETable *tcet; >> + >> + spapr_phb_dma_capabilities_update(sphb); > > So, in the unplug case, we update caps, but has_vfio = false so we don't do > anything else below. Yes. > Does that mean our KVM-accelerated TCE table won't get restored until reboot? > Would it make sense to re-enable it here? No, it shold be reenabled as DMA config is completely reset during the machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB" >> + >> + if (!had_vfio && sphb->has_vfio) { >> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) { >> + tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i)); >> + if (!tcet || !tcet->enabled) { >> + continue; >> + } >> + if (tcet->fd >= 0) { >> + /* >> + * We got first vfio-pci device on accelerated table. >> + * VFIO acceleration is not possible. >> + * Reallocate table in userspace and replay mappings. >> + */ >> + ret = spapr_tce_realloc_userspace(tcet, true); >> + trace_spapr_pci_dma_realloc_update(tcet->liobn, ret); >> + } else { >> + /* There was no acceleration, so just replay mappings. */ >> + ret = spapr_tce_replay(tcet); >> + trace_spapr_pci_dma_update(tcet->liobn, ret); >> + } >> + if (ret) { >> + break; >> + } >> + } >> + return ret; >> + } >> + >> + return 0; >> +} >> + >> /* Macros to operate with address in OF binding to PCI */ >> #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p)) >> #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */ >> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc, >> error_setg(errp, "Failed to create pci child device tree node"); >> goto out; >> } >> + spapr_phb_hotplug_dma_sync(phb); >> } >> >> drck->attach(drc, DEVICE(pdev), >> @@ -1116,6 +1154,12 @@ out: >> } >> } >> >> +static void spapr_phb_remove_sync_dma(struct rcu_head *head) >> +{ >> + sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu); >> + spapr_phb_hotplug_dma_sync(sphb); >> +} >> + >> static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque) >> { >> /* some version guests do not wait for completion of a device >> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque) >> */ >> pci_device_reset(PCI_DEVICE(dev)); >> object_unparent(OBJECT(dev)); >> + >> + /* Actual VFIO device release happens from RCU so postpone DMA update */ >> + call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma); > > Hmm... can't think of any reason this wouldn't work, but would be nice > if there was something a bit more straightforward... > > When the device is actually finalized, it does: The problem is with "when". I looked at gdb, this vfio_instance_finalize() is called from an RCU handler because the last reference is dropped because of some memory region was removed and this was postponed to RCU. If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same stack, I would not need this call_rcu1. > static void vfio_instance_finalize(Object *obj) > { > PCIDevice *pci_dev = PCI_DEVICE(obj); > VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev); > VFIOGroup *group = vdev->vbasedev.group; > > ... > > vfio_put_device(vdev); > vfio_put_group(group); > } > > When all the groups are removed from a VFIO container, there's a > call to container->iommu_data.release(container). This is the > event we really care about, not so much the fact that a device > got released. > > Right now all it does it remove the memory listener, but maybe it > makes sense to allow an additional callback/opaque to register for > the event. Not sure what the best way to do that is though... In this context I rather care about container's fd being closed so VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know that there is no more VFIO devices. > And, kind of a separate topic, but if we could do something > similar for the initial group attach, we could drop *all* the > plug/unplug hooks, and the hooks themselves could drop all > the !had_vfio / has_vfio logic/probing, since that would then > be clear from the context. Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they what HMP calls on "device_add"? -- Alexey