From: Michael Roth <mdroth@linux.vnet.ibm.com>
To: Alexey Kardashevskiy <aik@ozlabs.ru>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>,
qemu-ppc@nongnu.org, Gavin Shan <gwshan@linux.vnet.ibm.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Sun, 12 Jul 2015 09:41:27 -0500 [thread overview]
Message-ID: <20150712144127.4709.81283@loki> (raw)
In-Reply-To: <55A1F441.9050705@ozlabs.ru>
Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
> On 07/11/2015 07:33 AM, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
> >> sPAPR IOMMU is managing two copies of an TCE table:
> >> 1) a guest view of the table - this is what emulated devices use and
> >> this is where H_GET_TCE reads from;
> >> 2) a hardware TCE table - only present if there is at least one vfio-pci
> >> device on a PHB; it is updated via a memory listener on a PHB address
> >> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
> >>
> >> At the moment presence of vfio-pci devices on a bus affect the way
> >> the guest view table is allocated. If there is no vfio-pci on a PHB
> >> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
> >> is allocated in KVM. However, if there is vfio-pci and we do yet not
> >> support KVM acceleration for these, the table has to be allocated
> >> by the userspace.
> >>
> >> When vfio-pci device is hotplugged and there were no vfio-pci devices
> >> already, the guest view table could have been allocated by KVM which
> >> means that H_PUT_TCE is handled by the host kernel and since we
> >> do not support vfio-pci in KVM, the hardware table will not be updated.
> >>
> >> This reallocates the guest view table in QEMU if the first vfio-pci
> >> device has just been plugged. spapr_tce_realloc_userspace() handles this.
> >>
> >> This replays all the mappings to make sure that the tables are in sync.
> >> This will not have a visible effect though as for a new device
> >> the guest kernel will allocate-and-map new addresses and therefore
> >> existing mappings from emulated devices will not be used by vfio-pci
> >> devices.
> >>
> >> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
> >> hooks.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v10:
> >> * removed unnecessary memory_region_del_subregion() and
> >> memory_region_add_subregion() as
> >> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
> >> notifiers in a more correct way
> >>
> >> v9:
> >> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than
> >> via object_child_foreach()
> >> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
> >> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
> >> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
> >> if we do hotplug+hotunplug+hotplug of the same device.
> >> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
> >> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
> >> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
> >> Actual VFIO PCI device release happens from rcu and since we add ours later,
> >> it gets executed later and we are good.
> >> ---
> >> hw/ppc/spapr_iommu.c | 51 ++++++++++++++++++++++++++++++++++++++++++---
> >> hw/ppc/spapr_pci.c | 47 +++++++++++++++++++++++++++++++++++++++++
> >> include/hw/pci-host/spapr.h | 1 +
> >> include/hw/ppc/spapr.h | 2 ++
> >> trace-events | 2 ++
> >> 5 files changed, 100 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> >> index 45c00d8..2d99c3b 100644
> >> --- a/hw/ppc/spapr_iommu.c
> >> +++ b/hw/ppc/spapr_iommu.c
> >> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
> >> uint32_t nb_table,
> >> uint32_t page_shift,
> >> int *fd,
> >> - bool vfio_accel)
> >> + bool vfio_accel,
> >> + bool force_userspace)
> >> {
> >> uint64_t *table = NULL;
> >> uint64_t window_size = (uint64_t)nb_table << page_shift;
> >>
> >> - if (kvm_enabled() && !(window_size >> 32)) {
> >> + if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
> >> table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
> >> }
> >>
> >> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
> >> tcet->nb_table,
> >> tcet->page_shift,
> >> &tcet->fd,
> >> - vfio_accel);
> >> + vfio_accel,
> >> + false);
> >>
> >> memory_region_set_size(&tcet->iommu,
> >> (uint64_t)tcet->nb_table << tcet->page_shift);
> >> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> >> return 0;
> >> }
> >>
> >> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
> >> +{
> >> + target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
> >> + long i, ret = 0;
> >> +
> >> + for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
> >> + ret = put_tce_emu(tcet, ioba, table[i]);
> >> + if (ret) {
> >> + break;
> >> + }
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +int spapr_tce_replay(sPAPRTCETable *tcet)
> >> +{
> >> + return spapr_tce_do_replay(tcet, tcet->table);
> >> +}
> >> +
> >> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
> >> +{
> >> + int ret = 0, oldfd;
> >> + uint64_t *oldtable;
> >> +
> >> + oldtable = tcet->table;
> >> + oldfd = tcet->fd;
> >> + tcet->table = spapr_tce_alloc_table(tcet->liobn,
> >> + tcet->nb_table,
> >> + tcet->page_shift,
> >> + &tcet->fd,
> >> + false,
> >> + true); /* force_userspace */
> >> +
> >> + if (replay) {
> >> + ret = spapr_tce_do_replay(tcet, oldtable);
> >> + }
> >> +
> >> + spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
> >> sPAPRTCETable *tcet)
> >> {
> >> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> >> index 76c988f..d1fa157 100644
> >> --- a/hw/ppc/spapr_pci.c
> >> +++ b/hw/ppc/spapr_pci.c
> >> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
> >> return 0;
> >> }
> >>
> >> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
> >> +{
> >> + int ret = 0, i;
> >> + bool had_vfio = sphb->has_vfio;
> >> + sPAPRTCETable *tcet;
> >> +
> >> + spapr_phb_dma_capabilities_update(sphb);
> >
> > So, in the unplug case, we update caps, but has_vfio = false so we don't do
> > anything else below.
>
> Yes.
>
>
> > Does that mean our KVM-accelerated TCE table won't get restored until reboot?
> > Would it make sense to re-enable it here?
>
> No, it shold be reenabled as DMA config is completely reset during the
> machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
> DMA config when resetting PHB"
We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
get re-enabled till guest system reset. I'm not sure how big a deal that
is performance-wise, but it seems a little unexpected.
>
>
>
> >> +
> >> + if (!had_vfio && sphb->has_vfio) {
> >> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
> >> + tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
> >> + if (!tcet || !tcet->enabled) {
> >> + continue;
> >> + }
> >> + if (tcet->fd >= 0) {
> >> + /*
> >> + * We got first vfio-pci device on accelerated table.
> >> + * VFIO acceleration is not possible.
> >> + * Reallocate table in userspace and replay mappings.
> >> + */
> >> + ret = spapr_tce_realloc_userspace(tcet, true);
> >> + trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
> >> + } else {
> >> + /* There was no acceleration, so just replay mappings. */
> >> + ret = spapr_tce_replay(tcet);
> >> + trace_spapr_pci_dma_update(tcet->liobn, ret);
> >> + }
> >> + if (ret) {
> >> + break;
> >> + }
> >> + }
> >> + return ret;
> >> + }
> >> +
> >> + return 0;
> >> +}
> >> +
> >> /* Macros to operate with address in OF binding to PCI */
> >> #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p))
> >> #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */
> >> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
> >> error_setg(errp, "Failed to create pci child device tree node");
> >> goto out;
> >> }
> >> + spapr_phb_hotplug_dma_sync(phb);
> >> }
> >>
> >> drck->attach(drc, DEVICE(pdev),
> >> @@ -1116,6 +1154,12 @@ out:
> >> }
> >> }
> >>
> >> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
> >> +{
> >> + sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
> >> + spapr_phb_hotplug_dma_sync(sphb);
> >> +}
> >> +
> >> static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> >> {
> >> /* some version guests do not wait for completion of a device
> >> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
> >> */
> >> pci_device_reset(PCI_DEVICE(dev));
> >> object_unparent(OBJECT(dev));
> >> +
> >> + /* Actual VFIO device release happens from RCU so postpone DMA update */
> >> + call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
> >
> > Hmm... can't think of any reason this wouldn't work, but would be nice
> > if there was something a bit more straightforward...
> >
> > When the device is actually finalized, it does:
>
> The problem is with "when". I looked at gdb, this vfio_instance_finalize()
> is called from an RCU handler because the last reference is dropped because
> of some memory region was removed and this was postponed to RCU.
>
> If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same
> stack, I would not need this call_rcu1.
Right, object_unparent() has no guaruntee of immediately finalizing it,
but you *do* have the guaruntee that
vfio_instance_finalize()->vfio_put_group() will only be called once the
device is actually finalized, regardless of whether or not it's kicked
off by the RCU thread. So it seems more straightforward to hook into
that rather than needing to employ internal knowledge of
object_unparent().
>
> > static void vfio_instance_finalize(Object *obj)
> > {
> > PCIDevice *pci_dev = PCI_DEVICE(obj);
> > VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
> > VFIOGroup *group = vdev->vbasedev.group;
> >
> > ...
> >
> > vfio_put_device(vdev);
> > vfio_put_group(group);
> > }
> >
> > When all the groups are removed from a VFIO container, there's a
> > call to container->iommu_data.release(container). This is the
> > event we really care about, not so much the fact that a device
> > got released.
> >
> > Right now all it does it remove the memory listener, but maybe it
> > makes sense to allow an additional callback/opaque to register for
> > the event. Not sure what the best way to do that is though...
>
> In this context I rather care about container's fd being closed so
> VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know
> that there is no more VFIO devices.
VFIO container getting closed also corresponds to the last group being
removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
would fail unless at least one iommu group was attached to the
container? So knowing when the first/last group is removed seems to
be the real main event.
>
>
> > And, kind of a separate topic, but if we could do something
> > similar for the initial group attach, we could drop *all* the
> > plug/unplug hooks, and the hooks themselves could drop all
> > the !had_vfio / has_vfio logic/probing, since that would then
> > be clear from the context.
>
> Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they
> what HMP calls on "device_add"?
I mean all the places we call into code that ends up doing:
spapr_phb_dma_capabilities_update(sphb);
/* do something special if has_vfio changed */
We currently have one in PCI plug, PCI unplug, and PHB reset. If we
hooked into vfio_put_group(), we could drop each of those hooks. PHB
reset would still have the special case of restoring default 32-bit
window config, but it wouldn't need to care about has_vfio status
anymore, all that code could be handled by
vfio_put_group/vfio_get_group callbacks.
I wouldn't hold up the series for it, but I think it would greatly
simplify tracking has_vfio changes.
But I do think spapr_phb_hotplug_dma_sync() should have some logic
squashed in for re-enabling TCE acceleration on
(had_vfio && !has_vfio).
>
>
> --
> Alexey
>
next prev parent reply other threads:[~2015-07-12 14:41 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-06 2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
2015-07-06 11:18 ` Paolo Bonzini
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2015-07-06 14:21 ` Thomas Huth
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
2015-07-06 16:41 ` Laurent Vivier
2015-07-07 0:28 ` Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2015-07-06 15:14 ` Thomas Huth
2015-07-06 15:43 ` Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2015-07-06 10:07 ` David Gibson
2015-07-06 17:04 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
2015-07-06 16:45 ` Laurent Vivier
2015-07-06 17:11 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
2015-07-06 19:15 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
2015-07-06 21:13 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
2015-07-06 10:27 ` David Gibson
2015-07-06 21:31 ` Thomas Huth
2015-07-07 9:28 ` Alexey Kardashevskiy
2015-07-10 21:33 ` Michael Roth
2015-07-12 4:59 ` Alexey Kardashevskiy
2015-07-12 14:41 ` Michael Roth [this message]
2015-07-13 1:10 ` David Gibson
2015-07-13 7:06 ` Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
2015-07-07 7:02 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
2015-07-06 10:33 ` David Gibson
2015-07-06 12:49 ` Alex Williamson
2015-07-06 12:59 ` Alexey Kardashevskiy
2015-07-06 13:45 ` Alex Williamson
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-06 13:42 ` Alex Williamson
2015-07-06 15:34 ` Alexey Kardashevskiy
2015-07-06 16:13 ` Alex Williamson
2015-07-07 0:29 ` David Gibson
2015-07-07 0:36 ` Alexey Kardashevskiy
2015-07-07 12:11 ` Alexey Kardashevskiy
2015-07-07 16:24 ` Alex Williamson
2015-07-08 6:26 ` Alexey Kardashevskiy
2015-07-08 14:51 ` Alex Williamson
2015-07-07 7:23 ` Thomas Huth
2015-07-07 10:05 ` Alexey Kardashevskiy
2015-07-07 10:21 ` Thomas Huth
2015-07-07 11:05 ` Alexey Kardashevskiy
2015-07-08 4:30 ` David Gibson
2015-07-08 6:24 ` Thomas Huth
2015-07-08 6:50 ` David Gibson
2015-07-08 7:07 ` Alexey Kardashevskiy
2015-07-08 14:47 ` Alex Williamson
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2015-07-06 11:06 ` David Gibson
2015-07-06 11:27 ` Alexey Kardashevskiy
2015-07-07 9:46 ` Alexey Kardashevskiy
2015-07-07 4:58 ` David Gibson
2015-07-07 9:33 ` Thomas Huth
2015-07-07 10:43 ` Alexey Kardashevskiy
2015-07-07 11:35 ` Thomas Huth
2015-07-07 11:53 ` Alexey Kardashevskiy
2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
2015-07-06 15:54 ` Thomas Huth
2015-07-06 16:07 ` Alexey Kardashevskiy
2015-07-06 16:13 ` Thomas Huth
2015-07-08 4:34 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150712144127.4709.81283@loki \
--to=mdroth@linux.vnet.ibm.com \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=gwshan@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).