From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Michael Roth <mdroth@linux.vnet.ibm.com>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>,
qemu-ppc@nongnu.org, Gavin Shan <gwshan@linux.vnet.ibm.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Mon, 13 Jul 2015 17:06:55 +1000 [thread overview]
Message-ID: <55A3638F.1070504@ozlabs.ru> (raw)
In-Reply-To: <20150712144127.4709.81283@loki>
On 07/13/2015 12:41 AM, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-11 23:59:45)
>> On 07/11/2015 07:33 AM, Michael Roth wrote:
>>> Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
>>>> sPAPR IOMMU is managing two copies of an TCE table:
>>>> 1) a guest view of the table - this is what emulated devices use and
>>>> this is where H_GET_TCE reads from;
>>>> 2) a hardware TCE table - only present if there is at least one vfio-pci
>>>> device on a PHB; it is updated via a memory listener on a PHB address
>>>> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
>>>>
>>>> At the moment presence of vfio-pci devices on a bus affect the way
>>>> the guest view table is allocated. If there is no vfio-pci on a PHB
>>>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>>>> is allocated in KVM. However, if there is vfio-pci and we do yet not
>>>> support KVM acceleration for these, the table has to be allocated
>>>> by the userspace.
>>>>
>>>> When vfio-pci device is hotplugged and there were no vfio-pci devices
>>>> already, the guest view table could have been allocated by KVM which
>>>> means that H_PUT_TCE is handled by the host kernel and since we
>>>> do not support vfio-pci in KVM, the hardware table will not be updated.
>>>>
>>>> This reallocates the guest view table in QEMU if the first vfio-pci
>>>> device has just been plugged. spapr_tce_realloc_userspace() handles this.
>>>>
>>>> This replays all the mappings to make sure that the tables are in sync.
>>>> This will not have a visible effect though as for a new device
>>>> the guest kernel will allocate-and-map new addresses and therefore
>>>> existing mappings from emulated devices will not be used by vfio-pci
>>>> devices.
>>>>
>>>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
>>>> hooks.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v10:
>>>> * removed unnecessary memory_region_del_subregion() and
>>>> memory_region_add_subregion() as
>>>> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
>>>> notifiers in a more correct way
>>>>
>>>> v9:
>>>> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than
>>>> via object_child_foreach()
>>>> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
>>>> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
>>>> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
>>>> if we do hotplug+hotunplug+hotplug of the same device.
>>>> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
>>>> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
>>>> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
>>>> Actual VFIO PCI device release happens from rcu and since we add ours later,
>>>> it gets executed later and we are good.
>>>> ---
>>>> hw/ppc/spapr_iommu.c | 51 ++++++++++++++++++++++++++++++++++++++++++---
>>>> hw/ppc/spapr_pci.c | 47 +++++++++++++++++++++++++++++++++++++++++
>>>> include/hw/pci-host/spapr.h | 1 +
>>>> include/hw/ppc/spapr.h | 2 ++
>>>> trace-events | 2 ++
>>>> 5 files changed, 100 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>>>> index 45c00d8..2d99c3b 100644
>>>> --- a/hw/ppc/spapr_iommu.c
>>>> +++ b/hw/ppc/spapr_iommu.c
>>>> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>>>> uint32_t nb_table,
>>>> uint32_t page_shift,
>>>> int *fd,
>>>> - bool vfio_accel)
>>>> + bool vfio_accel,
>>>> + bool force_userspace)
>>>> {
>>>> uint64_t *table = NULL;
>>>> uint64_t window_size = (uint64_t)nb_table << page_shift;
>>>>
>>>> - if (kvm_enabled() && !(window_size >> 32)) {
>>>> + if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>>>> table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>>>> }
>>>>
>>>> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>>>> tcet->nb_table,
>>>> tcet->page_shift,
>>>> &tcet->fd,
>>>> - vfio_accel);
>>>> + vfio_accel,
>>>> + false);
>>>>
>>>> memory_region_set_size(&tcet->iommu,
>>>> (uint64_t)tcet->nb_table << tcet->page_shift);
>>>> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>>> return 0;
>>>> }
>>>>
>>>> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
>>>> +{
>>>> + target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
>>>> + long i, ret = 0;
>>>> +
>>>> + for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
>>>> + ret = put_tce_emu(tcet, ioba, table[i]);
>>>> + if (ret) {
>>>> + break;
>>>> + }
>>>> + }
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> +int spapr_tce_replay(sPAPRTCETable *tcet)
>>>> +{
>>>> + return spapr_tce_do_replay(tcet, tcet->table);
>>>> +}
>>>> +
>>>> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
>>>> +{
>>>> + int ret = 0, oldfd;
>>>> + uint64_t *oldtable;
>>>> +
>>>> + oldtable = tcet->table;
>>>> + oldfd = tcet->fd;
>>>> + tcet->table = spapr_tce_alloc_table(tcet->liobn,
>>>> + tcet->nb_table,
>>>> + tcet->page_shift,
>>>> + &tcet->fd,
>>>> + false,
>>>> + true); /* force_userspace */
>>>> +
>>>> + if (replay) {
>>>> + ret = spapr_tce_do_replay(tcet, oldtable);
>>>> + }
>>>> +
>>>> + spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>>> sPAPRTCETable *tcet)
>>>> {
>>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>>> index 76c988f..d1fa157 100644
>>>> --- a/hw/ppc/spapr_pci.c
>>>> +++ b/hw/ppc/spapr_pci.c
>>>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>>> return 0;
>>>> }
>>>>
>>>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>>>> +{
>>>> + int ret = 0, i;
>>>> + bool had_vfio = sphb->has_vfio;
>>>> + sPAPRTCETable *tcet;
>>>> +
>>>> + spapr_phb_dma_capabilities_update(sphb);
>>>
>>> So, in the unplug case, we update caps, but has_vfio = false so we don't do
>>> anything else below.
>>
>> Yes.
>>
>>
>>> Does that mean our KVM-accelerated TCE table won't get restored until reboot?
>>> Would it make sense to re-enable it here?
>>
>> No, it shold be reenabled as DMA config is completely reset during the
>> machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of
>> DMA config when resetting PHB"
>
> We don't get a PHB-level reset for PCI hotplug though, so it wouldn't
> get re-enabled till guest system reset.
Yes, this is what I said :)
> I'm not sure how big a deal that
> is performance-wise, but it seems a little unexpected.
True...
>>>> +
>>>> + if (!had_vfio && sphb->has_vfio) {
>>>> + for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>>>> + tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>>>> + if (!tcet || !tcet->enabled) {
>>>> + continue;
>>>> + }
>>>> + if (tcet->fd >= 0) {
>>>> + /*
>>>> + * We got first vfio-pci device on accelerated table.
>>>> + * VFIO acceleration is not possible.
>>>> + * Reallocate table in userspace and replay mappings.
>>>> + */
>>>> + ret = spapr_tce_realloc_userspace(tcet, true);
>>>> + trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
>>>> + } else {
>>>> + /* There was no acceleration, so just replay mappings. */
>>>> + ret = spapr_tce_replay(tcet);
>>>> + trace_spapr_pci_dma_update(tcet->liobn, ret);
>>>> + }
>>>> + if (ret) {
>>>> + break;
>>>> + }
>>>> + }
>>>> + return ret;
>>>> + }
>>>> +
>>>> + return 0;
>>>> +}
>>>> +
>>>> /* Macros to operate with address in OF binding to PCI */
>>>> #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p))
>>>> #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */
>>>> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>>> error_setg(errp, "Failed to create pci child device tree node");
>>>> goto out;
>>>> }
>>>> + spapr_phb_hotplug_dma_sync(phb);
>>>> }
>>>>
>>>> drck->attach(drc, DEVICE(pdev),
>>>> @@ -1116,6 +1154,12 @@ out:
>>>> }
>>>> }
>>>>
>>>> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
>>>> +{
>>>> + sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
>>>> + spapr_phb_hotplug_dma_sync(sphb);
>>>> +}
>>>> +
>>>> static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>>> {
>>>> /* some version guests do not wait for completion of a device
>>>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>>> */
>>>> pci_device_reset(PCI_DEVICE(dev));
>>>> object_unparent(OBJECT(dev));
>>>> +
>>>> + /* Actual VFIO device release happens from RCU so postpone DMA update */
>>>> + call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
>>>
>>> Hmm... can't think of any reason this wouldn't work, but would be nice
>>> if there was something a bit more straightforward...
>>>
>>> When the device is actually finalized, it does:
>>
>> The problem is with "when". I looked at gdb, this vfio_instance_finalize()
>> is called from an RCU handler because the last reference is dropped because
>> of some memory region was removed and this was postponed to RCU.
>>
>> If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same
>> stack, I would not need this call_rcu1.
>
> Right, object_unparent() has no guaruntee of immediately finalizing it,
> but you *do* have the guaruntee that
> vfio_instance_finalize()->vfio_put_group() will only be called once the
> device is actually finalized, regardless of whether or not it's kicked
> off by the RCU thread. So it seems more straightforward to hook into
> that rather than needing to employ internal knowledge of
> object_unparent().
Right... I'll try adding a hook and see what review I'll receive.
>>
>>> static void vfio_instance_finalize(Object *obj)
>>> {
>>> PCIDevice *pci_dev = PCI_DEVICE(obj);
>>> VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
>>> VFIOGroup *group = vdev->vbasedev.group;
>>>
>>> ...
>>>
>>> vfio_put_device(vdev);
>>> vfio_put_group(group);
>>> }
>>>
>>> When all the groups are removed from a VFIO container, there's a
>>> call to container->iommu_data.release(container). This is the
>>> event we really care about, not so much the fact that a device
>>> got released.
>>>
>>> Right now all it does it remove the memory listener, but maybe it
>>> makes sense to allow an additional callback/opaque to register for
>>> the event. Not sure what the best way to do that is though...
>>
>> In this context I rather care about container's fd being closed so
>> VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know
>> that there is no more VFIO devices.
>
> VFIO container getting closed also corresponds to the last group being
> removed though. Even if it didn't, I think VFIO_IOMMU_SPAPR_TCE_GET_INFO
> would fail unless at least one iommu group was attached to the
> container? So knowing when the first/last group is removed seems to
> be the real main event.
Right. This is still VFIO knowledge which PHB does not have access to. We
will need these new hooks.
>>> And, kind of a separate topic, but if we could do something
>>> similar for the initial group attach, we could drop *all* the
>>> plug/unplug hooks, and the hooks themselves could drop all
>>> the !had_vfio / has_vfio logic/probing, since that would then
>>> be clear from the context.
>>
>> Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they
>> what HMP calls on "device_add"?
>
> I mean all the places we call into code that ends up doing:
> spapr_phb_dma_capabilities_update(sphb);
> /* do something special if has_vfio changed */
>
> We currently have one in PCI plug, PCI unplug, and PHB reset. If we
> hooked into vfio_put_group(), we could drop each of those hooks. PHB
> reset would still have the special case of restoring default 32-bit
> window config, but it wouldn't need to care about has_vfio status
> anymore, all that code could be handled by
> vfio_put_group/vfio_get_group callbacks.
>
> I wouldn't hold up the series for it, but I think it would greatly
> simplify tracking has_vfio changes.
It is now 2 series and I can try what you suggest.
> But I do think spapr_phb_hotplug_dma_sync() should have some logic
> squashed in for re-enabling TCE acceleration on
> (had_vfio && !has_vfio).
--
Alexey
next prev parent reply other threads:[~2015-07-13 7:07 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-06 2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
2015-07-06 11:18 ` Paolo Bonzini
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2015-07-06 14:21 ` Thomas Huth
2015-07-06 2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
2015-07-06 16:41 ` Laurent Vivier
2015-07-07 0:28 ` Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2015-07-06 15:14 ` Thomas Huth
2015-07-06 15:43 ` Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2015-07-06 10:07 ` David Gibson
2015-07-06 17:04 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
2015-07-06 16:45 ` Laurent Vivier
2015-07-06 17:11 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
2015-07-06 19:15 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
2015-07-06 21:13 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
2015-07-06 10:27 ` David Gibson
2015-07-06 21:31 ` Thomas Huth
2015-07-07 9:28 ` Alexey Kardashevskiy
2015-07-10 21:33 ` Michael Roth
2015-07-12 4:59 ` Alexey Kardashevskiy
2015-07-12 14:41 ` Michael Roth
2015-07-13 1:10 ` David Gibson
2015-07-13 7:06 ` Alexey Kardashevskiy [this message]
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
2015-07-07 7:02 ` Thomas Huth
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
2015-07-06 10:33 ` David Gibson
2015-07-06 12:49 ` Alex Williamson
2015-07-06 12:59 ` Alexey Kardashevskiy
2015-07-06 13:45 ` Alex Williamson
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-06 13:42 ` Alex Williamson
2015-07-06 15:34 ` Alexey Kardashevskiy
2015-07-06 16:13 ` Alex Williamson
2015-07-07 0:29 ` David Gibson
2015-07-07 0:36 ` Alexey Kardashevskiy
2015-07-07 12:11 ` Alexey Kardashevskiy
2015-07-07 16:24 ` Alex Williamson
2015-07-08 6:26 ` Alexey Kardashevskiy
2015-07-08 14:51 ` Alex Williamson
2015-07-07 7:23 ` Thomas Huth
2015-07-07 10:05 ` Alexey Kardashevskiy
2015-07-07 10:21 ` Thomas Huth
2015-07-07 11:05 ` Alexey Kardashevskiy
2015-07-08 4:30 ` David Gibson
2015-07-08 6:24 ` Thomas Huth
2015-07-08 6:50 ` David Gibson
2015-07-08 7:07 ` Alexey Kardashevskiy
2015-07-08 14:47 ` Alex Williamson
2015-07-06 2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2015-07-06 11:06 ` David Gibson
2015-07-06 11:27 ` Alexey Kardashevskiy
2015-07-07 9:46 ` Alexey Kardashevskiy
2015-07-07 4:58 ` David Gibson
2015-07-07 9:33 ` Thomas Huth
2015-07-07 10:43 ` Alexey Kardashevskiy
2015-07-07 11:35 ` Thomas Huth
2015-07-07 11:53 ` Alexey Kardashevskiy
2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
2015-07-06 15:54 ` Thomas Huth
2015-07-06 16:07 ` Alexey Kardashevskiy
2015-07-06 16:13 ` Thomas Huth
2015-07-08 4:34 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55A3638F.1070504@ozlabs.ru \
--to=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=david@gibson.dropbear.id.au \
--cc=gwshan@linux.vnet.ibm.com \
--cc=mdroth@linux.vnet.ibm.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).