qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: Michael Roth <mdroth@linux.vnet.ibm.com>, qemu-devel@nongnu.org
Cc: Alex Williamson <alex.williamson@redhat.com>,
	qemu-ppc@nongnu.org, Gavin Shan <gwshan@linux.vnet.ibm.com>,
	David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug
Date: Sun, 12 Jul 2015 14:59:45 +1000	[thread overview]
Message-ID: <55A1F441.9050705@ozlabs.ru> (raw)
In-Reply-To: <20150710213301.22411.30046@loki>

On 07/11/2015 07:33 AM, Michael Roth wrote:
> Quoting Alexey Kardashevskiy (2015-07-05 21:11:06)
>> sPAPR IOMMU is managing two copies of an TCE table:
>> 1) a guest view of the table - this is what emulated devices use and
>> this is where H_GET_TCE reads from;
>> 2) a hardware TCE table - only present if there is at least one vfio-pci
>> device on a PHB; it is updated via a memory listener on a PHB address
>> space which forwards map/unmap requests to vfio-pci IOMMU host driver.
>>
>> At the moment presence of vfio-pci devices on a bus affect the way
>> the guest view table is allocated. If there is no vfio-pci on a PHB
>> and the host kernel supports KVM acceleration of H_PUT_TCE, a table
>> is allocated in KVM. However, if there is vfio-pci and we do yet not
>> support KVM acceleration for these, the table has to be allocated
>> by the userspace.
>>
>> When vfio-pci device is hotplugged and there were no vfio-pci devices
>> already, the guest view table could have been allocated by KVM which
>> means that H_PUT_TCE is handled by the host kernel and since we
>> do not support vfio-pci in KVM, the hardware table will not be updated.
>>
>> This reallocates the guest view table in QEMU if the first vfio-pci
>> device has just been plugged. spapr_tce_realloc_userspace() handles this.
>>
>> This replays all the mappings to make sure that the tables are in sync.
>> This will not have a visible effect though as for a new device
>> the guest kernel will allocate-and-map new addresses and therefore
>> existing mappings from emulated devices will not be used by vfio-pci
>> devices.
>>
>> This adds calls to spapr_phb_dma_capabilities_update() in PCI hotplug
>> hooks.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v10:
>> * removed unnecessary  memory_region_del_subregion() and
>> memory_region_add_subregion() as
>> "vfio: Unregister IOMMU notifiers when container is destroyed" removes
>> notifiers in a more correct way
>>
>> v9:
>> * spapr_phb_hotplug_dma_sync() enumerates TCE tables explicitely rather than
>> via object_child_foreach()
>> * spapr_phb_hotplug_dma_sync() does memory_region_del_subregion() +
>> memory_region_add_subregion() as otherwise vfio_listener_region_del() is not
>> called and we end up with vfio_iommu_map_notify registered twice (comments welcome!)
>> if we do hotplug+hotunplug+hotplug of the same device.
>> * moved spapr_phb_hotplug_dma_sync() on unplug event to rcu as before calling
>> spapr_phb_hotplug_dma_sync(), we need VFIO to release the container, otherwise
>> spapr_phb_dma_capabilities_update() will decide that the PHB still has VFIO device.
>> Actual VFIO PCI device release happens from rcu and since we add ours later,
>> it gets executed later and we are good.
>> ---
>>   hw/ppc/spapr_iommu.c        | 51 ++++++++++++++++++++++++++++++++++++++++++---
>>   hw/ppc/spapr_pci.c          | 47 +++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h |  1 +
>>   include/hw/ppc/spapr.h      |  2 ++
>>   trace-events                |  2 ++
>>   5 files changed, 100 insertions(+), 3 deletions(-)
>>
>> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
>> index 45c00d8..2d99c3b 100644
>> --- a/hw/ppc/spapr_iommu.c
>> +++ b/hw/ppc/spapr_iommu.c
>> @@ -78,12 +78,13 @@ static uint64_t *spapr_tce_alloc_table(uint32_t liobn,
>>                                          uint32_t nb_table,
>>                                          uint32_t page_shift,
>>                                          int *fd,
>> -                                       bool vfio_accel)
>> +                                       bool vfio_accel,
>> +                                       bool force_userspace)
>>   {
>>       uint64_t *table = NULL;
>>       uint64_t window_size = (uint64_t)nb_table << page_shift;
>>
>> -    if (kvm_enabled() && !(window_size >> 32)) {
>> +    if (kvm_enabled() && !force_userspace && !(window_size >> 32)) {
>>           table = kvmppc_create_spapr_tce(liobn, window_size, fd, vfio_accel);
>>       }
>>
>> @@ -222,7 +223,8 @@ static void spapr_tce_table_do_enable(sPAPRTCETable *tcet, bool vfio_accel)
>>                                           tcet->nb_table,
>>                                           tcet->page_shift,
>>                                           &tcet->fd,
>> -                                        vfio_accel);
>> +                                        vfio_accel,
>> +                                        false);
>>
>>       memory_region_set_size(&tcet->iommu,
>>                              (uint64_t)tcet->nb_table << tcet->page_shift);
>> @@ -495,6 +497,49 @@ int spapr_dma_dt(void *fdt, int node_off, const char *propname,
>>       return 0;
>>   }
>>
>> +static int spapr_tce_do_replay(sPAPRTCETable *tcet, uint64_t *table)
>> +{
>> +    target_ulong ioba = tcet->bus_offset, pgsz = (1ULL << tcet->page_shift);
>> +    long i, ret = 0;
>> +
>> +    for (i = 0; i < tcet->nb_table; ++i, ioba += pgsz) {
>> +        ret = put_tce_emu(tcet, ioba, table[i]);
>> +        if (ret) {
>> +            break;
>> +        }
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +int spapr_tce_replay(sPAPRTCETable *tcet)
>> +{
>> +    return spapr_tce_do_replay(tcet, tcet->table);
>> +}
>> +
>> +int spapr_tce_realloc_userspace(sPAPRTCETable *tcet, bool replay)
>> +{
>> +    int ret = 0, oldfd;
>> +    uint64_t *oldtable;
>> +
>> +    oldtable = tcet->table;
>> +    oldfd = tcet->fd;
>> +    tcet->table = spapr_tce_alloc_table(tcet->liobn,
>> +                                        tcet->nb_table,
>> +                                        tcet->page_shift,
>> +                                        &tcet->fd,
>> +                                        false,
>> +                                        true); /* force_userspace */
>> +
>> +    if (replay) {
>> +        ret = spapr_tce_do_replay(tcet, oldtable);
>> +    }
>> +
>> +    spapr_tce_free_table(oldtable, oldfd, tcet->nb_table);
>> +
>> +    return ret;
>> +}
>> +
>>   int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>                         sPAPRTCETable *tcet)
>>   {
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 76c988f..d1fa157 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>> @@ -827,6 +827,43 @@ int spapr_phb_dma_reset(sPAPRPHBState *sphb)
>>       return 0;
>>   }
>>
>> +static int spapr_phb_hotplug_dma_sync(sPAPRPHBState *sphb)
>> +{
>> +    int ret = 0, i;
>> +    bool had_vfio = sphb->has_vfio;
>> +    sPAPRTCETable *tcet;
>> +
>> +    spapr_phb_dma_capabilities_update(sphb);
>
> So, in the unplug case, we update caps, but has_vfio = false so we don't do
> anything else below.

Yes.


> Does that mean our KVM-accelerated TCE table won't get restored until reboot?
> Would it make sense to re-enable it here?

No, it shold be reenabled as DMA config is completely reset during the 
machine reset by "[PATCH qemu v10 08/14] spapr_pci: Do complete reset of 
DMA config when resetting PHB"



>> +
>> +    if (!had_vfio && sphb->has_vfio) {
>> +        for (i = 0; i < SPAPR_PCI_DMA_MAX_WINDOWS; ++i) {
>> +            tcet = spapr_tce_find_by_liobn(SPAPR_PCI_LIOBN(sphb->index, i));
>> +            if (!tcet || !tcet->enabled) {
>> +                continue;
>> +            }
>> +            if (tcet->fd >= 0) {
>> +                /*
>> +                 * We got first vfio-pci device on accelerated table.
>> +                 * VFIO acceleration is not possible.
>> +                 * Reallocate table in userspace and replay mappings.
>> +                 */
>> +                ret = spapr_tce_realloc_userspace(tcet, true);
>> +                trace_spapr_pci_dma_realloc_update(tcet->liobn, ret);
>> +            } else {
>> +                /* There was no acceleration, so just replay mappings. */
>> +                ret = spapr_tce_replay(tcet);
>> +                trace_spapr_pci_dma_update(tcet->liobn, ret);
>> +            }
>> +            if (ret) {
>> +                break;
>> +            }
>> +        }
>> +        return ret;
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   /* Macros to operate with address in OF binding to PCI */
>>   #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
>>   #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
>> @@ -1106,6 +1143,7 @@ static void spapr_phb_add_pci_device(sPAPRDRConnector *drc,
>>               error_setg(errp, "Failed to create pci child device tree node");
>>               goto out;
>>           }
>> +        spapr_phb_hotplug_dma_sync(phb);
>>       }
>>
>>       drck->attach(drc, DEVICE(pdev),
>> @@ -1116,6 +1154,12 @@ out:
>>       }
>>   }
>>
>> +static void spapr_phb_remove_sync_dma(struct rcu_head *head)
>> +{
>> +    sPAPRPHBState *sphb = container_of(head, sPAPRPHBState, rcu);
>> +    spapr_phb_hotplug_dma_sync(sphb);
>> +}
>> +
>>   static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>   {
>>       /* some version guests do not wait for completion of a device
>> @@ -1130,6 +1174,9 @@ static void spapr_phb_remove_pci_device_cb(DeviceState *dev, void *opaque)
>>        */
>>       pci_device_reset(PCI_DEVICE(dev));
>>       object_unparent(OBJECT(dev));
>> +
>> +    /* Actual VFIO device release happens from RCU so postpone DMA update */
>> +    call_rcu1(&((sPAPRPHBState *)opaque)->rcu, spapr_phb_remove_sync_dma);
>
> Hmm... can't think of any reason this wouldn't work, but would be nice
> if there was something a bit more straightforward...
>
> When the device is actually finalized, it does:

The problem is with "when". I looked at gdb, this vfio_instance_finalize() 
is called from an RCU handler because the last reference is dropped because 
of some memory region was removed and this was postponed to RCU.

If object_unparent(OBJECT(dev)) did call vfio_put_group() on the same 
stack, I would not need this call_rcu1.

>    static void vfio_instance_finalize(Object *obj)
>    {
>        PCIDevice *pci_dev = PCI_DEVICE(obj);
>        VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pci_dev);
>        VFIOGroup *group = vdev->vbasedev.group;
>
>        ...
>
>        vfio_put_device(vdev);
>        vfio_put_group(group);
>    }
>
> When all the groups are removed from a VFIO container, there's a
> call to container->iommu_data.release(container). This is the
> event we really care about, not so much the fact that a device
> got released.
>
> Right now all it does it remove the memory listener, but maybe it
> makes sense to allow an additional callback/opaque to register for
> the event. Not sure what the best way to do that is though...

In this context I rather care about container's fd being closed so 
VFIO_IOMMU_SPAPR_TCE_GET_INFO would fail in my dma-sync and this way I know 
that there is no more VFIO devices.


> And, kind of a separate topic, but if we could do something
> similar for the initial group attach,  we could drop *all* the
> plug/unplug hooks, and the hooks themselves could drop all
> the !had_vfio / has_vfio logic/probing, since that would then
> be clear from the context.

Drop all hooks? HotplugHandlerClass hooks? Can you do that? :) Are not they 
what HMP calls on "device_add"?


-- 
Alexey

  reply	other threads:[~2015-07-12  5:00 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-06  2:10 [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) Alexey Kardashevskiy
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 01/14] linux-headers: Update to 4.2-rc1 Alexey Kardashevskiy
2015-07-06 11:18   ` Paolo Bonzini
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 02/14] vmstate: Define VARRAY with VMS_ALLOC Alexey Kardashevskiy
2015-07-06 14:21   ` Thomas Huth
2015-07-06  2:10 ` [Qemu-devel] [PATCH qemu v10 03/14] spapr_pci: Convert finish_realize() to dma_capabilities_update()+dma_init_window() Alexey Kardashevskiy
2015-07-06 16:41   ` Laurent Vivier
2015-07-07  0:28     ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 04/14] spapr_iommu: Move table allocation to helpers Alexey Kardashevskiy
2015-07-06 15:14   ` Thomas Huth
2015-07-06 15:43     ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 05/14] spapr_iommu: Introduce "enabled" state for TCE table Alexey Kardashevskiy
2015-07-06 10:07   ` David Gibson
2015-07-06 17:04   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 06/14] spapr_iommu: Remove vfio_accel flag from sPAPRTCETable Alexey Kardashevskiy
2015-07-06 16:45   ` Laurent Vivier
2015-07-06 17:11   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 07/14] spapr_iommu: Add root memory region Alexey Kardashevskiy
2015-07-06 19:15   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 08/14] spapr_pci: Do complete reset of DMA config when resetting PHB Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 09/14] spapr_vfio_pci: Remove redundant spapr-pci-vfio-host-bridge Alexey Kardashevskiy
2015-07-06 21:13   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 10/14] spapr_pci: Enable vfio-pci hotplug Alexey Kardashevskiy
2015-07-06 10:27   ` David Gibson
2015-07-06 21:31   ` Thomas Huth
2015-07-07  9:28     ` Alexey Kardashevskiy
2015-07-10 21:33   ` Michael Roth
2015-07-12  4:59     ` Alexey Kardashevskiy [this message]
2015-07-12 14:41       ` Michael Roth
2015-07-13  1:10         ` David Gibson
2015-07-13  7:06         ` Alexey Kardashevskiy
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 11/14] spapr_pci_vfio: Enable multiple groups per container Alexey Kardashevskiy
2015-07-07  7:02   ` Thomas Huth
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 12/14] vfio: Unregister IOMMU notifiers when container is destroyed Alexey Kardashevskiy
2015-07-06 10:33   ` David Gibson
2015-07-06 12:49     ` Alex Williamson
2015-07-06 12:59       ` Alexey Kardashevskiy
2015-07-06 13:45         ` Alex Williamson
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 13/14] vfio: spapr: Add SPAPR IOMMU v2 support (DMA memory preregistering) Alexey Kardashevskiy
2015-07-06 13:42   ` Alex Williamson
2015-07-06 15:34     ` Alexey Kardashevskiy
2015-07-06 16:13       ` Alex Williamson
2015-07-07  0:29         ` David Gibson
2015-07-07  0:36           ` Alexey Kardashevskiy
2015-07-07 12:11         ` Alexey Kardashevskiy
2015-07-07 16:24           ` Alex Williamson
2015-07-08  6:26             ` Alexey Kardashevskiy
2015-07-08 14:51               ` Alex Williamson
2015-07-07  7:23   ` Thomas Huth
2015-07-07 10:05     ` Alexey Kardashevskiy
2015-07-07 10:21       ` Thomas Huth
2015-07-07 11:05         ` Alexey Kardashevskiy
2015-07-08  4:30           ` David Gibson
2015-07-08  6:24             ` Thomas Huth
2015-07-08  6:50               ` David Gibson
2015-07-08  7:07             ` Alexey Kardashevskiy
2015-07-08 14:47             ` Alex Williamson
2015-07-06  2:11 ` [Qemu-devel] [PATCH qemu v10 14/14] spapr_pci/spapr_pci_vfio: Support Dynamic DMA Windows (DDW) Alexey Kardashevskiy
2015-07-06 11:06   ` David Gibson
2015-07-06 11:27     ` Alexey Kardashevskiy
2015-07-07  9:46     ` Alexey Kardashevskiy
2015-07-07  4:58   ` David Gibson
2015-07-07  9:33   ` Thomas Huth
2015-07-07 10:43     ` Alexey Kardashevskiy
2015-07-07 11:35       ` Thomas Huth
2015-07-07 11:53         ` Alexey Kardashevskiy
2015-07-06 11:13 ` [Qemu-devel] [PATCH qemu v10 00/14] spapr: vfio: Enable Dynamic DMA windows (DDW) David Gibson
2015-07-06 15:54 ` Thomas Huth
2015-07-06 16:07   ` Alexey Kardashevskiy
2015-07-06 16:13     ` Thomas Huth
2015-07-08  4:34   ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55A1F441.9050705@ozlabs.ru \
    --to=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=david@gibson.dropbear.id.au \
    --cc=gwshan@linux.vnet.ibm.com \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-ppc@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).