* [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64
@ 2013-08-30 10:15 Alexey Kardashevskiy
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
Yet another try with VFIO on SPAPR (server PPC64).
Changes:
v3 -> v4:
* addressed all comments from Alex Williamson
* moved spapr-pci-phb-vfio-phb to new file
* split spapr-pci-phb-vfio to many smaller patches
The "spapr vfio: add vfio_container_spapr_get_info()" patch needs a kernel
headers update (v3.11-rc6).
The "spapr kvm vfio: enable in-kernel acceleration" patch needs a kernel
patch which is not upstream yet and is posted separately as
"[PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO".
More details are in the individual patches' commit messages.
Alexey Kardashevskiy (9):
spapr vfio: add vfio_container_spapr_get_info()
spapr_pci: convert init to realize
spapr_pci: add spapr_pci trace
spapr_pci: converts fprintf to error_report
spapr_iommu: introduce SPAPR_TCE_TABLE class
spapr_iommu: add SPAPR VFIO IOMMU
spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
spapr vfio: enable for spapr
spapr kvm vfio: enable in-kernel acceleration
David Gibson (3):
vfio: Introduce VFIO address spaces
vfio: Create VFIOAddressSpace objects as needed
vfio: Add guest side IOMMU support
hw/misc/vfio.c | 291 ++++++++++++++++++++++++++++++++++++++++----
hw/ppc/Makefile.objs | 2 +-
hw/ppc/spapr_iommu.c | 168 ++++++++++++++++++++++---
hw/ppc/spapr_pci.c | 90 ++++++++------
hw/ppc/spapr_pci_vfio.c | 198 ++++++++++++++++++++++++++++++
include/hw/misc/vfio.h | 11 ++
include/hw/pci-host/spapr.h | 31 ++++-
include/hw/ppc/spapr.h | 19 +++
target-ppc/kvm.c | 47 +++++++
target-ppc/kvm_ppc.h | 13 ++
trace-events | 1 +
11 files changed, 792 insertions(+), 79 deletions(-)
create mode 100644 hw/ppc/spapr_pci_vfio.c
create mode 100644 include/hw/misc/vfio.h
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 01/12] vfio: Introduce VFIO address spaces
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
From: David Gibson <david@gibson.dropbear.id.au>
The only model so far supported for VFIO passthrough devices is the one
usually used on x86, where all of the guest's RAM is mapped into the
(host) IOMMU and no IOMMU is visible in the guest.
This patch begins to relax this model, introducing the notion of a
VFIOAddressSpace. This represents a logical DMA address space which will
be visible to one or more VFIO devices by appropriate mapping in the (host)
IOMMU. Thus the currently global list of containers becomes local to
a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO
group to multiple address spaces.
For now, only one VFIOAddressSpace is created and used, corresponding to
main system memory; that will change in future patches.
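The ownership model this patch introduces — containers hang off a per-address-space list instead of a global one, so a group's container pins it to exactly one space — can be sketched with a few toy structures. This is a minimal, self-contained illustration; all names are invented here, not QEMU's:

```c
#include <assert.h>
#include <stddef.h>

struct container;

/* Stands in for VFIOAddressSpace: a per-space list of containers. */
struct address_space_ctx {
    struct container *containers;   /* head of per-space container list */
};

struct container {
    struct address_space_ctx *space;
    struct container *next;
};

struct group {
    struct container *container;    /* NULL until connected */
};

/* Attaching a group fails if it is already bound to a different space,
 * mirroring the check the patch adds to vfio_get_group(). */
static int group_attach(struct group *g, struct address_space_ctx *space,
                        struct container *c)
{
    if (g->container && g->container->space != space) {
        return -1;  /* group used in multiple address spaces */
    }
    c->space = space;
    c->next = space->containers;
    space->containers = c;
    g->container = c;
    return 0;
}
```

The real code additionally probes existing containers with VFIO_GROUP_SET_CONTAINER before creating a new one; the sketch only captures the one-space-per-group invariant.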
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* removed redundant checks and asserts
* fixed some return error codes
---
hw/misc/vfio.c | 52 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 37 insertions(+), 15 deletions(-)
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index dfffffa..93a316e 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -130,9 +130,17 @@ enum {
VFIO_INT_MSIX = 3,
};
+typedef struct VFIOAddressSpace {
+ AddressSpace *as;
+ QLIST_HEAD(, VFIOContainer) containers;
+} VFIOAddressSpace;
+
+static VFIOAddressSpace vfio_address_space_memory;
+
struct VFIOGroup;
typedef struct VFIOContainer {
+ VFIOAddressSpace *space;
int fd; /* /dev/vfio/vfio, empowered by the attached groups */
struct {
/* enable abstraction to support various iommu backends */
@@ -197,9 +205,6 @@ typedef struct VFIOGroup {
#define MSIX_CAP_LENGTH 12
-static QLIST_HEAD(, VFIOContainer)
- container_list = QLIST_HEAD_INITIALIZER(container_list);
-
static QLIST_HEAD(, VFIOGroup)
group_list = QLIST_HEAD_INITIALIZER(group_list);
@@ -2606,16 +2611,18 @@ static int vfio_load_rom(VFIODevice *vdev)
return 0;
}
-static int vfio_connect_container(VFIOGroup *group)
+static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
+{
+ space->as = as;
+ QLIST_INIT(&space->containers);
+}
+
+static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
{
VFIOContainer *container;
int ret, fd;
- if (group->container) {
- return 0;
- }
-
- QLIST_FOREACH(container, &container_list, next) {
+ QLIST_FOREACH(container, &space->containers, next) {
if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -2638,6 +2645,7 @@ static int vfio_connect_container(VFIOGroup *group)
}
container = g_malloc0(sizeof(*container));
+ container->space = space;
container->fd = fd;
if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
@@ -2660,7 +2668,8 @@ static int vfio_connect_container(VFIOGroup *group)
container->iommu_data.listener = vfio_memory_listener;
container->iommu_data.release = vfio_listener_release;
- memory_listener_register(&container->iommu_data.listener, &address_space_memory);
+ memory_listener_register(&container->iommu_data.listener,
+ container->space->as);
} else {
error_report("vfio: No available IOMMU models");
g_free(container);
@@ -2669,7 +2678,7 @@ static int vfio_connect_container(VFIOGroup *group)
}
QLIST_INIT(&container->group_list);
- QLIST_INSERT_HEAD(&container_list, container, next);
+ QLIST_INSERT_HEAD(&space->containers, container, next);
group->container = container;
QLIST_INSERT_HEAD(&container->group_list, group, container_next);
@@ -2700,7 +2709,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
}
}
-static VFIOGroup *vfio_get_group(int groupid)
+static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *space)
{
VFIOGroup *group;
char path[32];
@@ -2708,7 +2717,14 @@ static VFIOGroup *vfio_get_group(int groupid)
QLIST_FOREACH(group, &group_list, next) {
if (group->groupid == groupid) {
- return group;
+ /* Found it. Now is it already in the right context? */
+ if (group->container->space == space) {
+ return group;
+ } else {
+ error_report("vfio: group %d used in multiple address spaces",
+ group->groupid);
+ return NULL;
+ }
}
}
@@ -2741,7 +2757,7 @@ static VFIOGroup *vfio_get_group(int groupid)
group->groupid = groupid;
QLIST_INIT(&group->device_list);
- if (vfio_connect_container(group)) {
+ if (vfio_connect_container(group, space)) {
error_report("vfio: failed to setup container for group %d", groupid);
close(group->fd);
g_free(group);
@@ -3095,7 +3111,12 @@ static int vfio_initfn(PCIDevice *pdev)
DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
- group = vfio_get_group(groupid);
+ if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
+ error_report("vfio: DMA address space must be system memory");
+ return -EINVAL;
+ }
+
+ group = vfio_get_group(groupid, &vfio_address_space_memory);
if (!group) {
error_report("vfio: failed to get group %d", groupid);
return -ENOENT;
@@ -3318,6 +3339,7 @@ static const TypeInfo vfio_pci_dev_info = {
static void register_vfio_pci_dev_type(void)
{
+ vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
type_register_static(&vfio_pci_dev_info);
}
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
From: David Gibson <david@gibson.dropbear.id.au>
So far, VFIO has a notion of different logical DMA address spaces, but
only ever uses one (system memory). This patch extends that model, creating
new VFIOAddressSpace objects as necessary, according to the AddressSpace
reported by the PCI subsystem for this device's DMAs.
This isn't enough yet to support guest side IOMMUs with VFIO, but it does
mean we could now support VFIO devices on, for example, a guest side PCI
host bridge which maps system memory somewhere other than 0 in PCI
space.
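The get/put lifecycle added here (vfio_get_address_space()/vfio_put_address_space()) follows a common lookup-or-create pattern: an AddressSpace key maps to at most one context object, created on first use and freed only once its last container is gone. A minimal, self-contained sketch under invented names (not QEMU's API):

```c
#include <assert.h>
#include <stdlib.h>

struct vspace {
    const void *as_key;       /* stands in for the AddressSpace pointer */
    int ncontainers;          /* stands in for the containers list */
    struct vspace *next;
};

static struct vspace *spaces;

static struct vspace *vspace_get(const void *as_key)
{
    struct vspace *s;
    for (s = spaces; s; s = s->next) {
        if (s->as_key == as_key) {
            return s;               /* reuse the existing context */
        }
    }
    /* No suitable context: create a new one and link it in. */
    s = calloc(1, sizeof(*s));
    s->as_key = as_key;
    s->next = spaces;
    spaces = s;
    return s;
}

static void vspace_put(struct vspace *s)
{
    struct vspace **p;
    if (s->ncontainers) {
        return;                     /* still in use: no-op */
    }
    for (p = &spaces; *p; p = &(*p)->next) {
        if (*p == s) {
            *p = s->next;
            free(s);
            return;
        }
    }
}
```

Note that put is deliberately a no-op while containers remain, which is why vfio_disconnect_container() can call it unconditionally after freeing the last container.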
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/misc/vfio.c | 43 +++++++++++++++++++++++++++++++++++--------
1 file changed, 35 insertions(+), 8 deletions(-)
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 93a316e..c16f41b 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -133,9 +133,10 @@ enum {
typedef struct VFIOAddressSpace {
AddressSpace *as;
QLIST_HEAD(, VFIOContainer) containers;
+ QLIST_ENTRY(VFIOAddressSpace) list;
} VFIOAddressSpace;
-static VFIOAddressSpace vfio_address_space_memory;
+QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
struct VFIOGroup;
@@ -2611,10 +2612,34 @@ static int vfio_load_rom(VFIODevice *vdev)
return 0;
}
-static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
+static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
+ VFIOAddressSpace *space;
+
+ QLIST_FOREACH(space, &vfio_address_spaces, list) {
+ if (space->as == as) {
+ return space;
+ }
+ }
+
+ /* No suitable VFIOAddressSpace, create a new one */
+ space = g_malloc0(sizeof(*space));
space->as = as;
QLIST_INIT(&space->containers);
+
+ QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
+
+ return space;
+}
+
+static void vfio_put_address_space(VFIOAddressSpace *space)
+{
+ if (!QLIST_EMPTY(&space->containers)) {
+ return;
+ }
+
+ QLIST_REMOVE(space, list);
+ g_free(space);
}
static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
@@ -2699,6 +2724,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
group->container = NULL;
if (QLIST_EMPTY(&container->group_list)) {
+ VFIOAddressSpace *space = container->space;
+
if (container->iommu_data.release) {
container->iommu_data.release(container);
}
@@ -2706,6 +2733,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
DPRINTF("vfio_disconnect_container: close container->fd\n");
close(container->fd);
g_free(container);
+
+ vfio_put_address_space(space);
}
}
@@ -3076,6 +3105,7 @@ static int vfio_initfn(PCIDevice *pdev)
{
VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
VFIOGroup *group;
+ VFIOAddressSpace *space;
char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
ssize_t len;
struct stat st;
@@ -3111,14 +3141,12 @@ static int vfio_initfn(PCIDevice *pdev)
DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
- if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
- error_report("vfio: DMA address space must be system memory");
- return -EINVAL;
- }
+ space = vfio_get_address_space(pci_device_iommu_address_space(pdev));
- group = vfio_get_group(groupid, &vfio_address_space_memory);
+ group = vfio_get_group(groupid, space);
if (!group) {
error_report("vfio: failed to get group %d", groupid);
+ vfio_put_address_space(space);
return -ENOENT;
}
@@ -3339,7 +3367,6 @@ static const TypeInfo vfio_pci_dev_info = {
static void register_vfio_pci_dev_type(void)
{
- vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
type_register_static(&vfio_pci_dev_info);
}
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
From: David Gibson <david@gibson.dropbear.id.au>
This patch uses the new IOMMU notifiers to allow VFIO passthrough devices
to work with guest side IOMMUs, as long as the host-side VFIO IOMMU has
sufficient capability and granularity to match the guest side. This works
by tracking all map and unmap operations on the guest IOMMU using the
notifiers, and mirroring them into VFIO.
There are a number of FIXMEs, and the scheme involves rather more notifier
structures than I'd like, but it should make for a reasonable proof of
concept.
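The mirroring decision made by the notifier (see vfio_iommu_map_notify() in the diff) is simple: each guest IOMMU event carries an iova, a size mask and permissions; zero permissions mirror as an unmap, anything else as a map that is read-only unless the entry grants write access. A toy, self-contained version that records the decisions instead of issuing real VFIO_IOMMU_MAP/UNMAP_DMA ioctls (all names invented):

```c
#include <assert.h>
#include <stdint.h>

enum { PERM_NONE = 0, PERM_RO = 1, PERM_WO = 2, PERM_RW = 3 };

struct tlb_event {
    uint64_t iova;
    uint64_t addr_mask;   /* size - 1, as in IOMMUTLBEntry */
    int perm;
};

struct mirror_log {
    int maps, unmaps, last_readonly;
    uint64_t last_iova, last_size;
};

/* Mirror one guest IOMMU TLB event: map on any access, unmap on none. */
static void mirror_notify(struct mirror_log *log, const struct tlb_event *e)
{
    uint64_t size = e->addr_mask + 1;
    log->last_iova = e->iova;
    log->last_size = size;
    if (e->perm != PERM_NONE) {
        log->maps++;                              /* vfio_dma_map() */
        log->last_readonly = !(e->perm & PERM_WO);
    } else {
        log->unmaps++;                            /* vfio_dma_unmap() */
    }
}
```

The real notifier additionally translates the entry's target address through system memory and rejects non-RAM or misaligned results, which the sketch omits.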
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* fixed list objects naming
* vfio_listener_region_add() reworked to call memory_region_ref() from one
place only, it is also easier to review the changes
* fixed boundary check not to fail on sections of 2^64 bytes;
the "vfio: Fix debug output for int128 values" patch is required;
this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
for 128 bit memory section sizes" patch proposal
---
hw/misc/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 128 insertions(+), 9 deletions(-)
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index c16f41b..53791fb 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -150,10 +150,18 @@ typedef struct VFIOContainer {
};
void (*release)(struct VFIOContainer *);
} iommu_data;
+ QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
QLIST_HEAD(, VFIOGroup) group_list;
QLIST_ENTRY(VFIOContainer) next;
} VFIOContainer;
+typedef struct VFIOGuestIOMMU {
+ VFIOContainer *container;
+ MemoryRegion *iommu;
+ Notifier n;
+ QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
+} VFIOGuestIOMMU;
+
/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
typedef struct VFIOMSIXInfo {
uint8_t table_bar;
@@ -1917,7 +1925,63 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
static bool vfio_listener_skipped_section(MemoryRegionSection *section)
{
- return !memory_region_is_ram(section->mr);
+ return !memory_region_is_ram(section->mr) &&
+ !memory_region_is_iommu(section->mr);
+}
+
+static void vfio_iommu_map_notify(Notifier *n, void *data)
+{
+ VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
+ VFIOContainer *container = giommu->container;
+ IOMMUTLBEntry *iotlb = data;
+ MemoryRegion *mr;
+ hwaddr xlat;
+ hwaddr len = iotlb->addr_mask + 1;
+ void *vaddr;
+ int ret;
+
+ DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+ iotlb->iova, iotlb->iova + iotlb->addr_mask);
+
+ /*
+ * The IOMMU TLB entry we have just covers translation through
+ * this IOMMU to its immediate target. We need to translate
+ * it the rest of the way through to memory.
+ */
+ mr = address_space_translate(&address_space_memory,
+ iotlb->translated_addr,
+ &xlat, &len, iotlb->perm & IOMMU_WO);
+ if (!memory_region_is_ram(mr)) {
+ DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
+ xlat);
+ return;
+ }
+ if (len & iotlb->addr_mask) {
+ DPRINTF("iommu has granularity incompatible with target AS\n");
+ return;
+ }
+
+ vaddr = memory_region_get_ram_ptr(mr) + xlat;
+
+ if (iotlb->perm != IOMMU_NONE) {
+ ret = vfio_dma_map(container, iotlb->iova,
+ iotlb->addr_mask + 1, vaddr,
+ !(iotlb->perm & IOMMU_WO) || mr->readonly);
+ if (ret) {
+ error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx", %p) = %d (%m)",
+ container, iotlb->iova,
+ iotlb->addr_mask + 1, vaddr, ret);
+ }
+ } else {
+ ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
+ if (ret) {
+ error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
+ "0x%"HWADDR_PRIx") = %d (%m)",
+ container, iotlb->iova,
+ iotlb->addr_mask + 1, ret);
+ }
+ }
}
static void vfio_listener_region_add(MemoryListener *listener,
@@ -1926,11 +1990,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
VFIOContainer *container = container_of(listener, VFIOContainer,
iommu_data.listener);
hwaddr iova, end;
+ Int128 llend;
void *vaddr;
int ret;
- assert(!memory_region_is_iommu(section->mr));
-
if (vfio_listener_skipped_section(section)) {
DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
section->offset_within_address_space,
@@ -1946,21 +2009,57 @@ static void vfio_listener_region_add(MemoryListener *listener,
}
iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
+ llend = int128_make64(section->offset_within_address_space);
+ llend = int128_add(llend, section->size);
+ llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
+
+ if (int128_ge(int128_make64(iova), llend)) {
+ return;
+ }
+
+ memory_region_ref(section->mr);
+
+ if (memory_region_is_iommu(section->mr)) {
+ VFIOGuestIOMMU *giommu;
+
+ DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
+ iova, int128_get64(int128_sub(llend, int128_one())));
+ /*
+ * FIXME: We should do some checking to see if the
+ * capabilities of the host VFIO IOMMU are adequate to model
+ * the guest IOMMU
+ *
+ * FIXME: This assumes that the guest IOMMU is empty of
+ * mappings at this point - we should either enforce this, or
+ * loop through existing mappings to map them into VFIO.
+ *
+ * FIXME: For VFIO iommu types which have KVM acceleration to
+ * avoid bouncing all map/unmaps through qemu this way, this
+ * would be the right place to wire that up (tell the KVM
+ * device emulation the VFIO iommu handles to use).
+ */
+ giommu = g_malloc0(sizeof(*giommu));
+ giommu->iommu = section->mr;
+ giommu->container = container;
+ giommu->n.notify = vfio_iommu_map_notify;
+ QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
+ memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
+
+ return;
+ }
+
+ /* Here we assume that memory_region_is_ram(section->mr)==true */
+
end = (section->offset_within_address_space + int128_get64(section->size)) &
TARGET_PAGE_MASK;
- if (iova >= end) {
- return;
- }
-
vaddr = memory_region_get_ram_ptr(section->mr) +
section->offset_within_region +
(iova - section->offset_within_address_space);
- DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
+ DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
iova, end - 1, vaddr);
- memory_region_ref(section->mr);
ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
if (ret) {
error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
@@ -1991,6 +2090,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
return;
}
+ if (memory_region_is_iommu(section->mr)) {
+ VFIOGuestIOMMU *giommu;
+
+ QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
+ if (giommu->iommu == section->mr) {
+ memory_region_unregister_iommu_notifier(&giommu->n);
+ QLIST_REMOVE(giommu, giommu_next);
+ g_free(giommu);
+ break;
+ }
+ }
+
+ /*
+ * FIXME: We assume the one big unmap below is adequate to
+ * remove any individual page mappings in the IOMMU which
+ * might have been copied into VFIO. That may not be true for
+ * all IOMMU types
+ */
+ }
+
iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
end = (section->offset_within_address_space + int128_get64(section->size)) &
TARGET_PAGE_MASK;
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
As the sPAPR platform supports DMA windows on a PCI bus, information
about their location and size should be passed to the guest via
the device tree.
This patch adds a helper to read this information from the container fd.
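The new helper's error handling follows the classic goto-unwind style: each acquired resource (address space, group) gets a labelled release step, and a failure jumps to the right label so releases run in reverse acquisition order. A self-contained sketch of just that control flow, with counters standing in for the real put calls (names invented):

```c
#include <assert.h>

struct res_counters { int space_puts, group_puts; };

/* fail_at: 0 = success, 1 = group lookup fails, 2 = GET_INFO ioctl fails */
static int get_info_sketch(int fail_at, struct res_counters *c)
{
    /* "acquire" the address space (always succeeds in this sketch) */
    if (fail_at == 1) {
        goto put_as_exit;       /* group lookup failed */
    }
    /* "acquire" the group */
    if (fail_at == 2) {
        goto put_group_exit;    /* ioctl on the container fd failed */
    }
    return 0;

put_group_exit:
    c->group_puts++;            /* vfio_put_group() */
put_as_exit:
    c->space_puts++;            /* vfio_put_address_space() */
    return -1;
}
```

Falling through from put_group_exit into put_as_exit is what guarantees that releasing the group also releases the space, matching the v4 changelog note about fixing leaks on error paths.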
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* fixed possible leaks on error paths
---
hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
include/hw/misc/vfio.h | 11 +++++++++++
2 files changed, 56 insertions(+)
create mode 100644 include/hw/misc/vfio.h
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 53791fb..4210471 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -39,6 +39,7 @@
#include "qemu/range.h"
#include "sysemu/kvm.h"
#include "sysemu/sysemu.h"
+#include "hw/misc/vfio.h"
/* #define DEBUG_VFIO */
#ifdef DEBUG_VFIO
@@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
}
type_init(register_vfio_pci_dev_type)
+
+int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
+ struct vfio_iommu_spapr_tce_info *info,
+ int *group_fd)
+{
+ VFIOAddressSpace *space;
+ VFIOGroup *group;
+ VFIOContainer *container;
+ int ret, fd;
+
+ space = vfio_get_address_space(as);
+ if (!space) {
+ return -1;
+ }
+ group = vfio_get_group(groupid, space);
+ if (!group) {
+ goto put_as_exit;
+ }
+ container = group->container;
+ if (!group->container) {
+ goto put_group_exit;
+ }
+ fd = container->fd;
+ if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+ goto put_group_exit;
+ }
+ ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
+ if (ret) {
+ error_report("vfio: failed to get iommu info for container: %s",
+ strerror(errno));
+ goto put_group_exit;
+ }
+ *group_fd = group->fd;
+
+ return 0;
+
+put_group_exit:
+ vfio_put_group(group);
+
+put_as_exit:
+ vfio_put_address_space(space);
+
+ return -1;
+}
diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
new file mode 100644
index 0000000..ac9a971
--- /dev/null
+++ b/include/hw/misc/vfio.h
@@ -0,0 +1,11 @@
+#ifndef VFIO_API_H
+#define VFIO_API_H
+
+#include "qemu/typedefs.h"
+#include <linux/vfio.h>
+
+extern int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
+ struct vfio_iommu_spapr_tce_info *info,
+ int *group_fd);
+
+#endif
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 05/12] spapr_pci: convert init to realize
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
This converts the init() callback to realize(). The conversion includes:
1. a change of prototype;
2. conversion of fprintf() to error_setg();
3. introduction of a finish_realize() callback which at the moment
does the IOMMU setup.
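The realize()-style error flow being converted to works by filling an Error object through a pointer-to-pointer instead of returning -1, with a local error propagated from the class hook back to the caller. A trivial, self-contained stand-in (the Error type, setter, and struct names here are invented simplifications of QEMU's):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Error { char msg[64]; } Error;

/* Simplified stand-in for error_setg(): allocate and fill an Error. */
static void set_error(Error **errp, const char *msg)
{
    Error *err = calloc(1, sizeof(*err));
    strncpy(err->msg, msg, sizeof(err->msg) - 1);
    *errp = err;
}

struct phb_sketch {
    int index, buid;    /* -1 means "not specified", as in spapr_pci */
    void (*finish_realize)(struct phb_sketch *, Error **);
};

static void phb_realize_sketch(struct phb_sketch *s, Error **errp)
{
    Error *local_err = NULL;

    if (s->index != -1 && s->buid != -1) {
        set_error(errp, "either index or other parameters, not both");
        return;
    }
    if (!s->finish_realize) {
        set_error(errp, "finish_realize not defined");
        return;
    }
    s->finish_realize(s, &local_err);
    if (local_err) {
        *errp = local_err;          /* stands in for error_propagate() */
        return;
    }
}

static void finish_ok(struct phb_sketch *s, Error **errp)
{
    (void)s; (void)errp;            /* IOMMU setup would go here */
}
```

The point of the extra finish_realize() indirection is that the VFIO host bridge subclass added later in the series can override just the IOMMU setup step while reusing the rest of realize().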
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/ppc/spapr_pci.c | 82 +++++++++++++++++++++++++++------------------
include/hw/pci-host/spapr.h | 18 ++++++++--
2 files changed, 66 insertions(+), 34 deletions(-)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 9b6ee32..11bb737 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -32,6 +32,7 @@
#include "exec/address-spaces.h"
#include <libfdt.h>
#include "trace.h"
+#include "qemu/error-report.h"
#include "hw/pci/pci_bus.h"
@@ -483,15 +484,17 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
return &phb->iommu_as;
}
-static int spapr_phb_init(SysBusDevice *s)
+static void spapr_phb_realize(DeviceState *dev, Error **errp)
{
- DeviceState *dev = DEVICE(s);
+ SysBusDevice *s = SYS_BUS_DEVICE(dev);
sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
PCIHostState *phb = PCI_HOST_BRIDGE(s);
+ sPAPRPHBClass *info = SPAPR_PCI_HOST_BRIDGE_GET_CLASS(s);
const char *busname;
char *namebuf;
int i;
PCIBus *bus;
+ Error *error = NULL;
if (sphb->index != -1) {
hwaddr windows_base;
@@ -499,9 +502,9 @@ static int spapr_phb_init(SysBusDevice *s)
if ((sphb->buid != -1) || (sphb->dma_liobn != -1)
|| (sphb->mem_win_addr != -1)
|| (sphb->io_win_addr != -1)) {
- fprintf(stderr, "Either \"index\" or other parameters must"
- " be specified for PAPR PHB, not both\n");
- return -1;
+ error_setg(errp, "Either \"index\" or other parameters must"
+ " be specified for PAPR PHB, not both");
+ return;
}
sphb->buid = SPAPR_PCI_BASE_BUID + sphb->index;
@@ -514,28 +517,28 @@ static int spapr_phb_init(SysBusDevice *s)
}
if (sphb->buid == -1) {
- fprintf(stderr, "BUID not specified for PHB\n");
- return -1;
+ error_setg(errp, "BUID not specified for PHB");
+ return;
}
if (sphb->dma_liobn == -1) {
- fprintf(stderr, "LIOBN not specified for PHB\n");
- return -1;
+ error_setg(errp, "LIOBN not specified for PHB");
+ return;
}
if (sphb->mem_win_addr == -1) {
- fprintf(stderr, "Memory window address not specified for PHB\n");
- return -1;
+ error_setg(errp, "Memory window address not specified for PHB");
+ return;
}
if (sphb->io_win_addr == -1) {
- fprintf(stderr, "IO window address not specified for PHB\n");
- return -1;
+ error_setg(errp, "IO window address not specified for PHB");
+ return;
}
if (find_phb(spapr, sphb->buid)) {
- fprintf(stderr, "PCI host bridges must have unique BUIDs\n");
- return -1;
+ error_setg(errp, "PCI host bridges must have unique BUIDs");
+ return;
}
sphb->dtbusname = g_strdup_printf("pci@%" PRIx64, sphb->buid);
@@ -597,19 +600,6 @@ static int spapr_phb_init(SysBusDevice *s)
PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
phb->bus = bus;
- sphb->dma_window_start = 0;
- sphb->dma_window_size = 0x40000000;
- sphb->tcet = spapr_tce_new_table(dev, sphb->dma_liobn,
- sphb->dma_window_size);
- if (!sphb->tcet) {
- fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
- return -1;
- }
- address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
- sphb->dtbusname);
-
- pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
-
QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
/* Initialize the LSI table */
@@ -618,13 +608,39 @@ static int spapr_phb_init(SysBusDevice *s)
irq = spapr_allocate_lsi(0);
if (!irq) {
- return -1;
+ error_setg(errp, "spapr_allocate_lsi failed");
+ return;
}
sphb->lsi_table[i].irq = irq;
}
- return 0;
+ if (!info->finish_realize) {
+ error_setg(errp, "finish_realize not defined");
+ return;
+ }
+ info->finish_realize(sphb, &error);
+ if (error) {
+ error_propagate(errp, error);
+ return;
+ }
+
+ pci_setup_iommu(sphb->parent_obj.bus, spapr_pci_dma_iommu, sphb);
+}
+
+static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
+{
+ sphb->dma_window_start = 0;
+ sphb->dma_window_size = 0x40000000;
+ sphb->tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
+ sphb->dma_window_size);
+ if (!sphb->tcet) {
+ error_setg(errp, "Unable to create TCE table for %s",
+ sphb->dtbusname);
+ return ;
+ }
+ address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
+ sphb->dtbusname);
}
static void spapr_phb_reset(DeviceState *qdev)
@@ -707,14 +723,15 @@ static const char *spapr_phb_root_bus_path(PCIHostState *host_bridge,
static void spapr_phb_class_init(ObjectClass *klass, void *data)
{
PCIHostBridgeClass *hc = PCI_HOST_BRIDGE_CLASS(klass);
- SysBusDeviceClass *sdc = SYS_BUS_DEVICE_CLASS(klass);
DeviceClass *dc = DEVICE_CLASS(klass);
+ sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
hc->root_bus_path = spapr_phb_root_bus_path;
- sdc->init = spapr_phb_init;
+ dc->realize = spapr_phb_realize;
dc->props = spapr_phb_properties;
dc->reset = spapr_phb_reset;
dc->vmsd = &vmstate_spapr_pci;
+ spc->finish_realize = spapr_phb_finish_realize;
}
static const TypeInfo spapr_phb_info = {
@@ -722,6 +739,7 @@ static const TypeInfo spapr_phb_info = {
.parent = TYPE_PCI_HOST_BRIDGE,
.instance_size = sizeof(sPAPRPHBState),
.class_init = spapr_phb_class_init,
+ .class_size = sizeof(sPAPRPHBClass),
};
PCIHostState *spapr_create_phb(sPAPREnvironment *spapr, int index)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 970b4a9..0f428a1 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -34,7 +34,21 @@
#define SPAPR_PCI_HOST_BRIDGE(obj) \
OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
-typedef struct sPAPRPHBState {
+#define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
+ OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
+#define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
+ OBJECT_GET_CLASS(sPAPRPHBClass, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
+
+typedef struct sPAPRPHBClass sPAPRPHBClass;
+typedef struct sPAPRPHBState sPAPRPHBState;
+
+struct sPAPRPHBClass {
+ PCIHostBridgeClass parent_class;
+
+ void (*finish_realize)(sPAPRPHBState *sphb, Error **errp);
+};
+
+struct sPAPRPHBState {
PCIHostState parent_obj;
int32_t index;
@@ -62,7 +76,7 @@ typedef struct sPAPRPHBState {
} msi_table[SPAPR_MSIX_MAX_DEVS];
QLIST_ENTRY(sPAPRPHBState) list;
-} sPAPRPHBState;
+};
#define SPAPR_PCI_BASE_BUID 0x800000020000000ULL
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 06/12] spapr_pci: add spapr_pci trace
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
trace-events | 1 +
1 file changed, 1 insertion(+)
diff --git a/trace-events b/trace-events
index 8ccffdc..c67fcf1 100644
--- a/trace-events
+++ b/trace-events
@@ -1113,6 +1113,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
qxl_render_update_area_done(void *cookie) "%p"
# hw/ppc/spapr_pci.c
+spapr_pci(const char *msg1, const char *msg2) "%s%s"
spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 07/12] spapr_pci: converts fprintf to error_report
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (5 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 06/12] spapr_pci: add spapr_pci trace Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 08/12] spapr_iommu: introduce SPAPR_TCE_TABLE class Alexey Kardashevskiy
` (5 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/ppc/spapr_pci.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 11bb737..5129257 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -293,7 +293,7 @@ static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
ret_intr_type = RTAS_TYPE_MSIX;
break;
default:
- fprintf(stderr, "rtas_ibm_change_msi(%u) is not implemented\n", func);
+ error_report("rtas_ibm_change_msi(%u) is not implemented", func);
rtas_st(rets, 0, -3); /* Parameter error */
return;
}
@@ -327,7 +327,7 @@ static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
/* Find a device number in the map to add or reuse the existing one */
ndev = spapr_msicfg_find(phb, config_addr, true);
if (ndev >= SPAPR_MSIX_MAX_DEVS || ndev < 0) {
- fprintf(stderr, "No free entry for a new MSI device\n");
+ error_report("No free entry for a new MSI device");
rtas_st(rets, 0, -1); /* Hardware error */
return;
}
@@ -336,7 +336,7 @@ static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
/* Check if there is an old config and MSI number has not changed */
if (phb->msi_table[ndev].nvec && (req_num != phb->msi_table[ndev].nvec)) {
/* Unexpected behaviour */
- fprintf(stderr, "Cannot reuse MSI config for device#%d", ndev);
+ error_report("Cannot reuse MSI config for device#%d", ndev);
rtas_st(rets, 0, -1); /* Hardware error */
return;
}
@@ -346,7 +346,7 @@ static void rtas_ibm_change_msi(PowerPCCPU *cpu, sPAPREnvironment *spapr,
irq = spapr_allocate_irq_block(req_num, false,
ret_intr_type == RTAS_TYPE_MSI);
if (irq < 0) {
- fprintf(stderr, "Cannot allocate MSIs for device#%d", ndev);
+ error_report("Cannot allocate MSIs for device#%d", ndev);
rtas_st(rets, 0, -1); /* Hardware error */
return;
}
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 08/12] spapr_iommu: introduce SPAPR_TCE_TABLE class
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (6 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 07/12] spapr_pci: converts fprintf to error_report Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 09/12] spapr_iommu: add SPAPR VFIO IOMMU Alexey Kardashevskiy
` (4 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
This introduces a SPAPR TCE table device class with a put_tce() callback
and reworks the H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercall
handlers to make use of the new callback.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/ppc/spapr_iommu.c | 76 +++++++++++++++++++++++++++++++++++++++-----------
include/hw/ppc/spapr.h | 13 +++++++++
2 files changed, 73 insertions(+), 16 deletions(-)
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 022e268..0b235b3 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -235,17 +235,34 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu,
target_ulong npages = args[3];
target_ulong ret = H_PARAMETER;
sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+ sPAPRTCETableClass *info;
+
+ if (!tcet) {
+ return H_PARAMETER;
+ }
+
+ info = SPAPR_TCE_TABLE_GET_CLASS(tcet);
+ if (!info || !info->put_tce) {
+ return H_PARAMETER;
+ }
+
+ if ((tce_list & SPAPR_TCE_PAGE_MASK) || (npages > 512)) {
+ return H_PARAMETER;
+ }
ioba &= ~SPAPR_TCE_PAGE_MASK;
tce_list &= ~SPAPR_TCE_PAGE_MASK;
+ if (liobn & 0xFFFFFFFF00000000ULL) {
+ hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN "
+ TARGET_FMT_lx "\n", liobn);
+ return H_PARAMETER;
+ }
- if (tcet) {
- for (i = 0; i < npages; i++, ioba += SPAPR_TCE_PAGE_SIZE) {
- target_ulong tce = ldq_phys(tce_list + i * sizeof(target_ulong));
- ret = put_tce_emu(tcet, ioba, tce);
- if (ret) {
- break;
- }
+ for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
+ target_ulong tce = ldq_phys(tce_list + i * sizeof(target_ulong));
+ ret = info->put_tce(tcet, ioba, tce);
+ if (ret) {
+ break;
}
}
trace_spapr_iommu_indirect(liobn, ioba, tce_list, ret);
@@ -263,15 +280,29 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
target_ulong npages = args[3];
target_ulong ret = H_PARAMETER;
sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+ sPAPRTCETableClass *info;
+
+ if (!tcet) {
+ return H_PARAMETER;
+ }
+
+ info = SPAPR_TCE_TABLE_GET_CLASS(tcet);
+ if (!info || !info->put_tce) {
+ return H_PARAMETER;
+ }
+
+ if (liobn & 0xFFFFFFFF00000000ULL) {
+ hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN "
+ TARGET_FMT_lx "\n", liobn);
+ return H_PARAMETER;
+ }
ioba &= ~SPAPR_TCE_PAGE_MASK;
- if (tcet) {
- for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
- ret = put_tce_emu(tcet, ioba, tce_value);
- if (ret) {
- break;
- }
+ for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
+ ret = info->put_tce(tcet, ioba, tce_value);
+ if (ret) {
+ break;
}
}
trace_spapr_iommu_stuff(liobn, ioba, tce_value, ret);
@@ -287,12 +318,21 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
target_ulong tce = args[2];
target_ulong ret = H_PARAMETER;
sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+ sPAPRTCETableClass *info;
+
+ if (!tcet) {
+ return H_PARAMETER;
+ }
+
+ info = SPAPR_TCE_TABLE_GET_CLASS(tcet);
+ if (!info || !info->put_tce) {
+ return H_PARAMETER;
+ }
ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
- if (tcet) {
- ret = put_tce_emu(tcet, ioba, tce);
- }
+ ret = info->put_tce(tcet, ioba, tce);
+
trace_spapr_iommu_put(liobn, ioba, tce, ret);
return ret;
@@ -342,9 +382,12 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
static void spapr_tce_table_class_init(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
+ sPAPRTCETableClass *stc = SPAPR_TCE_TABLE_CLASS(klass);
+
dc->vmsd = &vmstate_spapr_tce_table;
dc->init = spapr_tce_table_realize;
dc->reset = spapr_tce_reset;
+ stc->put_tce = put_tce_emu;
QLIST_INIT(&spapr_tce_tables);
@@ -359,6 +402,7 @@ static TypeInfo spapr_tce_table_info = {
.parent = TYPE_DEVICE,
.instance_size = sizeof(sPAPRTCETable),
.class_init = spapr_tce_table_class_init,
+ .class_size = sizeof(sPAPRTCETableClass),
.instance_finalize = spapr_tce_table_finalize,
};
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ca175b0..a54de97 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -362,12 +362,25 @@ int spapr_rtas_device_tree_setup(void *fdt, hwaddr rtas_addr,
#define RTAS_ERROR_LOG_MAX 2048
+typedef struct sPAPRTCETableClass sPAPRTCETableClass;
typedef struct sPAPRTCETable sPAPRTCETable;
#define TYPE_SPAPR_TCE_TABLE "spapr-tce-table"
#define SPAPR_TCE_TABLE(obj) \
OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE)
+#define SPAPR_TCE_TABLE_CLASS(klass) \
+ OBJECT_CLASS_CHECK(sPAPRTCETableClass, (klass), TYPE_SPAPR_TCE_TABLE)
+#define SPAPR_TCE_TABLE_GET_CLASS(obj) \
+ OBJECT_GET_CLASS(sPAPRTCETableClass, (obj), TYPE_SPAPR_TCE_TABLE)
+
+struct sPAPRTCETableClass {
+ DeviceClass parent_class;
+
+ target_ulong (*put_tce)(sPAPRTCETable *tcet, target_ulong ioba,
+ target_ulong tce);
+};
+
struct sPAPRTCETable {
DeviceState parent;
uint32_t liobn;
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 09/12] spapr_iommu: add SPAPR VFIO IOMMU
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (7 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 08/12] spapr_iommu: introduce SPAPR_TCE_TABLE class Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 10/12] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
` (3 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
This adds a SPAPR VFIO IOMMU device in order to support DMA operations
for VFIO devices.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/ppc/spapr_iommu.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/hw/ppc/spapr.h | 6 ++++
2 files changed, 94 insertions(+)
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 0b235b3..d91dee5 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -224,6 +224,76 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
return H_SUCCESS;
}
+static IOMMUTLBEntry spapr_vfio_translate_iommu(MemoryRegion *iommu,
+ hwaddr addr)
+{
+ IOMMUTLBEntry entry;
+ /*
+ * This callback would normally be used by a QEMU device for DMA
+ * but in this case the vfio-pci device does not do any DMA.
+ * Instead, the real hardware does DMA and the hardware TCE table
+ * performs the address translation.
+ */
+ assert(0);
+ return entry;
+}
+
+static MemoryRegionIOMMUOps spapr_vfio_iommu_ops = {
+ .translate = spapr_vfio_translate_iommu,
+};
+
+static int spapr_tce_table_vfio_realize(DeviceState *dev)
+{
+ sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
+
+ memory_region_init_iommu(&tcet->iommu, NULL, &spapr_vfio_iommu_ops,
+ "iommu-vfio-spapr", UINT64_MAX);
+
+ QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
+
+ return 0;
+}
+
+sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn,
+ int group_fd)
+{
+ sPAPRTCETable *tcet;
+
+ if (spapr_tce_find_by_liobn(liobn)) {
+ fprintf(stderr, "Attempted to create TCE table with duplicate"
+ " LIOBN 0x%x\n", liobn);
+ return NULL;
+ }
+
+ tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE_VFIO));
+ tcet->liobn = liobn;
+ object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL);
+
+ object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
+
+ return tcet;
+}
+
+static target_ulong put_tce_vfio(sPAPRTCETable *tcet, target_ulong ioba,
+ target_ulong tce)
+{
+ IOMMUTLBEntry entry;
+
+ entry.iova = ioba & ~SPAPR_TCE_PAGE_MASK;
+ entry.translated_addr = tce & ~SPAPR_TCE_PAGE_MASK;
+ entry.addr_mask = SPAPR_TCE_PAGE_MASK;
+ entry.perm = 0;
+ if ((tce & SPAPR_TCE_RO) == SPAPR_TCE_RO) {
+ entry.perm |= IOMMU_RO;
+ }
+ if ((tce & SPAPR_TCE_WO) == SPAPR_TCE_WO) {
+ entry.perm |= IOMMU_WO;
+ }
+ memory_region_notify_iommu(&tcet->iommu, entry);
+
+ return H_SUCCESS;
+}
+
static target_ulong h_put_tce_indirect(PowerPCCPU *cpu,
sPAPREnvironment *spapr,
target_ulong opcode, target_ulong *args)
@@ -406,9 +476,27 @@ static TypeInfo spapr_tce_table_info = {
.instance_finalize = spapr_tce_table_finalize,
};
+static void spapr_tce_table_vfio_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ sPAPRTCETableClass *stc = SPAPR_TCE_TABLE_CLASS(klass);
+
+ dc->init = spapr_tce_table_vfio_realize;
+ stc->put_tce = put_tce_vfio;
+}
+
+static TypeInfo spapr_tce_table_vfio_info = {
+ .name = TYPE_SPAPR_TCE_TABLE_VFIO,
+ .parent = TYPE_SPAPR_TCE_TABLE,
+ .instance_size = sizeof(sPAPRTCETable),
+ .class_init = spapr_tce_table_vfio_class_init,
+ .class_size = sizeof(sPAPRTCETableClass),
+};
+
static void register_types(void)
{
type_register_static(&spapr_tce_table_info);
+ type_register_static(&spapr_tce_table_vfio_info);
}
type_init(register_types);
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a54de97..7054bdb 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -369,6 +369,10 @@ typedef struct sPAPRTCETable sPAPRTCETable;
#define SPAPR_TCE_TABLE(obj) \
OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE)
+#define TYPE_SPAPR_TCE_TABLE_VFIO "spapr-tce-table-vfio"
+#define SPAPR_TCE_TABLE_VFIO(obj) \
+ OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE_VFIO)
+
#define SPAPR_TCE_TABLE_CLASS(klass) \
OBJECT_CLASS_CHECK(sPAPRTCETableClass, (klass), TYPE_SPAPR_TCE_TABLE)
#define SPAPR_TCE_TABLE_GET_CLASS(obj) \
@@ -397,6 +401,8 @@ void spapr_events_init(sPAPREnvironment *spapr);
void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
size_t window_size);
+sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn,
+ int group_fd);
MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass);
int spapr_dma_dt(void *fdt, int node_off, const char *propname,
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 10/12] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (8 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 09/12] spapr_iommu: add SPAPR VFIO IOMMU Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr Alexey Kardashevskiy
` (2 subsequent siblings)
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
The patch adds a spapr-pci-vfio-host-bridge device type
which is a PCI Host Bridge with VFIO support. The new device
inherits from the spapr-pci-host-bridge device and adds
the following properties:
iommu - IOMMU group ID which represents a Partitionable
Endpoint; QEMU/ppc64 uses a separate PHB per
IOMMU group, so the guest kernel has to have
PCI domain support enabled.
forceaddr (optional, 0 by default) - forces QEMU to copy
device:function from the host address as
certain guest drivers expect devices to appear in
particular locations;
mf (optional, 0 by default) - forces multifunction bit for
the function #0 of a found device, only makes sense
for multifunction devices and only with the forceaddr
property set. It would not be required if there
were a way to know in advance whether a device is
multifunctional or not.
scan (optional, 1 by default) - if non-zero, the new PHB walks
through all non-bridge devices in the group and tries
adding them to the PHB; if zero, all devices in the group
have to be configured manually via the QEMU command line.
Examples of use:
1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6
2) Configure and add 3 functions of a multifunctional device to QEMU:
(the NEC PCI USB card is used as an example here):
-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* moved IOMMU changes to separate patches
* moved spapr-pci-vfio-host-bridge to new file
---
hw/ppc/Makefile.objs | 2 +-
hw/ppc/spapr_pci_vfio.c | 198 ++++++++++++++++++++++++++++++++++++++++++++
include/hw/pci-host/spapr.h | 13 +++
3 files changed, 212 insertions(+), 1 deletion(-)
create mode 100644 hw/ppc/spapr_pci_vfio.c
diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 7a1cd5d..19e2c7b 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -3,7 +3,7 @@ obj-y += ppc.o ppc_booke.o
# IBM pSeries (sPAPR)
obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
-obj-$(CONFIG_PSERIES) += spapr_pci.o
+obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_pci_vfio.o
# PowerPC 4xx boards
obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
new file mode 100644
index 0000000..1279ed8
--- /dev/null
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -0,0 +1,198 @@
+/*
+ * QEMU sPAPR PCI host for VFIO
+ *
+ * Copyright (c) 2011 Alexey Kardashevskiy, IBM Corporation.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#include <sys/types.h>
+#include <dirent.h>
+
+#include "hw/hw.h"
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "hw/misc/vfio.h"
+#include "hw/pci/pci_bus.h"
+#include "trace.h"
+#include "qemu/error-report.h"
+
+/* sPAPR VFIO */
+static Property spapr_phb_vfio_properties[] = {
+ DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+ DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
+ DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
+ DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb, Error **errp)
+{
+ PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
+ char *iommupath;
+ DIR *dirp;
+ struct dirent *entry;
+ Error *error = NULL;
+
+ if (!svphb->scan) {
+ trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
+ return;
+ }
+
+ iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
+ svphb->iommugroupid);
+ if (!iommupath) {
+ return;
+ }
+
+ dirp = opendir(iommupath);
+ if (!dirp) {
+ error_report("spapr-vfio: vfio scan failed on opendir: %m");
+ g_free(iommupath);
+ return;
+ }
+
+ while ((entry = readdir(dirp)) != NULL) {
+ Error *err = NULL;
+ char *tmp;
+ FILE *deviceclassfile;
+ unsigned deviceclass = 0, domainid, busid, devid, fnid;
+ char addr[32];
+ DeviceState *dev;
+
+ if (sscanf(entry->d_name, "%X:%X:%X.%x",
+ &domainid, &busid, &devid, &fnid) != 4) {
+ continue;
+ }
+
+ tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+ trace_spapr_pci("Reading device class from ", tmp);
+
+ deviceclassfile = fopen(tmp, "r");
+ if (deviceclassfile) {
+ int ret = fscanf(deviceclassfile, "%x", &deviceclass);
+ fclose(deviceclassfile);
+ if (ret != 1) {
+ continue;
+ }
+ }
+ g_free(tmp);
+
+ if (!deviceclass) {
+ continue;
+ }
+ if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
+ /* Skip bridges */
+ continue;
+ }
+ trace_spapr_pci("Creating device from ", entry->d_name);
+
+ dev = qdev_create(&phb->bus->qbus, "vfio-pci");
+ if (!dev) {
+ trace_spapr_pci("failed to create vfio-pci", entry->d_name);
+ continue;
+ }
+ qdev_prop_parse(dev, "host", entry->d_name, &err);
+ if (err != NULL) {
+ continue;
+ }
+ if (svphb->force_addr) {
+ snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+ err = NULL;
+ qdev_prop_parse(dev, "addr", addr, &err);
+ if (err != NULL) {
+ continue;
+ }
+ }
+ if (svphb->enable_multifunction) {
+ qdev_prop_set_bit(dev, "multifunction", 1);
+ }
+ object_property_set_bool(OBJECT(dev), true, "realized", &error);
+ if (error) {
+ error_propagate(errp, error);
+ break;
+ }
+ }
+ closedir(dirp);
+ g_free(iommupath);
+}
+
+static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+{
+ sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+ struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
+ int ret, group_fd;
+ Error *error = NULL;
+
+ if (svphb->iommugroupid == -1) {
+ error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
+ return;
+ }
+
+ ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
+ svphb->iommugroupid,
+ &info, &group_fd);
+ if (ret) {
+ error_setg_errno(errp, -ret,
+ "spapr-vfio: get info from container failed");
+ return;
+ }
+
+ svphb->phb.dma_window_start = info.dma32_window_start;
+ svphb->phb.dma_window_size = info.dma32_window_size;
+ svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn,
+ group_fd);
+
+ address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
+ sphb->dtbusname);
+
+ spapr_pci_vfio_scan(svphb, &error);
+ if (error) {
+ error_propagate(errp, error);
+ }
+}
+
+static void spapr_phb_vfio_reset(DeviceState *qdev)
+{
+ /* Do nothing */
+}
+
+static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
+
+ dc->props = spapr_phb_vfio_properties;
+ dc->reset = spapr_phb_vfio_reset;
+ spc->finish_realize = spapr_phb_vfio_finish_realize;
+}
+
+static const TypeInfo spapr_phb_vfio_info = {
+ .name = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
+ .parent = TYPE_SPAPR_PCI_HOST_BRIDGE,
+ .instance_size = sizeof(sPAPRPHBVFIOState),
+ .class_init = spapr_phb_vfio_class_init,
+ .class_size = sizeof(sPAPRPHBClass),
+};
+
+static void spapr_pci_vfio_register_types(void)
+{
+ type_register_static(&spapr_phb_vfio_info);
+}
+
+type_init(spapr_pci_vfio_register_types)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 0f428a1..18acb67 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -30,10 +30,14 @@
#define SPAPR_MSIX_MAX_DEVS 32
#define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
+#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
#define SPAPR_PCI_HOST_BRIDGE(obj) \
OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
+#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
+ OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
+
#define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
#define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
@@ -41,6 +45,7 @@
typedef struct sPAPRPHBClass sPAPRPHBClass;
typedef struct sPAPRPHBState sPAPRPHBState;
+typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
struct sPAPRPHBClass {
PCIHostBridgeClass parent_class;
@@ -78,6 +83,14 @@ struct sPAPRPHBState {
QLIST_ENTRY(sPAPRPHBState) list;
};
+struct sPAPRPHBVFIOState {
+ sPAPRPHBState phb;
+
+ struct VFIOContainer *container;
+ int32_t iommugroupid;
+ uint8_t scan, enable_multifunction, force_addr;
+};
+
#define SPAPR_PCI_BASE_BUID 0x800000020000000ULL
#define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (9 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 10/12] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-09-05 19:05 ` Alex Williamson
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 12/12] spapr kvm vfio: enable in-kernel acceleration Alexey Kardashevskiy
2013-09-05 6:43 ` [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
12 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
This turns sPAPR VFIO support on and enables VFIO container use
in the kernel.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v4:
* fixed format string to use %m which is a glibc extension:
"Print output of strerror(errno). No argument is required."
---
hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 4210471..882da70 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
memory_listener_register(&container->iommu_data.listener,
container->space->as);
+ } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+ ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+ if (ret) {
+ error_report("vfio: failed to set group container: %m");
+ g_free(container);
+ close(fd);
+ return -errno;
+ }
+
+ ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+ if (ret) {
+ error_report("vfio: failed to set iommu for container: %m");
+ g_free(container);
+ close(fd);
+ return -errno;
+ }
+
+ ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+ if (ret) {
+ error_report("vfio: failed to enable container: %m");
+ g_free(container);
+ close(fd);
+ return -errno;
+ }
+
+ container->iommu_data.listener = vfio_memory_listener;
+ container->iommu_data.release = vfio_listener_release;
+
+ memory_listener_register(&container->iommu_data.listener,
+ container->space->as);
} else {
error_report("vfio: No available IOMMU models");
g_free(container);
--
1.8.4.rc4
* [Qemu-devel] [PATCH v4 12/12] spapr kvm vfio: enable in-kernel acceleration
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (10 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr Alexey Kardashevskiy
@ 2013-08-30 10:15 ` Alexey Kardashevskiy
2013-09-05 6:43 ` [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 10:15 UTC (permalink / raw)
To: qemu-devel
Cc: Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc,
David Gibson
This enables in-kernel support for DMA operations on VFIO PHBs.
This creates a "SPAPR TCE IOMMU" KVM device and initializes it
with a LIOBN and a VFIO IOMMU group fd. Once enabled, it keeps the
H_PUT_TCE, H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls in the
kernel so their QEMU handlers won't be called.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
hw/ppc/spapr_iommu.c | 4 ++++
target-ppc/kvm.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
target-ppc/kvm_ppc.h | 13 +++++++++++++
3 files changed, 64 insertions(+)
diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index d91dee5..32255fc 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -258,6 +258,7 @@ sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn,
int group_fd)
{
sPAPRTCETable *tcet;
+ int fd;
if (spapr_tce_find_by_liobn(liobn)) {
fprintf(stderr, "Attempted to create TCE table with duplicate"
@@ -265,8 +266,11 @@ sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn,
return NULL;
}
+ fd = kvmppc_create_spapr_tce_iommu(liobn, group_fd);
+
tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE_VFIO));
tcet->liobn = liobn;
+ tcet->fd = fd;
object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL);
object_property_set_bool(OBJECT(tcet), true, "realized", NULL);
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 5e72c22..ec7556f 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -60,6 +60,7 @@ static int cap_booke_sregs;
static int cap_ppc_smt;
static int cap_ppc_rma;
static int cap_spapr_tce;
+static int cap_spapr_tce_iommu;
static int cap_hior;
static int cap_one_reg;
static int cap_epr;
@@ -96,6 +97,7 @@ int kvm_arch_init(KVMState *s)
cap_ppc_smt = kvm_check_extension(s, KVM_CAP_PPC_SMT);
cap_ppc_rma = kvm_check_extension(s, KVM_CAP_PPC_RMA);
cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE);
+ cap_spapr_tce_iommu = kvm_check_extension(s, KVM_CAP_SPAPR_TCE_IOMMU);
cap_one_reg = kvm_check_extension(s, KVM_CAP_ONE_REG);
cap_hior = kvm_check_extension(s, KVM_CAP_PPC_HIOR);
cap_epr = kvm_check_extension(s, KVM_CAP_PPC_EPR);
@@ -1662,6 +1664,51 @@ int kvmppc_remove_spapr_tce(void *table, int fd, uint32_t window_size)
return 0;
}
+int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd)
+{
+ int rc;
+ struct kvm_create_spapr_tce_iommu_linkage args = {
+ .liobn = liobn,
+ .fd = group_fd
+ };
+ struct kvm_device_attr attr = {
+ .flags = 0,
+ .group = KVM_DEV_SPAPR_TCE_IOMMU_ATTR_LINKAGE,
+ .addr = (uint64_t)(uintptr_t)&args,
+ };
+ struct kvm_create_device kcd = {
+ .type = KVM_DEV_TYPE_SPAPR_TCE_IOMMU,
+ .flags = 0,
+ };
+
+ if (!kvm_enabled() || !cap_spapr_tce_iommu) {
+ fprintf(stderr, "KVM VFIO: TCE IOMMU capability is not present, DMA may be slow\n");
+ return -1;
+ }
+
+ rc = kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &kcd);
+ if (rc < 0) {
+ fprintf(stderr, "Error on KVM_CREATE_DEVICE for SPAPR TCE IOMMU\n");
+ return rc;
+ }
+ rc = ioctl(kcd.fd, KVM_SET_DEVICE_ATTR, &attr);
+ if (rc < 0) {
+ fprintf(stderr, "KVM VFIO: Failed to create TCE table for liobn 0x%x, ret = %d, DMA may be slow\n",
+ liobn, rc);
+ }
+
+ return kcd.fd;
+}
+
+int kvmppc_remove_spapr_tce_iommu(int fd)
+{
+ if (fd < 0) {
+ return -1;
+ }
+
+ return close(fd);
+}
+
int kvmppc_reset_htab(int shift_hint)
{
uint32_t shift = shift_hint;
diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index 5f78e4b..35d3325 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -33,6 +33,8 @@ int kvmppc_booke_watchdog_enable(PowerPCCPU *cpu);
off_t kvmppc_alloc_rma(const char *name, MemoryRegion *sysmem);
void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t window_size, int *pfd);
int kvmppc_remove_spapr_tce(void *table, int pfd, uint32_t window_size);
+int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd);
+int kvmppc_remove_spapr_tce_iommu(int fd);
int kvmppc_reset_htab(int shift_hint);
uint64_t kvmppc_rma_size(uint64_t current_size, unsigned int hash_shift);
#endif /* !CONFIG_USER_ONLY */
@@ -137,6 +139,17 @@ static inline int kvmppc_remove_spapr_tce(void *table, int pfd,
return -1;
}
+static inline int kvmppc_create_spapr_tce_iommu(uint32_t liobn,
+ int group_fd)
+{
+ return -1;
+}
+
+static inline int kvmppc_remove_spapr_tce_iommu(int fd)
+{
+ return -1;
+}
+
static inline int kvmppc_reset_htab(int shift_hint)
{
return -1;
--
1.8.4.rc4
* Re: [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
` (11 preceding siblings ...)
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 12/12] spapr kvm vfio: enable in-kernel acceleration Alexey Kardashevskiy
@ 2013-09-05 6:43 ` Alexey Kardashevskiy
12 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-05 6:43 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc,
David Gibson
On 08/30/2013 08:15 PM, Alexey Kardashevskiy wrote:
> Yet another try with VFIO on SPAPR (server PPC64).
Alex (Williamson), could you please review this series and add your
comment/sob/rb/ab (whichever suits)? Since the kernel headers update
is on its way upstream, we could try to include this series (except the
very last patch) in upstream QEMU as well. Thanks!
>
> Changes:
> v3 -> v4:
> * addressed all comments from Alex Williamson
> * moved spapr-pci-phb-vfio-phb to new file
> * split spapr-pci-phb-vfio to many smaller patches
>
> The "spapr vfio: add vfio_container_spapr_get_info()" needs kernel
> headers update (v3.11-rc6);
> The "spapr kvm vfio: enable in-kernel acceleration" needs a kernel
> patch which is not in the kernel yet and posted separately as
> "[PATCH v9 00/13] KVM: PPC: IOMMU in-kernel handling of VFIO".
>
> More details are in the individual patches' commit messages.
>
> Alexey Kardashevskiy (9):
> spapr vfio: add vfio_container_spapr_get_info()
> spapr_pci: convert init to realize
> spapr_pci: add spapr_pci trace
> spapr_pci: converts fprintf to error_report
> spapr_iommu: introduce SPAPR_TCE_TABLE class
> spapr_iommu: add SPAPR VFIO IOMMU
> spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
> spapr vfio: enable for spapr
> spapr kvm vfio: enable in-kernel acceleration
>
> David Gibson (3):
> vfio: Introduce VFIO address spaces
> vfio: Create VFIOAddressSpace objects as needed
> vfio: Add guest side IOMMU support
>
> hw/misc/vfio.c | 291 ++++++++++++++++++++++++++++++++++++++++----
> hw/ppc/Makefile.objs | 2 +-
> hw/ppc/spapr_iommu.c | 168 ++++++++++++++++++++++---
> hw/ppc/spapr_pci.c | 90 ++++++++------
> hw/ppc/spapr_pci_vfio.c | 198 ++++++++++++++++++++++++++++++
> include/hw/misc/vfio.h | 11 ++
> include/hw/pci-host/spapr.h | 31 ++++-
> include/hw/ppc/spapr.h | 19 +++
> target-ppc/kvm.c | 47 +++++++
> target-ppc/kvm_ppc.h | 13 ++
> trace-events | 1 +
> 11 files changed, 792 insertions(+), 79 deletions(-)
> create mode 100644 hw/ppc/spapr_pci_vfio.c
> create mode 100644 include/hw/misc/vfio.h
>
--
Alexey
* Re: [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
@ 2013-09-05 18:24 ` Alex Williamson
2013-09-10 8:09 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-05 18:24 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> So far, VFIO has a notion of different logical DMA address spaces, but
> only ever uses one (system memory). This patch extends this, creating
> new VFIOAddressSpace objects as necessary, according to the AddressSpace
> reported by the PCI subsystem for this device's DMAs.
>
> This isn't enough yet to support guest side IOMMUs with VFIO, but it does
> mean we could now support VFIO devices on, for example, a guest side PCI
> host bridge which maps system memory at somewhere other than 0 in PCI
> space.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> hw/misc/vfio.c | 43 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 93a316e..c16f41b 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -133,9 +133,10 @@ enum {
> typedef struct VFIOAddressSpace {
> AddressSpace *as;
> QLIST_HEAD(, VFIOContainer) containers;
> + QLIST_ENTRY(VFIOAddressSpace) list;
> } VFIOAddressSpace;
>
> -static VFIOAddressSpace vfio_address_space_memory;
> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
>
> struct VFIOGroup;
>
> @@ -2611,10 +2612,34 @@ static int vfio_load_rom(VFIODevice *vdev)
> return 0;
> }
>
> -static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> {
> + VFIOAddressSpace *space;
> +
> + QLIST_FOREACH(space, &vfio_address_spaces, list) {
> + if (space->as == as) {
> + return space;
> + }
> + }
> +
> + /* No suitable VFIOAddressSpace, create a new one */
> + space = g_malloc0(sizeof(*space));
> space->as = as;
> QLIST_INIT(&space->containers);
> +
> + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
> +
> + return space;
> +}
> +
> +static void vfio_put_address_space(VFIOAddressSpace *space)
> +{
> + if (!QLIST_EMPTY(&space->containers)) {
> + return;
> + }
> +
> + QLIST_REMOVE(space, list);
> + g_free(space);
> }
>
> static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
> @@ -2699,6 +2724,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
> group->container = NULL;
>
> if (QLIST_EMPTY(&container->group_list)) {
> + VFIOAddressSpace *space = container->space;
> +
> if (container->iommu_data.release) {
> container->iommu_data.release(container);
> }
> @@ -2706,6 +2733,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
> DPRINTF("vfio_disconnect_container: close container->fd\n");
> close(container->fd);
> g_free(container);
> +
> + vfio_put_address_space(space);
> }
> }
>
> @@ -3076,6 +3105,7 @@ static int vfio_initfn(PCIDevice *pdev)
> {
> VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> VFIOGroup *group;
> + VFIOAddressSpace *space;
> char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
> ssize_t len;
> struct stat st;
> @@ -3111,14 +3141,12 @@ static int vfio_initfn(PCIDevice *pdev)
> DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
> vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>
> - if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
> - error_report("vfio: DMA address space must be system memory");
> - return -EINVAL;
> - }
> + space = vfio_get_address_space(pci_device_iommu_address_space(pdev));
>
> - group = vfio_get_group(groupid, &vfio_address_space_memory);
> + group = vfio_get_group(groupid, space);
> if (!group) {
> error_report("vfio: failed to get group %d", groupid);
> + vfio_put_address_space(space);
> return -ENOENT;
> }
>
Kind of a code flow issue here, on teardown we have:
vfio_put_group
vfio_disconnect_container
vfio_put_address_space
On setup we do:
vfio_get_address_space
vfio_get_group
vfio_connect_container
We could easily move vfio_get_address_space into vfio_get_group to make
things a little more balanced. It doesn't seem like too much additional
work to pass the address space through vfio_get_group into
vfio_connect_container so that we could have a completely symmetric flow,
though.
> @@ -3339,7 +3367,6 @@ static const TypeInfo vfio_pci_dev_info = {
>
> static void register_vfio_pci_dev_type(void)
> {
> - vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
> type_register_static(&vfio_pci_dev_info);
> }
>
* Re: [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support Alexey Kardashevskiy
@ 2013-09-05 18:49 ` Alex Williamson
2013-09-10 8:22 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-05 18:49 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> From: David Gibson <david@gibson.dropbear.id.au>
>
> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> sufficient capability and granularity to match the guest side. This works
> by tracking all map and unmap operations on the guest IOMMU using the
> notifiers, and mirroring them into VFIO.
>
> There are a number of FIXMEs, and the scheme involves rather more notifier
> structures than I'd like, but it should make for a reasonable proof of
> concept.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> ---
> Changes:
> v4:
> * fixed list objects naming
> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> place only, it is also easier to review the changes
> * fixes boundary check not to fail on sections == 2^64 bytes,
> the "vfio: Fix debug output for int128 values" patch is required;
> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> for 128 bit memory section sizes" patch proposal
> ---
> hw/misc/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 128 insertions(+), 9 deletions(-)
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index c16f41b..53791fb 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -150,10 +150,18 @@ typedef struct VFIOContainer {
> };
> void (*release)(struct VFIOContainer *);
> } iommu_data;
> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> QLIST_HEAD(, VFIOGroup) group_list;
> QLIST_ENTRY(VFIOContainer) next;
> } VFIOContainer;
>
> +typedef struct VFIOGuestIOMMU {
> + VFIOContainer *container;
> + MemoryRegion *iommu;
> + Notifier n;
> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> +} VFIOGuestIOMMU;
> +
> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> typedef struct VFIOMSIXInfo {
> uint8_t table_bar;
> @@ -1917,7 +1925,63 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>
> static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> {
> - return !memory_region_is_ram(section->mr);
> + return !memory_region_is_ram(section->mr) &&
> + !memory_region_is_iommu(section->mr);
> +}
> +
> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> +{
> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> + VFIOContainer *container = giommu->container;
> + IOMMUTLBEntry *iotlb = data;
> + MemoryRegion *mr;
> + hwaddr xlat;
> + hwaddr len = iotlb->addr_mask + 1;
> + void *vaddr;
> + int ret;
> +
> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> + iotlb->iova, iotlb->iova + iotlb->addr_mask);
> +
> + /*
> + * The IOMMU TLB entry we have just covers translation through
> + * this IOMMU to its immediate target. We need to translate
> + * it the rest of the way through to memory.
> + */
> + mr = address_space_translate(&address_space_memory,
> + iotlb->translated_addr,
> + &xlat, &len, iotlb->perm & IOMMU_WO);
> + if (!memory_region_is_ram(mr)) {
> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> + xlat);
> + return;
> + }
> + if (len & iotlb->addr_mask) {
> + DPRINTF("iommu has granularity incompatible with target AS\n");
> + return;
> + }
> +
> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> +
> + if (iotlb->perm != IOMMU_NONE) {
> + ret = vfio_dma_map(container, iotlb->iova,
> + iotlb->addr_mask + 1, vaddr,
> + !(iotlb->perm & IOMMU_WO) || mr->readonly);
> + if (ret) {
> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> + "0x%"HWADDR_PRIx", %p) = %d (%m)",
> + container, iotlb->iova,
> + iotlb->addr_mask + 1, vaddr, ret);
> + }
> + } else {
> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> + if (ret) {
> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> + "0x%"HWADDR_PRIx") = %d (%m)",
> + container, iotlb->iova,
> + iotlb->addr_mask + 1, ret);
> + }
> + }
> }
>
> static void vfio_listener_region_add(MemoryListener *listener,
> @@ -1926,11 +1990,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
> VFIOContainer *container = container_of(listener, VFIOContainer,
> iommu_data.listener);
> hwaddr iova, end;
> + Int128 llend;
> void *vaddr;
> int ret;
>
> - assert(!memory_region_is_iommu(section->mr));
> -
> if (vfio_listener_skipped_section(section)) {
> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> section->offset_within_address_space,
> @@ -1946,21 +2009,57 @@ static void vfio_listener_region_add(MemoryListener *listener,
> }
>
> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> + llend = int128_make64(section->offset_within_address_space);
> + llend = int128_add(llend, section->size);
> + llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
So you're still looking for me to pick up patches 1/3 & 2/3 from the
previous int128 series for this?
> +
> + if (int128_ge(int128_make64(iova), llend)) {
> + return;
> + }
> +
> + memory_region_ref(section->mr);
> +
> + if (memory_region_is_iommu(section->mr)) {
> + VFIOGuestIOMMU *giommu;
> +
> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> + iova, int128_get64(int128_sub(llend, int128_one())));
> + /*
> + * FIXME: We should do some checking to see if the
> + * capabilities of the host VFIO IOMMU are adequate to model
> + * the guest IOMMU
> + *
> + * FIXME: This assumes that the guest IOMMU is empty of
> + * mappings at this point - we should either enforce this, or
> + * loop through existing mappings to map them into VFIO.
I would hope that when you do the below
memory_region_register_iommu_notifier() the callback gets a replay of
the current state of the iommu like we do for a memory region when we
register the listener that gets us here. Is that not the case?
> + *
> + * FIXME: For VFIO iommu types which have KVM acceleration to
> + * avoid bouncing all map/unmaps through qemu this way, this
> + * would be the right place to wire that up (tell the KVM
> + * device emulation the VFIO iommu handles to use).
> + */
> + giommu = g_malloc0(sizeof(*giommu));
> + giommu->iommu = section->mr;
> + giommu->container = container;
> + giommu->n.notify = vfio_iommu_map_notify;
> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> +
> + return;
> + }
> +
> + /* Here we assume that memory_region_is_ram(section->mr)==true */
> +
> end = (section->offset_within_address_space + int128_get64(section->size)) &
> TARGET_PAGE_MASK;
Why isn't this now calculated from llend? I thought that was part of
why the int128 stuff was rolled in here.
>
> - if (iova >= end) {
> - return;
> - }
> -
> vaddr = memory_region_get_ram_ptr(section->mr) +
> section->offset_within_region +
> (iova - section->offset_within_address_space);
>
> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n",
> iova, end - 1, vaddr);
>
> - memory_region_ref(section->mr);
> ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly);
> if (ret) {
> error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> @@ -1991,6 +2090,26 @@ static void vfio_listener_region_del(MemoryListener *listener,
> return;
> }
>
> + if (memory_region_is_iommu(section->mr)) {
> + VFIOGuestIOMMU *giommu;
> +
> + QLIST_FOREACH(giommu, &container->giommu_list, giommu_next) {
> + if (giommu->iommu == section->mr) {
> + memory_region_unregister_iommu_notifier(&giommu->n);
> + QLIST_REMOVE(giommu, giommu_next);
> + g_free(giommu);
> + break;
> + }
> + }
> +
> + /*
> + * FIXME: We assume the one big unmap below is adequate to
> + * remove any individual page mappings in the IOMMU which
> + * might have been copied into VFIO. That may not be true for
> + * all IOMMU types
> + */
> + }
> +
> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> end = (section->offset_within_address_space + int128_get64(section->size)) &
> TARGET_PAGE_MASK;
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
@ 2013-09-05 19:01 ` Alex Williamson
2013-09-10 8:36 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-05 19:01 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> As sPAPR platform supports DMA windows on a PCI bus, the information
> about their location and size should be passed into the guest via
> the device tree.
>
> The patch adds a helper to read this info from the container fd.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * fixed possible leaks on error paths
> ---
> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> include/hw/misc/vfio.h | 11 +++++++++++
> 2 files changed, 56 insertions(+)
> create mode 100644 include/hw/misc/vfio.h
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 53791fb..4210471 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -39,6 +39,7 @@
> #include "qemu/range.h"
> #include "sysemu/kvm.h"
> #include "sysemu/sysemu.h"
> +#include "hw/misc/vfio.h"
>
> /* #define DEBUG_VFIO */
> #ifdef DEBUG_VFIO
> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
> }
>
> type_init(register_vfio_pci_dev_type)
> +
> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
> + struct vfio_iommu_spapr_tce_info *info,
> + int *group_fd)
> +{
> + VFIOAddressSpace *space;
> + VFIOGroup *group;
> + VFIOContainer *container;
> + int ret, fd;
> +
> + space = vfio_get_address_space(as);
> + if (!space) {
> + return -1;
> + }
> + group = vfio_get_group(groupid, space);
> + if (!group) {
> + goto put_as_exit;
> + }
> + container = group->container;
> + if (!group->container) {
> + goto put_group_exit;
> + }
> + fd = container->fd;
> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> + goto put_group_exit;
> + }
> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
> + if (ret) {
> + error_report("vfio: failed to get iommu info for container: %s",
> + strerror(errno));
> + goto put_group_exit;
> + }
> + *group_fd = group->fd;
The above gets don't actually increment a reference count, so copying
the fd seems risky here.
> +
> + return 0;
> +
> +put_group_exit:
> + vfio_put_group(group);
> +
> +put_as_exit:
> + vfio_put_address_space(space);
But put_group calls disconnect_container, which calls
put_address_space... so it gets put twice. The lack of symmetry
already bites us with a bug.
> +
> + return -1;
> +}
> diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
> new file mode 100644
> index 0000000..ac9a971
> --- /dev/null
> +++ b/include/hw/misc/vfio.h
> @@ -0,0 +1,11 @@
> +#ifndef VFIO_API_H
> +#define VFIO_API_H
> +
> +#include "qemu/typedefs.h"
> +#include <linux/vfio.h>
> +
> +extern int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
> + struct vfio_iommu_spapr_tce_info *info,
> + int *group_fd);
> +
> +#endif
* Re: [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr Alexey Kardashevskiy
@ 2013-09-05 19:05 ` Alex Williamson
2013-09-10 9:00 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-05 19:05 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> This turns the sPAPR support on and enables VFIO container use
> in the kernel.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v4:
> * fixed format string to use %m which is a glibc extension:
> "Print output of strerror(errno). No argument is required."
> ---
> hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
>
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 4210471..882da70 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
>
> memory_listener_register(&container->iommu_data.listener,
> container->space->as);
> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> + if (ret) {
> + error_report("vfio: failed to set group container: %m");
> + g_free(container);
> + close(fd);
> + return -errno;
> + }
> +
> + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> + if (ret) {
> + error_report("vfio: failed to set iommu for container: %m");
> + g_free(container);
> + close(fd);
> + return -errno;
> + }
> +
> + ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> + if (ret) {
> + error_report("vfio: failed to enable container: %m");
> + g_free(container);
> + close(fd);
> + return -errno;
These (and the copies that already exist in this function) are screaming
for a goto.
> + }
> +
> + container->iommu_data.listener = vfio_memory_listener;
> + container->iommu_data.release = vfio_listener_release;
> +
> + memory_listener_register(&container->iommu_data.listener,
> + container->space->as);
> } else {
> error_report("vfio: No available IOMMU models");
> g_free(container);
* Re: [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed
2013-09-05 18:24 ` Alex Williamson
@ 2013-09-10 8:09 ` Alexey Kardashevskiy
2013-09-10 21:51 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-10 8:09 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/06/2013 04:24 AM, Alex Williamson wrote:
> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> So far, VFIO has a notion of different logical DMA address spaces, but
>> only ever uses one (system memory). This patch extends this, creating
>> new VFIOAddressSpace objects as necessary, according to the AddressSpace
>> reported by the PCI subsystem for this device's DMAs.
>>
>> This isn't enough yet to support guest side IOMMUs with VFIO, but it does
>> mean we could now support VFIO devices on, for example, a guest side PCI
>> host bridge which maps system memory at somewhere other than 0 in PCI
>> space.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> hw/misc/vfio.c | 43 +++++++++++++++++++++++++++++++++++--------
>> 1 file changed, 35 insertions(+), 8 deletions(-)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 93a316e..c16f41b 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -133,9 +133,10 @@ enum {
>> typedef struct VFIOAddressSpace {
>> AddressSpace *as;
>> QLIST_HEAD(, VFIOContainer) containers;
>> + QLIST_ENTRY(VFIOAddressSpace) list;
>> } VFIOAddressSpace;
>>
>> -static VFIOAddressSpace vfio_address_space_memory;
>> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
>>
>> struct VFIOGroup;
>>
>> @@ -2611,10 +2612,34 @@ static int vfio_load_rom(VFIODevice *vdev)
>> return 0;
>> }
>>
>> -static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
>> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
>> {
>> + VFIOAddressSpace *space;
>> +
>> + QLIST_FOREACH(space, &vfio_address_spaces, list) {
>> + if (space->as == as) {
>> + return space;
>> + }
>> + }
>> +
>> + /* No suitable VFIOAddressSpace, create a new one */
>> + space = g_malloc0(sizeof(*space));
>> space->as = as;
>> QLIST_INIT(&space->containers);
>> +
>> + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
>> +
>> + return space;
>> +}
>> +
>> +static void vfio_put_address_space(VFIOAddressSpace *space)
>> +{
>> + if (!QLIST_EMPTY(&space->containers)) {
>> + return;
>> + }
>> +
>> + QLIST_REMOVE(space, list);
>> + g_free(space);
>> }
>>
>> static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
>> @@ -2699,6 +2724,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>> group->container = NULL;
>>
>> if (QLIST_EMPTY(&container->group_list)) {
>> + VFIOAddressSpace *space = container->space;
>> +
>> if (container->iommu_data.release) {
>> container->iommu_data.release(container);
>> }
>> @@ -2706,6 +2733,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
>> DPRINTF("vfio_disconnect_container: close container->fd\n");
>> close(container->fd);
>> g_free(container);
>> +
>> + vfio_put_address_space(space);
>> }
>> }
>>
>> @@ -3076,6 +3105,7 @@ static int vfio_initfn(PCIDevice *pdev)
>> {
>> VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
>> VFIOGroup *group;
>> + VFIOAddressSpace *space;
>> char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
>> ssize_t len;
>> struct stat st;
>> @@ -3111,14 +3141,12 @@ static int vfio_initfn(PCIDevice *pdev)
>> DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
>> vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
>>
>> - if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
>> - error_report("vfio: DMA address space must be system memory");
>> - return -EINVAL;
>> - }
>> + space = vfio_get_address_space(pci_device_iommu_address_space(pdev));
>>
>> - group = vfio_get_group(groupid, &vfio_address_space_memory);
>> + group = vfio_get_group(groupid, space);
>> if (!group) {
>> error_report("vfio: failed to get group %d", groupid);
>> + vfio_put_address_space(space);
>> return -ENOENT;
>> }
>>
>
> Kind of a code flow issue here, on teardown we have:
>
> vfio_put_group
> vfio_disconnect_container
> vfio_put_address_space
>
> On setup we do:
>
> vfio_get_address_space
> vfio_get_group
> vfio_connect_container
>
> We could easily move vfio_get_address_space into vfio_get_group to make
> things a little more balanced. It doesn't seem like too much additional
> to pass the address space through vfio_get_group into
> vfio_connect_container so that we could have a completely symmetric flow
> though.
I can do that. I will just need to call vfio_put_address_space() on every
branch which returns NULL. Or rework a bit more. So I ended up with this:
(not a patch, just cut-n-paste).
===
-static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *space)
+static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
{
+ VFIOAddressSpace *space;
VFIOGroup *group;
char path[32];
struct vfio_group_status status = { .argsz = sizeof(status) };
+ space = vfio_get_address_space(as);
+
QLIST_FOREACH(group, &group_list, next) {
if (group->groupid == groupid) {
/* Found it. Now is it already in the right context? */
@@ -2723,7 +2755,7 @@ static VFIOGroup *vfio_get_group(int groupid,
VFIOAddressSpace *space)
} else {
error_report("vfio: group %d used in multiple address spaces",
group->groupid);
- return NULL;
+ goto error_exit;
}
}
}
@@ -2734,24 +2766,19 @@ static VFIOGroup *vfio_get_group(int groupid,
VFIOAddressSpace *space)
group->fd = qemu_open(path, O_RDWR);
if (group->fd < 0) {
error_report("vfio: error opening %s: %m", path);
- g_free(group);
- return NULL;
+ goto free_group_exit;
}
if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
error_report("vfio: error getting group status: %m");
- close(group->fd);
- g_free(group);
- return NULL;
+ goto close_fd_exit;
}
if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
error_report("vfio: error, group %d is not viable, please ensure "
"all devices within the iommu_group are bound to their "
"vfio bus driver.", groupid);
- close(group->fd);
- g_free(group);
- return NULL;
+ goto close_fd_exit;
}
group->groupid = groupid;
@@ -2759,14 +2786,23 @@ static VFIOGroup *vfio_get_group(int groupid,
VFIOAddressSpace *space)
if (vfio_connect_container(group, space)) {
error_report("vfio: failed to setup container for group %d", groupid);
- close(group->fd);
- g_free(group);
- return NULL;
+ goto close_fd_exit;
}
QLIST_INSERT_HEAD(&group_list, group, next);
return group;
+
+close_fd_exit:
+ close(group->fd);
+
+free_group_exit:
+ g_free(group);
+
+error_exit:
+ vfio_put_address_space(space);
+
+ return NULL;
}
===
Is that ok? Should I split it into 2 patches for easier review?
>
>
>> @@ -3339,7 +3367,6 @@ static const TypeInfo vfio_pci_dev_info = {
>>
>> static void register_vfio_pci_dev_type(void)
>> {
>> - vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
>> type_register_static(&vfio_pci_dev_info);
>> }
>>
>
>
>
--
Alexey
* Re: [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support
2013-09-05 18:49 ` Alex Williamson
@ 2013-09-10 8:22 ` Alexey Kardashevskiy
2013-09-10 22:02 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-10 8:22 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/06/2013 04:49 AM, Alex Williamson wrote:
> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>> From: David Gibson <david@gibson.dropbear.id.au>
>>
>> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
>> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
>> sufficient capability and granularity to match the guest side. This works
>> by tracking all map and unmap operations on the guest IOMMU using the
>> notifiers, and mirroring them into VFIO.
>>
>> There are a number of FIXMEs, and the scheme involves rather more notifier
>> structures than I'd like, but it should make for a reasonable proof of
>> concept.
>>
>> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>
>> ---
>> Changes:
>> v4:
>> * fixed list objects naming
>> * vfio_listener_region_add() reworked to call memory_region_ref() from one
>> place only, it is also easier to review the changes
>> * fixes boundary check not to fail on sections == 2^64 bytes,
>> the "vfio: Fix debug output for int128 values" patch is required;
>> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
>> for 128 bit memory section sizes" patch proposal
>> ---
>> hw/misc/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
>> 1 file changed, 128 insertions(+), 9 deletions(-)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index c16f41b..53791fb 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -150,10 +150,18 @@ typedef struct VFIOContainer {
>> };
>> void (*release)(struct VFIOContainer *);
>> } iommu_data;
>> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
>> QLIST_HEAD(, VFIOGroup) group_list;
>> QLIST_ENTRY(VFIOContainer) next;
>> } VFIOContainer;
>>
>> +typedef struct VFIOGuestIOMMU {
>> + VFIOContainer *container;
>> + MemoryRegion *iommu;
>> + Notifier n;
>> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
>> +} VFIOGuestIOMMU;
>> +
>> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
>> typedef struct VFIOMSIXInfo {
>> uint8_t table_bar;
>> @@ -1917,7 +1925,63 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
>>
>> static bool vfio_listener_skipped_section(MemoryRegionSection *section)
>> {
>> - return !memory_region_is_ram(section->mr);
>> + return !memory_region_is_ram(section->mr) &&
>> + !memory_region_is_iommu(section->mr);
>> +}
>> +
>> +static void vfio_iommu_map_notify(Notifier *n, void *data)
>> +{
>> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
>> + VFIOContainer *container = giommu->container;
>> + IOMMUTLBEntry *iotlb = data;
>> + MemoryRegion *mr;
>> + hwaddr xlat;
>> + hwaddr len = iotlb->addr_mask + 1;
>> + void *vaddr;
>> + int ret;
>> +
>> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> + iotlb->iova, iotlb->iova + iotlb->addr_mask);
>> +
>> + /*
>> + * The IOMMU TLB entry we have just covers translation through
>> + * this IOMMU to its immediate target. We need to translate
>> + * it the rest of the way through to memory.
>> + */
>> + mr = address_space_translate(&address_space_memory,
>> + iotlb->translated_addr,
>> + &xlat, &len, iotlb->perm & IOMMU_WO);
>> + if (!memory_region_is_ram(mr)) {
>> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
>> + xlat);
>> + return;
>> + }
>> + if (len & iotlb->addr_mask) {
>> + DPRINTF("iommu has granularity incompatible with target AS\n");
>> + return;
>> + }
>> +
>> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
>> +
>> + if (iotlb->perm != IOMMU_NONE) {
>> + ret = vfio_dma_map(container, iotlb->iova,
>> + iotlb->addr_mask + 1, vaddr,
>> + !(iotlb->perm & IOMMU_WO) || mr->readonly);
>> + if (ret) {
>> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
>> + "0x%"HWADDR_PRIx", %p) = %d (%m)",
>> + container, iotlb->iova,
>> + iotlb->addr_mask + 1, vaddr, ret);
>> + }
>> + } else {
>> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
>> + if (ret) {
>> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
>> + "0x%"HWADDR_PRIx") = %d (%m)",
>> + container, iotlb->iova,
>> + iotlb->addr_mask + 1, ret);
>> + }
>> + }
>> }
>>
>> static void vfio_listener_region_add(MemoryListener *listener,
>> @@ -1926,11 +1990,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> VFIOContainer *container = container_of(listener, VFIOContainer,
>> iommu_data.listener);
>> hwaddr iova, end;
>> + Int128 llend;
>> void *vaddr;
>> int ret;
>>
>> - assert(!memory_region_is_iommu(section->mr));
>> -
>> if (vfio_listener_skipped_section(section)) {
>> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
>> section->offset_within_address_space,
>> @@ -1946,21 +2009,57 @@ static void vfio_listener_region_add(MemoryListener *listener,
>> }
>>
>> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
>> + llend = int128_make64(section->offset_within_address_space);
>> + llend = int128_add(llend, section->size);
>> + llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
>
> So you're still looking for me to pickup patches 1/3 & 2/3 from the
> previous int128 series for this?
Oops. Just noticed the question.
Yes, please. Your tree is a lot faster :)
>> +
>> + if (int128_ge(int128_make64(iova), llend)) {
>> + return;
>> + }
>> +
>> + memory_region_ref(section->mr);
>> +
>> + if (memory_region_is_iommu(section->mr)) {
>> + VFIOGuestIOMMU *giommu;
>> +
>> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
>> + iova, int128_get64(int128_sub(llend, int128_one())));
>> + /*
>> + * FIXME: We should do some checking to see if the
>> + * capabilities of the host VFIO IOMMU are adequate to model
>> + * the guest IOMMU
>> + *
>> + * FIXME: This assumes that the guest IOMMU is empty of
>> + * mappings at this point - we should either enforce this, or
>> + * loop through existing mappings to map them into VFIO.
>
> I would hope that when you do the below
> memory_region_register_iommu_notifier() the callback gets a replay of
> the current state of the iommu like we do for a memory region when we
> register the listener that gets us here. Is that not the case?
From what I see, it just adds a notifier:
void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
{
notifier_list_add(&mr->iommu_notify, n);
}
>> + *
>> + * FIXME: For VFIO iommu types which have KVM acceleration to
>> + * avoid bouncing all map/unmaps through qemu this way, this
>> + * would be the right place to wire that up (tell the KVM
>> + * device emulation the VFIO iommu handles to use).
>> + */
>> + giommu = g_malloc0(sizeof(*giommu));
>> + giommu->iommu = section->mr;
>> + giommu->container = container;
>> + giommu->n.notify = vfio_iommu_map_notify;
>> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
>> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
>> +
>> + return;
>> + }
>> +
>> + /* Here we assume that memory_region_is_ram(section->mr)==true */
>> +
>> end = (section->offset_within_address_space + int128_get64(section->size)) &
>> TARGET_PAGE_MASK;
>
> Why isn't this now calculated from llend?
I did not want to change the existing code :) I will fix it.
> I thought that was part of
> why the int128 stuff was rolled in here.
Not sure I understood the note. All I wanted here was to avoid
calling int128_get64() in the IOMMU case.
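The overflow the Int128 arithmetic guards against can be shown standalone. This sketch is my own illustration, not the patch's code: it assumes GCC/Clang's `unsigned __int128` in place of QEMU's Int128 helpers, hard-codes a 4K page, and derives both `iova` and the end of the span from the same wide value, which is what the review asks for:

```c
#include <stdint.h>

#define TARGET_PAGE_SIZE 4096ULL
#define TARGET_PAGE_MASK (~(TARGET_PAGE_SIZE - 1))
#define TARGET_PAGE_ALIGN(a) (((a) + TARGET_PAGE_SIZE - 1) & TARGET_PAGE_MASK)

typedef uint64_t hwaddr;

/* Compute the page-aligned [iova, last] span of a section whose size may be
 * a full 2^64 bytes; doing the sum in 128 bits avoids 64-bit wraparound.
 * Returns 0 if the section is empty after page alignment. */
static int section_span(hwaddr offset, unsigned __int128 size,
                        hwaddr *iova, hwaddr *last)
{
    unsigned __int128 llend = (unsigned __int128)offset + size;

    /* Sign-extend the page mask so the 128-bit AND keeps the high bits. */
    llend &= (unsigned __int128)(int64_t)TARGET_PAGE_MASK;

    *iova = TARGET_PAGE_ALIGN(offset);
    if ((unsigned __int128)*iova >= llend) {
        return 0;
    }
    *last = (hwaddr)(llend - 1); /* inclusive end, well-defined even at 2^64 */
    return 1;
}
```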
--
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-09-05 19:01 ` Alex Williamson
@ 2013-09-10 8:36 ` Alexey Kardashevskiy
2013-09-10 22:11 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-10 8:36 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/06/2013 05:01 AM, Alex Williamson wrote:
> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>> As sPAPR platform supports DMA windows on a PCI bus, the information
>> about their location and size should be passed into the guest via
>> the device tree.
>>
>> The patch adds a helper to read this info from the container fd.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v4:
>> * fixed possible leaks on error paths
>> ---
>> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>> include/hw/misc/vfio.h | 11 +++++++++++
>> 2 files changed, 56 insertions(+)
>> create mode 100644 include/hw/misc/vfio.h
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 53791fb..4210471 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -39,6 +39,7 @@
>> #include "qemu/range.h"
>> #include "sysemu/kvm.h"
>> #include "sysemu/sysemu.h"
>> +#include "hw/misc/vfio.h"
>>
>> /* #define DEBUG_VFIO */
>> #ifdef DEBUG_VFIO
>> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
>> }
>>
>> type_init(register_vfio_pci_dev_type)
>> +
>> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
>> + struct vfio_iommu_spapr_tce_info *info,
>> + int *group_fd)
>> +{
>> + VFIOAddressSpace *space;
>> + VFIOGroup *group;
>> + VFIOContainer *container;
>> + int ret, fd;
>> +
>> + space = vfio_get_address_space(as);
>> + if (!space) {
>> + return -1;
>> + }
>> + group = vfio_get_group(groupid, space);
>> + if (!group) {
>> + goto put_as_exit;
>> + }
>> + container = group->container;
>> + if (!group->container) {
>> + goto put_group_exit;
>> + }
>> + fd = container->fd;
>> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> + goto put_group_exit;
>> + }
>> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>> + if (ret) {
>> + error_report("vfio: failed to get iommu info for container: %s",
>> + strerror(errno));
>> + goto put_group_exit;
>> + }
>> + *group_fd = group->fd;
>
> The above gets don't actually increment a reference count, so copying
> the fd seems risky here.
If the fd is gone while I am carrying it to my "external VFIO user" to call
kvmppc_vfio_group_get_external_user() on it, then the guest just shot
itself in the foot, no?
And I do not see how I would make it less risky, do you?
>> +
>> + return 0;
>> +
>> +put_group_exit:
>> + vfio_put_group(group);
>> +
>> +put_as_exit:
>> + vfio_put_address_space(space);
>
> But put_group calls disconnect_container which calls
> put_address_space... so it gets put twice. The lack of symmetry
> already bites us with a bug.
True. This will be fixed by moving vfio_get_address_space() into
vfio_get_group().
--
Alexey
* Re: [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-09-05 19:05 ` Alex Williamson
@ 2013-09-10 9:00 ` Alexey Kardashevskiy
2013-09-10 22:13 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-10 9:00 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/06/2013 05:05 AM, Alex Williamson wrote:
> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>> This turns the sPAPR support on and enables VFIO container use
>> in the kernel.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v4:
>> * fixed format string to use %m which is a glibc extension:
>> "Print output of strerror(errno). No argument is required."
>> ---
>> hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
>> 1 file changed, 30 insertions(+)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 4210471..882da70 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
>>
>> memory_listener_register(&container->iommu_data.listener,
>> container->space->as);
>> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> + if (ret) {
>> + error_report("vfio: failed to set group container: %m");
>> + g_free(container);
>> + close(fd);
>> + return -errno;
>> + }
>> +
>> + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> + if (ret) {
>> + error_report("vfio: failed to set iommu for container: %m");
>> + g_free(container);
>> + close(fd);
>> + return -errno;
>> + }
>> +
>> + ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>> + if (ret) {
>> + error_report("vfio: failed to enable container: %m");
>> + g_free(container);
>> + close(fd);
>> + return -errno;
>
> These (and the copies that already exist in this function) are screaming
> for a goto.
Heh. So there should be 2 patches then - one adding gotos to the existing
code and another adding the new functionality with gotos already in place.
I can do that; is that what you suggest?
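The repeated free/close/return sequences being discussed follow the standard C goto-cleanup idiom. A standalone sketch of the shape (stubbed-out calls rather than the actual vfio ioctls; the real code would also close the container fd on each failure path):

```c
#include <stdlib.h>

typedef struct { int fd; } Container;

/* Stubs standing in for the group/container setup ioctls; each can fail. */
static int set_group_container(int fd) { return fd >= 0 ? 0 : -1; }
static int set_iommu(int fd)           { return fd >= 0 ? 0 : -1; }
static int enable_iommu(int fd)        { return fd >= 0 ? 0 : -1; }

/* One exit path instead of three copies of "free + close + return". */
static Container *connect_container(int fd)
{
    Container *container = malloc(sizeof(*container));

    if (!container) {
        return NULL;
    }
    container->fd = fd;

    if (set_group_container(fd)) {
        goto free_container_exit;
    }
    if (set_iommu(fd)) {
        goto free_container_exit;
    }
    if (enable_iommu(fd)) {
        goto free_container_exit;
    }
    return container;

free_container_exit:
    free(container);
    return NULL;
}
```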
What about the rest of the series? Next time I will split "[Qemu-devel]
[PATCH v4 05/12] spapr_pci: convert init to realize" but the rest will
still be the same. My understanding is that Alex Graf is expecting you to
review the whole thing (ack/sob? not sure how this all works) before he
pulls it into his tree.
And thanks for comments.
--
Alexey
* Re: [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed
2013-09-10 8:09 ` Alexey Kardashevskiy
@ 2013-09-10 21:51 ` Alex Williamson
0 siblings, 0 replies; 32+ messages in thread
From: Alex Williamson @ 2013-09-10 21:51 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Tue, 2013-09-10 at 18:09 +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 04:24 AM, Alex Williamson wrote:
> > On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >> From: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> So far, VFIO has a notion of different logical DMA address spaces, but
> >> only ever uses one (system memory). This patch extends this, creating
> >> new VFIOAddressSpace objects as necessary, according to the AddressSpace
> >> reported by the PCI subsystem for this device's DMAs.
> >>
> >> This isn't enough yet to support guest side IOMMUs with VFIO, but it does
> >> mean we could now support VFIO devices on, for example, a guest side PCI
> >> host bridge which maps system memory at somewhere other than 0 in PCI
> >> space.
> >>
> >> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> hw/misc/vfio.c | 43 +++++++++++++++++++++++++++++++++++--------
> >> 1 file changed, 35 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index 93a316e..c16f41b 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -133,9 +133,10 @@ enum {
> >> typedef struct VFIOAddressSpace {
> >> AddressSpace *as;
> >> QLIST_HEAD(, VFIOContainer) containers;
> >> + QLIST_ENTRY(VFIOAddressSpace) list;
> >> } VFIOAddressSpace;
> >>
> >> -static VFIOAddressSpace vfio_address_space_memory;
> >> +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces;
> >>
> >> struct VFIOGroup;
> >>
> >> @@ -2611,10 +2612,34 @@ static int vfio_load_rom(VFIODevice *vdev)
> >> return 0;
> >> }
> >>
> >> -static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as)
> >> +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
> >> {
> >> + VFIOAddressSpace *space;
> >> +
> >> + QLIST_FOREACH(space, &vfio_address_spaces, list) {
> >> + if (space->as == as) {
> >> + return space;
> >> + }
> >> + }
> >> +
> >> + /* No suitable VFIOAddressSpace, create a new one */
> >> + space = g_malloc0(sizeof(*space));
> >> space->as = as;
> >> QLIST_INIT(&space->containers);
> >> +
> >> + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
> >> +
> >> + return space;
> >> +}
> >> +
> >> +static void vfio_put_address_space(VFIOAddressSpace *space)
> >> +{
> >> + if (!QLIST_EMPTY(&space->containers)) {
> >> + return;
> >> + }
> >> +
> >> + QLIST_REMOVE(space, list);
> >> + g_free(space);
> >> }
> >>
> >> static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
> >> @@ -2699,6 +2724,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >> group->container = NULL;
> >>
> >> if (QLIST_EMPTY(&container->group_list)) {
> >> + VFIOAddressSpace *space = container->space;
> >> +
> >> if (container->iommu_data.release) {
> >> container->iommu_data.release(container);
> >> }
> >> @@ -2706,6 +2733,8 @@ static void vfio_disconnect_container(VFIOGroup *group)
> >> DPRINTF("vfio_disconnect_container: close container->fd\n");
> >> close(container->fd);
> >> g_free(container);
> >> +
> >> + vfio_put_address_space(space);
> >> }
> >> }
> >>
> >> @@ -3076,6 +3105,7 @@ static int vfio_initfn(PCIDevice *pdev)
> >> {
> >> VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> >> VFIOGroup *group;
> >> + VFIOAddressSpace *space;
> >> char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name;
> >> ssize_t len;
> >> struct stat st;
> >> @@ -3111,14 +3141,12 @@ static int vfio_initfn(PCIDevice *pdev)
> >> DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
> >> vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
> >>
> >> - if (pci_device_iommu_address_space(pdev) != &address_space_memory) {
> >> - error_report("vfio: DMA address space must be system memory");
> >> - return -EINVAL;
> >> - }
> >> + space = vfio_get_address_space(pci_device_iommu_address_space(pdev));
> >>
> >> - group = vfio_get_group(groupid, &vfio_address_space_memory);
> >> + group = vfio_get_group(groupid, space);
> >> if (!group) {
> >> error_report("vfio: failed to get group %d", groupid);
> >> + vfio_put_address_space(space);
> >> return -ENOENT;
> >> }
> >>
> >
> > Kind of a code flow issue here, on teardown we have:
> >
> > vfio_put_group
> > vfio_disconnect_container
> > vfio_put_address_space
> >
> > On setup we do:
> >
> > vfio_get_address_space
> > vfio_get_group
> > vfio_connect_container
> >
> > We could easily move vfio_get_address_space into vfio_get_group to make
> > things a little more balanced. It doesn't seem like too much additional
> > to pass the address space through vfio_get_group into
> > vfio_connect_container so that we could have a completely symmetric flow
> > though.
>
> I can do that. I will just need to call vfio_put_address_space() on every
> branch which returns NULL. Or rework a bit more. So I ended up with this:
>
> (not a patch, just cut-n-paste).
> ===
>
> -static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *space)
> +static VFIOGroup *vfio_get_group(int groupid, AddressSpace *as)
> {
> + VFIOAddressSpace *space;
> VFIOGroup *group;
> char path[32];
> struct vfio_group_status status = { .argsz = sizeof(status) };
>
> + space = vfio_get_address_space(as);
> +
You don't even need this here; you could test
group->container->space->as == as
below, then pass "as" to vfio_connect_container() instead of "space",
so the space would be created and destroyed by the container setup
code. I think that would clean up your exit paths too, not that we
couldn't stand to clean up some of the duplicate exit code w/ a goto
anyway. Thanks,
Alex
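The symmetric flow being proposed — address space acquired inside group setup, released from container teardown — can be modeled standalone. This is a toy refcount-style stand-in (hypothetical names, a counter in place of the container list), not the vfio code itself:

```c
#include <stdlib.h>

/* Toy model: the address space is created on first use and freed when its
 * last container disconnects, mirroring the get_group/put_group symmetry. */
typedef struct {
    int containers;     /* stand-in for the QLIST of containers */
} Space;

static Space *the_space; /* stand-in for the global space list */

static Space *get_address_space(void)
{
    if (!the_space) {
        the_space = calloc(1, sizeof(*the_space));
    }
    return the_space;
}

static void put_address_space(Space *space)
{
    if (space->containers == 0) {   /* only free when no containers remain */
        free(space);
        the_space = NULL;
    }
}

/* get_group acquires the space and connects a container... */
static Space *get_group(void)
{
    Space *space = get_address_space();
    space->containers++;            /* vfio_connect_container */
    return space;
}

/* ...and put_group disconnects the container, which releases the space,
 * so callers never touch put_address_space() directly and cannot
 * double-put it. */
static void put_group(Space *space)
{
    space->containers--;            /* vfio_disconnect_container */
    put_address_space(space);
}
```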
> QLIST_FOREACH(group, &group_list, next) {
> if (group->groupid == groupid) {
> /* Found it. Now is it already in the right context? */
> @@ -2723,7 +2755,7 @@ static VFIOGroup *vfio_get_group(int groupid,
> VFIOAddressSpace *space)
> } else {
> error_report("vfio: group %d used in multiple address spaces",
> group->groupid);
> - return NULL;
> + goto error_exit;
> }
> }
> }
> @@ -2734,24 +2766,19 @@ static VFIOGroup *vfio_get_group(int groupid,
> VFIOAddressSpace *space)
> group->fd = qemu_open(path, O_RDWR);
> if (group->fd < 0) {
> error_report("vfio: error opening %s: %m", path);
> - g_free(group);
> - return NULL;
> + goto free_group_exit;
> }
>
> if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) {
> error_report("vfio: error getting group status: %m");
> - close(group->fd);
> - g_free(group);
> - return NULL;
> + goto close_fd_exit;
> }
>
> if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
> error_report("vfio: error, group %d is not viable, please ensure "
> "all devices within the iommu_group are bound to their "
> "vfio bus driver.", groupid);
> - close(group->fd);
> - g_free(group);
> - return NULL;
> + goto close_fd_exit;
> }
>
> group->groupid = groupid;
> @@ -2759,14 +2786,23 @@ static VFIOGroup *vfio_get_group(int groupid,
> VFIOAddressSpace *space)
>
> if (vfio_connect_container(group, space)) {
> error_report("vfio: failed to setup container for group %d", groupid);
> - close(group->fd);
> - g_free(group);
> - return NULL;
> + goto close_fd_exit;
> }
>
> QLIST_INSERT_HEAD(&group_list, group, next);
>
> return group;
> +
> +close_fd_exit:
> + close(group->fd);
> +
> +free_group_exit:
> + g_free(group);
> +
> +error_exit:
> + vfio_put_address_space(space);
> +
> + return NULL;
> }
>
> ===
>
> Is that OK? Should I split it into 2 patches for easier review?
> >
> >
> >> @@ -3339,7 +3367,6 @@ static const TypeInfo vfio_pci_dev_info = {
> >>
> >> static void register_vfio_pci_dev_type(void)
> >> {
> >> - vfio_address_space_init(&vfio_address_space_memory, &address_space_memory);
> >> type_register_static(&vfio_pci_dev_info);
> >> }
> >>
> >
> >
> >
>
>
* Re: [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support
2013-09-10 8:22 ` Alexey Kardashevskiy
@ 2013-09-10 22:02 ` Alex Williamson
2013-09-11 6:15 ` Paolo Bonzini
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-10 22:02 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Tue, 2013-09-10 at 18:22 +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 04:49 AM, Alex Williamson wrote:
> > On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >> From: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> This patch uses the new IOMMU notifiers to allow VFIO pass through devices
> >> to work with guest side IOMMUs, as long as the host-side VFIO iommu has
> >> sufficient capability and granularity to match the guest side. This works
> >> by tracking all map and unmap operations on the guest IOMMU using the
> >> notifiers, and mirroring them into VFIO.
> >>
> >> There are a number of FIXMEs, and the scheme involves rather more notifier
> >> structures than I'd like, but it should make for a reasonable proof of
> >> concept.
> >>
> >> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >> ---
> >> Changes:
> >> v4:
> >> * fixed list objects naming
> >> * vfio_listener_region_add() reworked to call memory_region_ref() from one
> >> place only, it is also easier to review the changes
> >> * fixes boundary check not to fail on sections == 2^64 bytes,
> >> the "vfio: Fix debug output for int128 values" patch is required;
> >> this obsoletes the "[PATCH v3 0/3] vfio: fixes for better support
> >> for 128 bit memory section sizes" patch proposal
> >> ---
> >> hw/misc/vfio.c | 137 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
> >> 1 file changed, 128 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index c16f41b..53791fb 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -150,10 +150,18 @@ typedef struct VFIOContainer {
> >> };
> >> void (*release)(struct VFIOContainer *);
> >> } iommu_data;
> >> + QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
> >> QLIST_HEAD(, VFIOGroup) group_list;
> >> QLIST_ENTRY(VFIOContainer) next;
> >> } VFIOContainer;
> >>
> >> +typedef struct VFIOGuestIOMMU {
> >> + VFIOContainer *container;
> >> + MemoryRegion *iommu;
> >> + Notifier n;
> >> + QLIST_ENTRY(VFIOGuestIOMMU) giommu_next;
> >> +} VFIOGuestIOMMU;
> >> +
> >> /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
> >> typedef struct VFIOMSIXInfo {
> >> uint8_t table_bar;
> >> @@ -1917,7 +1925,63 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
> >>
> >> static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> >> {
> >> - return !memory_region_is_ram(section->mr);
> >> + return !memory_region_is_ram(section->mr) &&
> >> + !memory_region_is_iommu(section->mr);
> >> +}
> >> +
> >> +static void vfio_iommu_map_notify(Notifier *n, void *data)
> >> +{
> >> + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
> >> + VFIOContainer *container = giommu->container;
> >> + IOMMUTLBEntry *iotlb = data;
> >> + MemoryRegion *mr;
> >> + hwaddr xlat;
> >> + hwaddr len = iotlb->addr_mask + 1;
> >> + void *vaddr;
> >> + int ret;
> >> +
> >> + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >> + iotlb->iova, iotlb->iova + iotlb->addr_mask);
> >> +
> >> + /*
> >> + * The IOMMU TLB entry we have just covers translation through
> >> + * this IOMMU to its immediate target. We need to translate
> >> + * it the rest of the way through to memory.
> >> + */
> >> + mr = address_space_translate(&address_space_memory,
> >> + iotlb->translated_addr,
> >> + &xlat, &len, iotlb->perm & IOMMU_WO);
> >> + if (!memory_region_is_ram(mr)) {
> >> + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n",
> >> + xlat);
> >> + return;
> >> + }
> >> + if (len & iotlb->addr_mask) {
> >> + DPRINTF("iommu has granularity incompatible with target AS\n");
> >> + return;
> >> + }
> >> +
> >> + vaddr = memory_region_get_ram_ptr(mr) + xlat;
> >> +
> >> + if (iotlb->perm != IOMMU_NONE) {
> >> + ret = vfio_dma_map(container, iotlb->iova,
> >> + iotlb->addr_mask + 1, vaddr,
> >> + !(iotlb->perm & IOMMU_WO) || mr->readonly);
> >> + if (ret) {
> >> + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
> >> + "0x%"HWADDR_PRIx", %p) = %d (%m)",
> >> + container, iotlb->iova,
> >> + iotlb->addr_mask + 1, vaddr, ret);
> >> + }
> >> + } else {
> >> + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1);
> >> + if (ret) {
> >> + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", "
> >> + "0x%"HWADDR_PRIx") = %d (%m)",
> >> + container, iotlb->iova,
> >> + iotlb->addr_mask + 1, ret);
> >> + }
> >> + }
> >> }
> >>
> >> static void vfio_listener_region_add(MemoryListener *listener,
> >> @@ -1926,11 +1990,10 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >> VFIOContainer *container = container_of(listener, VFIOContainer,
> >> iommu_data.listener);
> >> hwaddr iova, end;
> >> + Int128 llend;
> >> void *vaddr;
> >> int ret;
> >>
> >> - assert(!memory_region_is_iommu(section->mr));
> >> -
> >> if (vfio_listener_skipped_section(section)) {
> >> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n",
> >> section->offset_within_address_space,
> >> @@ -1946,21 +2009,57 @@ static void vfio_listener_region_add(MemoryListener *listener,
> >> }
> >>
> >> iova = TARGET_PAGE_ALIGN(section->offset_within_address_space);
> >> + llend = int128_make64(section->offset_within_address_space);
> >> + llend = int128_add(llend, section->size);
> >> + llend = int128_and(llend, int128_exts64(TARGET_PAGE_MASK));
> >
> > So you're still looking for me to pickup patches 1/3 & 2/3 from the
> > previous int128 series for this?
>
>
> Oops. Just noticed the question.
> Yes, please. Your tree is a lot faster :)
It would be great if your next series was split according to who you
expect to merge the changes so we don't have to cherry pick individual
patches from broader series. Note that Paolo already ack'd your old
patch 1/3, so you could include that in the re-posting.
> >> +
> >> + if (int128_ge(int128_make64(iova), llend)) {
> >> + return;
> >> + }
> >> +
> >> + memory_region_ref(section->mr);
> >> +
> >> + if (memory_region_is_iommu(section->mr)) {
> >> + VFIOGuestIOMMU *giommu;
> >> +
> >> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n",
> >> + iova, int128_get64(int128_sub(llend, int128_one())));
> >> + /*
> >> + * FIXME: We should do some checking to see if the
> >> + * capabilities of the host VFIO IOMMU are adequate to model
> >> + * the guest IOMMU
> >> + *
> >> + * FIXME: This assumes that the guest IOMMU is empty of
> >> + * mappings at this point - we should either enforce this, or
> >> + * loop through existing mappings to map them into VFIO.
> >
> > I would hope that when you do the below
> > memory_region_register_iommu_notifier() the callback gets a replay of
> > the current state of the iommu like we do for a memory region when we
> > register the listener that gets us here. Is that not the case?
>
>
> From what I see, it just adds a notifier:
>
> void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
> {
> notifier_list_add(&mr->iommu_notify, n);
> }
Ok, that's too bad. I guess your comment is justified then.
> >> + *
> >> + * FIXME: For VFIO iommu types which have KVM acceleration to
> >> + * avoid bouncing all map/unmaps through qemu this way, this
> >> + * would be the right place to wire that up (tell the KVM
> >> + * device emulation the VFIO iommu handles to use).
> >> + */
> >> + giommu = g_malloc0(sizeof(*giommu));
> >> + giommu->iommu = section->mr;
> >> + giommu->container = container;
> >> + giommu->n.notify = vfio_iommu_map_notify;
> >> + QLIST_INSERT_HEAD(&container->giommu_list, giommu, giommu_next);
> >> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n);
> >> +
> >> + return;
> >> + }
> >> +
> >> + /* Here we assume that memory_region_is_ram(section->mr)==true */
> >> +
> >> end = (section->offset_within_address_space + int128_get64(section->size)) &
> >> TARGET_PAGE_MASK;
> >
> > Why isn't this now calculated from llend?
>
> I did not want to change the existing code :) I will fix it.
>
> > I thought that was part of
> > why the int128 stuff was rolled in here.
>
> Not sure that I understood the note. Everything I wanted here was not
> calling int128_get64() for IOMMU case.
Your old patch 3/3 was unmaintainable because it calculated something
that looks like the same value two different ways. First using Int128,
then using hwaddr. Now you've rolled it in here, but it still has the
same problem. I'd prefer your old 3/3 patch that converted to Int128,
replacing the existing calculation and leaving the !memory_region_is_iommu()
assert in place, then adding the rest of this patch's functionality after. Thanks,
Alex
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-09-10 8:36 ` Alexey Kardashevskiy
@ 2013-09-10 22:11 ` Alex Williamson
2013-09-13 10:11 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-10 22:11 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Tue, 2013-09-10 at 18:36 +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 05:01 AM, Alex Williamson wrote:
> > On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >> As sPAPR platform supports DMA windows on a PCI bus, the information
> >> about their location and size should be passed into the guest via
> >> the device tree.
> >>
> >> The patch adds a helper to read this info from the container fd.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v4:
> >> * fixed possible leaks on error paths
> >> ---
> >> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >> include/hw/misc/vfio.h | 11 +++++++++++
> >> 2 files changed, 56 insertions(+)
> >> create mode 100644 include/hw/misc/vfio.h
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index 53791fb..4210471 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -39,6 +39,7 @@
> >> #include "qemu/range.h"
> >> #include "sysemu/kvm.h"
> >> #include "sysemu/sysemu.h"
> >> +#include "hw/misc/vfio.h"
> >>
> >> /* #define DEBUG_VFIO */
> >> #ifdef DEBUG_VFIO
> >> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
> >> }
> >>
> >> type_init(register_vfio_pci_dev_type)
> >> +
> >> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
> >> + struct vfio_iommu_spapr_tce_info *info,
> >> + int *group_fd)
> >> +{
> >> + VFIOAddressSpace *space;
> >> + VFIOGroup *group;
> >> + VFIOContainer *container;
> >> + int ret, fd;
> >> +
> >> + space = vfio_get_address_space(as);
> >> + if (!space) {
> >> + return -1;
> >> + }
> >> + group = vfio_get_group(groupid, space);
> >> + if (!group) {
> >> + goto put_as_exit;
> >> + }
> >> + container = group->container;
> >> + if (!group->container) {
> >> + goto put_group_exit;
> >> + }
> >> + fd = container->fd;
> >> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> + goto put_group_exit;
> >> + }
> >> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
> >> + if (ret) {
> >> + error_report("vfio: failed to get iommu info for container: %s",
> >> + strerror(errno));
> >> + goto put_group_exit;
> >> + }
> >> + *group_fd = group->fd;
> >
> > The above gets don't actually increment a reference count, so copying
> > the fd seems risky here.
>
>
> If the fd is gone while I am carrying it to my "external VFIO user" to call
> kvmppc_vfio_group_get_external_user() on it, then the guest just shot
> itself in the foot, no?
> And I do not see how I would make it less risky, do you?
We've handled the case in the kernel where the IOMMU code has a
reference to the group so the group won't go away as long as that
reference is in place, but we don't have that in QEMU. If you supported
hotplug, how would QEMU vfio notify spapr code to release the group? I
think you'd be left with the spapr kernel code holding the group
reference and possibly a bogus file descriptor in QEMU if the group is
close()'d and you've cached it from the above code. Perhaps it's
sufficient to note that you don't support hot remove, but do you
actually do anything to prevent it? Thanks,
Alex
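Outside anything this series actually does, one standard way to make a cached descriptor independent of the original's lifetime is to take a private reference with dup(), so a later close() of the group fd cannot invalidate the copy. The helper name below is hypothetical:

```c
#include <unistd.h>
#include <fcntl.h>

/* Return a privately owned duplicate of group_fd; the caller may keep
 * using it after the original descriptor has been closed, and must
 * close() it when done. Returns -1 with errno set on failure. */
static int cache_group_fd(int group_fd)
{
    return dup(group_fd);
}
```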
> >> +
> >> + return 0;
> >> +
> >> +put_group_exit:
> >> + vfio_put_group(group);
> >> +
> >> +put_as_exit:
> >> + vfio_put_address_space(space);
> >
> > But put_group calls disconnect_container which calls
> > put_address_space... so it gets put twice. The lack of symmetry
> > already bites us with a bug.
>
> True. This will be fixed by moving vfio_get_address_space() into
> vfio_get_group().
>
>
>
>
* Re: [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-09-10 9:00 ` Alexey Kardashevskiy
@ 2013-09-10 22:13 ` Alex Williamson
2013-09-13 11:34 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-10 22:13 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Tue, 2013-09-10 at 19:00 +1000, Alexey Kardashevskiy wrote:
> On 09/06/2013 05:05 AM, Alex Williamson wrote:
> > On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >> This turns the sPAPR support on and enables VFIO container use
> >> in the kernel.
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >> ---
> >> Changes:
> >> v4:
> >> * fixed format string to use %m which is a glibc extension:
> >> "Print output of strerror(errno). No argument is required."
> >> ---
> >> hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
> >> 1 file changed, 30 insertions(+)
> >>
> >> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >> index 4210471..882da70 100644
> >> --- a/hw/misc/vfio.c
> >> +++ b/hw/misc/vfio.c
> >> @@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
> >>
> >> memory_listener_register(&container->iommu_data.listener,
> >> container->space->as);
> >> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >> + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >> + if (ret) {
> >> + error_report("vfio: failed to set group container: %m");
> >> + g_free(container);
> >> + close(fd);
> >> + return -errno;
> >> + }
> >> +
> >> + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >> + if (ret) {
> >> + error_report("vfio: failed to set iommu for container: %m");
> >> + g_free(container);
> >> + close(fd);
> >> + return -errno;
> >> + }
> >> +
> >> + ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >> + if (ret) {
> >> + error_report("vfio: failed to enable container: %m");
> >> + g_free(container);
> >> + close(fd);
> >> + return -errno;
> >
> > These (and the copies that already exist in this function) are screaming
> > for a goto.
>
>
> Heh. So. There should be 2 patches then - one adding gotos to the existing
> code and another one adding new functionality-with-gotos-already.
> I can do that, is that what you suggest?
Yes please.
> What about the rest of the series? Next time I will split "[Qemu-devel]
> [PATCH v4 05/12] spapr_pci: convert init to realize" but the rest will
> still be the same. My understanding is that Alex Graf is expecting you to
> review the whole thing (ack/sob? not sure how this all works) before he
> pulls it into his tree.
Oh, I've been picking out the vfio ones to review. Ok, on the next
revision I'll review them all, but please still split it into series for
vfio and series for ppc. Thanks,
Alex
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support
2013-09-10 22:02 ` Alex Williamson
@ 2013-09-11 6:15 ` Paolo Bonzini
0 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2013-09-11 6:15 UTC (permalink / raw)
To: Alex Williamson
Cc: Alexey Kardashevskiy, David Gibson, qemu-ppc, Alexander Graf,
qemu-devel
Il 11/09/2013 00:02, Alex Williamson ha scritto:
>>> > > I would hope that when you do the below
>>> > > memory_region_register_iommu_notifier() the callback gets a replay of
>>> > > the current state of the iommu like we do for a memory region when we
>>> > > register the listener that gets us here. Is that not the case?
>> >
>> >
>> > From what I see, it just adds a notifier:
>> >
>> > void memory_region_register_iommu_notifier(MemoryRegion *mr, Notifier *n)
>> > {
>> > notifier_list_add(&mr->iommu_notify, n);
>> > }
> Ok, that's too bad. I guess your comment is justified then.
>
That can be fixed by adding a new element ("replay") to the
MemoryRegionIOMMUOps. It can be a follow-up.
Paolo
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-09-10 22:11 ` Alex Williamson
@ 2013-09-13 10:11 ` Alexey Kardashevskiy
2013-09-25 20:29 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-13 10:11 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/11/2013 08:11 AM, Alex Williamson wrote:
> On Tue, 2013-09-10 at 18:36 +1000, Alexey Kardashevskiy wrote:
>> On 09/06/2013 05:01 AM, Alex Williamson wrote:
>>> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>>>> As sPAPR platform supports DMA windows on a PCI bus, the information
>>>> about their location and size should be passed into the guest via
>>>> the device tree.
>>>>
>>>> The patch adds a helper to read this info from the container fd.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v4:
>>>> * fixed possible leaks on error paths
>>>> ---
>>>> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>> include/hw/misc/vfio.h | 11 +++++++++++
>>>> 2 files changed, 56 insertions(+)
>>>> create mode 100644 include/hw/misc/vfio.h
>>>>
>>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>>> index 53791fb..4210471 100644
>>>> --- a/hw/misc/vfio.c
>>>> +++ b/hw/misc/vfio.c
>>>> @@ -39,6 +39,7 @@
>>>> #include "qemu/range.h"
>>>> #include "sysemu/kvm.h"
>>>> #include "sysemu/sysemu.h"
>>>> +#include "hw/misc/vfio.h"
>>>>
>>>> /* #define DEBUG_VFIO */
>>>> #ifdef DEBUG_VFIO
>>>> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
>>>> }
>>>>
>>>> type_init(register_vfio_pci_dev_type)
>>>> +
>>>> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
>>>> + struct vfio_iommu_spapr_tce_info *info,
>>>> + int *group_fd)
>>>> +{
>>>> + VFIOAddressSpace *space;
>>>> + VFIOGroup *group;
>>>> + VFIOContainer *container;
>>>> + int ret, fd;
>>>> +
>>>> + space = vfio_get_address_space(as);
>>>> + if (!space) {
>>>> + return -1;
>>>> + }
>>>> + group = vfio_get_group(groupid, space);
>>>> + if (!group) {
>>>> + goto put_as_exit;
>>>> + }
>>>> + container = group->container;
>>>> + if (!group->container) {
>>>> + goto put_group_exit;
>>>> + }
>>>> + fd = container->fd;
>>>> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>> + goto put_group_exit;
>>>> + }
>>>> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>>>> + if (ret) {
>>>> + error_report("vfio: failed to get iommu info for container: %s",
>>>> + strerror(errno));
>>>> + goto put_group_exit;
>>>> + }
>>>> + *group_fd = group->fd;
>>>
>>> The above gets don't actually increment a reference count, so copying
>>> the fd seems risky here.
>>
>>
>> If fd is gone while I am carrying it to my "external VFIO user" to call
>> kvmppc_vfio_group_get_external_user() on it, then the guest just shot
>> itself in the foot, no?
>> And I do not see how I would make it not risky, do you?
>
> We've handled the case in the kernel where the IOMMU code has a
> reference to the group so the group won't go away as long as that
> reference is in place, but we don't have that in QEMU. If you supported
> hotplug, how would QEMU vfio notify spapr code to release the group? I
> think you'd be left with the spapr kernel code holding the group
> reference and possibly a bogus file descriptor in QEMU if the group is
> close()'d and you've cached it from the above code. Perhaps it's
> sufficient to note that you don't support hot remove, but do you
> actually do anything to prevent it? Thanks,
I do not cache group_fd, I copy it from VFIOGroup and immediately pass it
to KVM, which immediately calls fget() on it. This is a really short distance,
and the only protection here would be:
- *group_fd = group->fd;
+ *group_fd = dup(group->fd);
and then close(group_fd) after I passed it to KVM. I guess it has to be
done anyway. But I suspect this is not what you are talking about...
>
> Alex
>
>>>> +
>>>> + return 0;
>>>> +
>>>> +put_group_exit:
>>>> + vfio_put_group(group);
>>>> +
>>>> +put_as_exit:
>>>> + vfio_put_address_space(space);
>>>
>>> But put_group calls disconnect_container which calls
>>> put_address_space... so it gets put twice. The lack of symmetry
>>> already bites us with a bug.
>>
>> True. This will be fixed by moving vfio_get_address_space() into
>> vfio_get_group().
--
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-09-10 22:13 ` Alex Williamson
@ 2013-09-13 11:34 ` Alexey Kardashevskiy
2013-09-25 20:33 ` Alex Williamson
0 siblings, 1 reply; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-13 11:34 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/11/2013 08:13 AM, Alex Williamson wrote:
> On Tue, 2013-09-10 at 19:00 +1000, Alexey Kardashevskiy wrote:
>> On 09/06/2013 05:05 AM, Alex Williamson wrote:
>>> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>>>> This turns the sPAPR support on and enables VFIO container use
>>>> in the kernel.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>> Changes:
>>>> v4:
>>>> * fixed format string to use %m which is a glibc extension:
>>>> "Print output of strerror(errno). No argument is required."
>>>> ---
>>>> hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
>>>> 1 file changed, 30 insertions(+)
>>>>
>>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>>> index 4210471..882da70 100644
>>>> --- a/hw/misc/vfio.c
>>>> +++ b/hw/misc/vfio.c
>>>> @@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
>>>>
>>>> memory_listener_register(&container->iommu_data.listener,
>>>> container->space->as);
>>>> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>> + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>>>> + if (ret) {
>>>> + error_report("vfio: failed to set group container: %m");
>>>> + g_free(container);
>>>> + close(fd);
>>>> + return -errno;
>>>> + }
>>>> +
>>>> + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>>>> + if (ret) {
>>>> + error_report("vfio: failed to set iommu for container: %m");
>>>> + g_free(container);
>>>> + close(fd);
>>>> + return -errno;
>>>> + }
>>>> +
>>>> + ret = ioctl(fd, VFIO_IOMMU_ENABLE);
>>>> + if (ret) {
>>>> + error_report("vfio: failed to enable container: %m");
>>>> + g_free(container);
>>>> + close(fd);
>>>> + return -errno;
>>>
>>> These (and the copies that already exist in this function) are screaming
>>> for a goto.
>>
>>
>> Heh. So. There should be 2 patches then - one adding gotos to the existing
>> code and another one adding new functionality-with-gotos-already.
>> I can do that, is that what you suggest?
>
> Yes please.
>
>> What about the rest of the series? Next time I will split "[Qemu-devel]
>> [PATCH v4 05/12] spapr_pci: convert init to realize" but the rest will
>> still be the same. My understanding is that Alex Graf is expecting you to
>> review the whole thing (ack/sob? not sure how this all works) before he
>> pulls it into his tree.
>
> Oh, I've been picking out the vfio ones to review. Ok, on the next
> revision I'll review them all, but please still split it into series for
> vfio and series for ppc. Thanks,
>
> Alex
>
This is the whole set of what I have on this matter:
KVM: spapr-vfio: enable in-kernel acceleration
spapr-vfio: enable for spapr
spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
spapr-iommu: add SPAPR VFIO IOMMU device
spapr-iommu: introduce SPAPR_TCE_TABLE class
spapr-pci: converts fprintf to error_report
spapr-pci: add spapr_pci trace
spapr-pci: introduce a finish_realize() callback
spapr-pci: convert init() callback to realize()
spapr vfio: add vfio_container_spapr_get_info()
vfio: Add guest side IOMMU support
vfio: Create VFIOAddressSpace objects as needed
vfio: Introduce VFIO address spaces
vfio: rework to have error paths
vfio: Fix 128 bit handling
vfio: Fix debug output for int128 values
int128: add int128_exts64()
There are 3 sets: vfio stuff (8 patches), spapr-pci rework (7 patches) and
finally the enablement (2 patches). Post it as 3 series? Or split 128bit
things to a series as well? :)
--
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-09-13 10:11 ` Alexey Kardashevskiy
@ 2013-09-25 20:29 ` Alex Williamson
2013-09-26 10:22 ` Alexey Kardashevskiy
0 siblings, 1 reply; 32+ messages in thread
From: Alex Williamson @ 2013-09-25 20:29 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-09-13 at 20:11 +1000, Alexey Kardashevskiy wrote:
> On 09/11/2013 08:11 AM, Alex Williamson wrote:
> > On Tue, 2013-09-10 at 18:36 +1000, Alexey Kardashevskiy wrote:
> >> On 09/06/2013 05:01 AM, Alex Williamson wrote:
> >>> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >>>> As sPAPR platform supports DMA windows on a PCI bus, the information
> >>>> about their location and size should be passed into the guest via
> >>>> the device tree.
> >>>>
> >>>> The patch adds a helper to read this info from the container fd.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v4:
> >>>> * fixed possible leaks on error paths
> >>>> ---
> >>>> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
> >>>> include/hw/misc/vfio.h | 11 +++++++++++
> >>>> 2 files changed, 56 insertions(+)
> >>>> create mode 100644 include/hw/misc/vfio.h
> >>>>
> >>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >>>> index 53791fb..4210471 100644
> >>>> --- a/hw/misc/vfio.c
> >>>> +++ b/hw/misc/vfio.c
> >>>> @@ -39,6 +39,7 @@
> >>>> #include "qemu/range.h"
> >>>> #include "sysemu/kvm.h"
> >>>> #include "sysemu/sysemu.h"
> >>>> +#include "hw/misc/vfio.h"
> >>>>
> >>>> /* #define DEBUG_VFIO */
> >>>> #ifdef DEBUG_VFIO
> >>>> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
> >>>> }
> >>>>
> >>>> type_init(register_vfio_pci_dev_type)
> >>>> +
> >>>> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
> >>>> + struct vfio_iommu_spapr_tce_info *info,
> >>>> + int *group_fd)
> >>>> +{
> >>>> + VFIOAddressSpace *space;
> >>>> + VFIOGroup *group;
> >>>> + VFIOContainer *container;
> >>>> + int ret, fd;
> >>>> +
> >>>> + space = vfio_get_address_space(as);
> >>>> + if (!space) {
> >>>> + return -1;
> >>>> + }
> >>>> + group = vfio_get_group(groupid, space);
> >>>> + if (!group) {
> >>>> + goto put_as_exit;
> >>>> + }
> >>>> + container = group->container;
> >>>> + if (!group->container) {
> >>>> + goto put_group_exit;
> >>>> + }
> >>>> + fd = container->fd;
> >>>> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>>> + goto put_group_exit;
> >>>> + }
> >>>> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
> >>>> + if (ret) {
> >>>> + error_report("vfio: failed to get iommu info for container: %s",
> >>>> + strerror(errno));
> >>>> + goto put_group_exit;
> >>>> + }
> >>>> + *group_fd = group->fd;
> >>>
> >>> The above gets don't actually increment a reference count, so copying
> >>> the fd seems risky here.
> >>
> >>
> >> If fd is gone while I am carrying it to my "external VFIO user" to call
> >> kvmppc_vfio_group_get_external_user() on it, then the guest just shot
> >> itself in the foot, no?
> >> And I do not see how I would make it not risky, do you?
> >
> > We've handled the case in the kernel where the IOMMU code has a
> > reference to the group so the group won't go away as long as that
> > reference is in place, but we don't have that in QEMU. If you supported
> > hotplug, how would QEMU vfio notify spapr code to release the group? I
> > think you'd be left with the spapr kernel code holding the group
> > reference and possibly a bogus file descriptor in QEMU if the group is
> > close()'d and you've cached it from the above code. Perhaps it's
> > sufficient to note that you don't support hot remove, but do you
> > actually do anything to prevent it? Thanks,
>
>
> I do not cache group_fd, I copy it from VFIOGroup and immediately pass it
> to KVM, which immediately calls fget() on it. This is a really short distance,
> and the only protection here would be:
>
> - *group_fd = group->fd;
> + *group_fd = dup(group->fd);
>
> and then close(group_fd) after I passed it to KVM. I guess it has to be
> done anyway. But I suspect this is not what you are talking about...
Meanwhile each of the processors has executed several million
instructions during this sequence of "immediate" events. Besides, this
just creates the interface; who uses it and how is outside of our
control after this is in place. Rather than creating an interface where
you can ask for info, some of which may be closely tied to the lifecycle
of a specific device, why not make an interface where vfio-pci can
register and unregister information about a device as part of its
lifecycle? That at least gives you an end point after which you know
the data is no longer valid. Thanks,
Alex
> >>>> +
> >>>> + return 0;
> >>>> +
> >>>> +put_group_exit:
> >>>> + vfio_put_group(group);
> >>>> +
> >>>> +put_as_exit:
> >>>> + vfio_put_address_space(space);
> >>>
> >>> But put_group calls disconnect_container which calls
> >>> put_address_space... so it gets put twice. The lack of symmetry
> >>> already bites us with a bug.
> >>
> >> True. This will be fixed by moving vfio_get_address_space() into
> >> vfio_get_group().
>
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr
2013-09-13 11:34 ` Alexey Kardashevskiy
@ 2013-09-25 20:33 ` Alex Williamson
0 siblings, 0 replies; 32+ messages in thread
From: Alex Williamson @ 2013-09-25 20:33 UTC (permalink / raw)
To: Alexey Kardashevskiy; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On Fri, 2013-09-13 at 21:34 +1000, Alexey Kardashevskiy wrote:
> On 09/11/2013 08:13 AM, Alex Williamson wrote:
> > On Tue, 2013-09-10 at 19:00 +1000, Alexey Kardashevskiy wrote:
> >> On 09/06/2013 05:05 AM, Alex Williamson wrote:
> >>> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
> >>>> This turns the sPAPR support on and enables VFIO container use
> >>>> in the kernel.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> Changes:
> >>>> v4:
> >>>> * fixed format string to use %m which is a glibc extension:
> >>>> "Print output of strerror(errno). No argument is required."
> >>>> ---
> >>>> hw/misc/vfio.c | 30 ++++++++++++++++++++++++++++++
> >>>> 1 file changed, 30 insertions(+)
> >>>>
> >>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> >>>> index 4210471..882da70 100644
> >>>> --- a/hw/misc/vfio.c
> >>>> +++ b/hw/misc/vfio.c
> >>>> @@ -2815,6 +2815,36 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
> >>>>
> >>>> memory_listener_register(&container->iommu_data.listener,
> >>>> container->space->as);
> >>>> + } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> >>>> + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> >>>> + if (ret) {
> >>>> + error_report("vfio: failed to set group container: %m");
> >>>> + g_free(container);
> >>>> + close(fd);
> >>>> + return -errno;
> >>>> + }
> >>>> +
> >>>> + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> >>>> + if (ret) {
> >>>> + error_report("vfio: failed to set iommu for container: %m");
> >>>> + g_free(container);
> >>>> + close(fd);
> >>>> + return -errno;
> >>>> + }
> >>>> +
> >>>> + ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> >>>> + if (ret) {
> >>>> + error_report("vfio: failed to enable container: %m");
> >>>> + g_free(container);
> >>>> + close(fd);
> >>>> + return -errno;
> >>>
> >>> These (and the copies that already exist in this function) are screaming
> >>> for a goto.
> >>
> >>
> >> Heh. So. There should be 2 patches then - one adding gotos to the existing
> >> code and another one adding new functionality-with-gotos-already.
> >> I can do that, is that what you suggest?
> >
> > Yes please.
> >
> >> What about the rest of the series? Next time I will split "[Qemu-devel]
> >> [PATCH v4 05/12] spapr_pci: convert init to realize" but the rest will
> >> still be the same. My understanding is that Alex Graf is expecting you to
> >> review the whole thing (ack/sob? not sure how this all works) before he
> >> pulls it into his tree.
> >
> > Oh, I've been picking out the vfio ones to review. Ok, on the next
> > revision I'll review them all, but please still split it into series for
> > vfio and series for ppc. Thanks,
> >
> > Alex
> >
>
> This is the whole set of what I have on this matter:
>
> KVM: spapr-vfio: enable in-kernel acceleration
> spapr-vfio: enable for spapr
>
> spapr-vfio: add spapr-pci-vfio-host-bridge to support vfio
> spapr-iommu: add SPAPR VFIO IOMMU device
> spapr-iommu: introduce SPAPR_TCE_TABLE class
> spapr-pci: converts fprintf to error_report
> spapr-pci: add spapr_pci trace
> spapr-pci: introduce a finish_realize() callback
> spapr-pci: convert init() callback to realize()
>
> spapr vfio: add vfio_container_spapr_get_info()
> vfio: Add guest side IOMMU support
> vfio: Create VFIOAddressSpace objects as needed
> vfio: Introduce VFIO address spaces
> vfio: rework to have error paths
> vfio: Fix 128 bit handling
> vfio: Fix debug output for int128 values
> int128: add int128_exts64()
>
> There are 3 sets: vfio stuff (8 patches), spapr-pci rework (7 patches) and
> finally the enablement (2 patches). Post it as 3 series? Or split 128bit
> things to a series as well? :)
Yes, it would be great if you sent the 128bit stuff separately. I keep
running into the debug output bug but your patch for it is buried among
a few other series that aren't ready and that I've lost track of. Just
like individual patches, series should contain the smallest useful set
of changes that they can and be targeted at a single subsystem. That
way we can agree on some things, get them in and move on to the harder
ones. Thanks,
Alex
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info()
2013-09-25 20:29 ` Alex Williamson
@ 2013-09-26 10:22 ` Alexey Kardashevskiy
0 siblings, 0 replies; 32+ messages in thread
From: Alexey Kardashevskiy @ 2013-09-26 10:22 UTC (permalink / raw)
To: Alex Williamson; +Cc: Alexander Graf, qemu-ppc, qemu-devel, David Gibson
On 09/26/2013 06:29 AM, Alex Williamson wrote:
> On Fri, 2013-09-13 at 20:11 +1000, Alexey Kardashevskiy wrote:
>> On 09/11/2013 08:11 AM, Alex Williamson wrote:
>>> On Tue, 2013-09-10 at 18:36 +1000, Alexey Kardashevskiy wrote:
>>>> On 09/06/2013 05:01 AM, Alex Williamson wrote:
>>>>> On Fri, 2013-08-30 at 20:15 +1000, Alexey Kardashevskiy wrote:
>>>>>> As sPAPR platform supports DMA windows on a PCI bus, the information
>>>>>> about their location and size should be passed into the guest via
>>>>>> the device tree.
>>>>>>
>>>>>> The patch adds a helper to read this info from the container fd.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>> Changes:
>>>>>> v4:
>>>>>> * fixed possible leaks on error paths
>>>>>> ---
>>>>>> hw/misc/vfio.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>>>>> include/hw/misc/vfio.h | 11 +++++++++++
>>>>>> 2 files changed, 56 insertions(+)
>>>>>> create mode 100644 include/hw/misc/vfio.h
>>>>>>
>>>>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>>>>> index 53791fb..4210471 100644
>>>>>> --- a/hw/misc/vfio.c
>>>>>> +++ b/hw/misc/vfio.c
>>>>>> @@ -39,6 +39,7 @@
>>>>>> #include "qemu/range.h"
>>>>>> #include "sysemu/kvm.h"
>>>>>> #include "sysemu/sysemu.h"
>>>>>> +#include "hw/misc/vfio.h"
>>>>>>
>>>>>> /* #define DEBUG_VFIO */
>>>>>> #ifdef DEBUG_VFIO
>>>>>> @@ -3490,3 +3491,47 @@ static void register_vfio_pci_dev_type(void)
>>>>>> }
>>>>>>
>>>>>> type_init(register_vfio_pci_dev_type)
>>>>>> +
>>>>>> +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid,
>>>>>> + struct vfio_iommu_spapr_tce_info *info,
>>>>>> + int *group_fd)
>>>>>> +{
>>>>>> + VFIOAddressSpace *space;
>>>>>> + VFIOGroup *group;
>>>>>> + VFIOContainer *container;
>>>>>> + int ret, fd;
>>>>>> +
>>>>>> + space = vfio_get_address_space(as);
>>>>>> + if (!space) {
>>>>>> + return -1;
>>>>>> + }
>>>>>> + group = vfio_get_group(groupid, space);
>>>>>> + if (!group) {
>>>>>> + goto put_as_exit;
>>>>>> + }
>>>>>> + container = group->container;
>>>>>> + if (!group->container) {
>>>>>> + goto put_group_exit;
>>>>>> + }
>>>>>> + fd = container->fd;
>>>>>> + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>>>>> + goto put_group_exit;
>>>>>> + }
>>>>>> + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>>>>>> + if (ret) {
>>>>>> + error_report("vfio: failed to get iommu info for container: %s",
>>>>>> + strerror(errno));
>>>>>> + goto put_group_exit;
>>>>>> + }
>>>>>> + *group_fd = group->fd;
>>>>>
>>>>> The above gets don't actually increment a reference count, so copying
>>>>> the fd seems risky here.
>>>>
>>>>
>>>> If fd is gone while I am carrying it to my "external VFIO user" to call
> >>>> kvmppc_vfio_group_get_external_user() on it, then the guest just shot
> >>>> itself in the foot, no?
> >>>> And I do not see how I would make it not risky, do you?
>>>
>>> We've handled the case in the kernel where the IOMMU code has a
>>> reference to the group so the group won't go away as long as that
>>> reference is in place, but we don't have that in QEMU. If you supported
>>> hotplug, how would QEMU vfio notify spapr code to release the group? I
>>> think you'd be left with the spapr kernel code holding the group
>>> reference and possibly a bogus file descriptor in QEMU if the group is
>>> close()'d and you've cached it from the above code. Perhaps it's
>>> sufficient to note that you don't support hot remove, but do you
>>> actually do anything to prevent it? Thanks,
>>
>>
> >> I do not cache group_fd, I copy it from VFIOGroup and immediately pass it
> >> to KVM, which immediately calls fget() on it. This is a really short distance,
> >> and the only protection here would be:
>>
>> - *group_fd = group->fd;
>> + *group_fd = dup(group->fd);
>>
>> and then close(group_fd) after I passed it to KVM. I guess it has to be
>> done anyway. But I suspect this is not what you are talking about...
>
> Meanwhile each of the processors has executed several million
> instructions during this sequence of "immediate" events. Besides, this
> just creates the interface; who uses it and how is outside of our
> control after this is in place. Rather than creating an interface where
> you can ask for info, some of which may be closely tied to the lifecycle
> of a specific device, why not make an interface where vfio-pci can
> register and unregister information about a device as part of its
> lifecycle? That at least gives you an end point after which you know
> the data is no longer valid. Thanks,
Sorry, I am not sure I understood you here.
As I understand it, the whole VFIO external API thing will move from spapr
to vfio, so all I'll have to do is pass the LIOBN to vfio;
vfio_container_spapr_get_info() will become
vfio_container_spapr_register_liobn_and_get_info(), with no business with
any group fd. Is that correct?
Anyway, it would be useful to see a rough QEMU patch or a git tree with it.
Thanks!
>
> Alex
>
>>>>>> +
>>>>>> + return 0;
>>>>>> +
>>>>>> +put_group_exit:
>>>>>> + vfio_put_group(group);
>>>>>> +
>>>>>> +put_as_exit:
>>>>>> + vfio_put_address_space(space);
>>>>>
>>>>> But put_group calls disconnect_container which calls
> >>>>> put_address_space... so it gets put twice. The lack of symmetry
>>>>> already bites us with a bug.
>>>>
>>>> True. This will be fixed by moving vfio_get_address_space() into
>>>> vfio_get_group().
>>
>>
>
>
>
--
Alexey
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread, other threads:[~2013-09-26 10:22 UTC | newest]
Thread overview: 32+ messages
2013-08-30 10:15 [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 01/12] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 02/12] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
2013-09-05 18:24 ` Alex Williamson
2013-09-10 8:09 ` Alexey Kardashevskiy
2013-09-10 21:51 ` Alex Williamson
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 03/12] vfio: Add guest side IOMMU support Alexey Kardashevskiy
2013-09-05 18:49 ` Alex Williamson
2013-09-10 8:22 ` Alexey Kardashevskiy
2013-09-10 22:02 ` Alex Williamson
2013-09-11 6:15 ` Paolo Bonzini
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 04/12] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
2013-09-05 19:01 ` Alex Williamson
2013-09-10 8:36 ` Alexey Kardashevskiy
2013-09-10 22:11 ` Alex Williamson
2013-09-13 10:11 ` Alexey Kardashevskiy
2013-09-25 20:29 ` Alex Williamson
2013-09-26 10:22 ` Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 05/12] spapr_pci: convert init to realize Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 06/12] spapr_pci: add spapr_pci trace Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 07/12] spapr_pci: converts fprintf to error_report Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 08/12] spapr_iommu: introduce SPAPR_TCE_TABLE class Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 09/12] spapr_iommu: add SPAPR VFIO IOMMU Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 10/12] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 11/12] spapr vfio: enable for spapr Alexey Kardashevskiy
2013-09-05 19:05 ` Alex Williamson
2013-09-10 9:00 ` Alexey Kardashevskiy
2013-09-10 22:13 ` Alex Williamson
2013-09-13 11:34 ` Alexey Kardashevskiy
2013-09-25 20:33 ` Alex Williamson
2013-08-30 10:15 ` [Qemu-devel] [PATCH v4 12/12] spapr kvm vfio: enable in-kernel acceleration Alexey Kardashevskiy
2013-09-05 6:43 ` [Qemu-devel] [PATCH v4 00/12] vfio on spapr-ppc64 Alexey Kardashevskiy