[Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64
@ 2013-04-22  8:02 Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 1/3] vfio: make some of VFIO API public Alexey Kardashevskiy
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Alexey Kardashevskiy @ 2013-04-22  8:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

Yes, we are still tuning this stuff for us :)

Changes:

* new "spapr-pci-vfio-host-bridge" device is introduced
(inherits from spapr-pci-host-bridge);

* device#1->group->container->device#2,.. creation scheme is changed to
group->container->devices scheme (on ppc64);

* removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
as they are not really necessary now and it does not look like we will
need them in the nearest future.

Comments?


Alexey Kardashevskiy (3):
  vfio: make some of VFIO API public
  vfio: add support for spapr platform (powerpc64 server)
  spapr vfio: add spapr-pci-vfio-host-bridge to support vfio

 hw/misc/vfio.c              |   94 ++++++++++++++++++++--
 hw/ppc/spapr_iommu.c        |  151 +++++++++++++++++++++++++++--------
 hw/ppc/spapr_pci.c          |  186 +++++++++++++++++++++++++++++++++++++++++--
 include/hw/misc/vfio.h      |   20 +++++
 include/hw/pci-host/spapr.h |   12 +++
 include/hw/ppc/spapr.h      |    3 +
 trace-events                |    4 +
 7 files changed, 422 insertions(+), 48 deletions(-)
 create mode 100644 include/hw/misc/vfio.h

-- 
1.7.10.4

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH 1/3] vfio: make some of VFIO API public
  2013-04-22  8:02 [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alexey Kardashevskiy
@ 2013-04-22  8:02 ` Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 2/3] vfio: add support for spapr platform (powerpc64 server) Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Alexey Kardashevskiy @ 2013-04-22  8:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

The patch makes vfio_dma_map/vfio_dma_unmap functions public
for explicit use by VFIO clients rather that just being
called from MemoryListener.

The patch also adds a "connect" parameter to the vfio_get_group() function
to allow control over group and container creation and connection before
any device from the group is actially created in QEMU.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/misc/vfio.c         |   15 ++++++++-------
 include/hw/misc/vfio.h |   17 +++++++++++++++++
 2 files changed, 25 insertions(+), 7 deletions(-)
 create mode 100644 include/hw/misc/vfio.h

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 693a9ff..503125e 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -39,6 +39,7 @@
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "hw/misc/vfio.h"
 
 /* #define DEBUG_VFIO */
 #ifdef DEBUG_VFIO
@@ -1877,8 +1878,8 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr,
 /*
  * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86
  */
-static int vfio_dma_unmap(VFIOContainer *container,
-                          hwaddr iova, ram_addr_t size)
+int vfio_dma_unmap(VFIOContainer *container,
+                   hwaddr iova, ram_addr_t size)
 {
     struct vfio_iommu_type1_dma_unmap unmap = {
         .argsz = sizeof(unmap),
@@ -1895,8 +1896,8 @@ static int vfio_dma_unmap(VFIOContainer *container,
     return 0;
 }
 
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
-                        ram_addr_t size, void *vaddr, bool readonly)
+int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+                 ram_addr_t size, void *vaddr, bool readonly)
 {
     struct vfio_iommu_type1_dma_map map = {
         .argsz = sizeof(map),
@@ -2706,7 +2707,7 @@ static void vfio_disconnect_container(VFIOGroup *group)
     }
 }
 
-static VFIOGroup *vfio_get_group(int groupid)
+VFIOGroup *vfio_get_group(int groupid, bool connect)
 {
     VFIOGroup *group;
     char path[32];
@@ -2747,7 +2748,7 @@ static VFIOGroup *vfio_get_group(int groupid)
     group->groupid = groupid;
     QLIST_INIT(&group->device_list);
 
-    if (vfio_connect_container(group)) {
+    if (connect && vfio_connect_container(group)) {
         error_report("vfio: failed to setup container for group %d", groupid);
         close(group->fd);
         g_free(group);
@@ -2980,7 +2981,7 @@ static int vfio_initfn(PCIDevice *pdev)
     DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain,
             vdev->host.bus, vdev->host.slot, vdev->host.function, groupid);
 
-    group = vfio_get_group(groupid);
+    group = vfio_get_group(groupid, true);
     if (!group) {
         error_report("vfio: failed to get group %d", groupid);
         return -ENOENT;
diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
new file mode 100644
index 0000000..f39aef8
--- /dev/null
+++ b/include/hw/misc/vfio.h
@@ -0,0 +1,17 @@
+#ifndef VFIO_API_H
+#define VFIO_API_H
+
+#include <linux/vfio.h>
+
+struct VFIOContainer;
+struct VFIOGroup;
+
+extern int vfio_dma_unmap(struct VFIOContainer *container,
+                          hwaddr iova, ram_addr_t size);
+
+extern int vfio_dma_map(struct VFIOContainer *container, hwaddr iova,
+                        ram_addr_t size, void *vaddr, bool readonly);
+
+extern struct VFIOGroup *vfio_get_group(int groupid, bool connect);
+
+#endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH 2/3] vfio: add support for spapr platform (powerpc64 server)
  2013-04-22  8:02 [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 1/3] vfio: make some of VFIO API public Alexey Kardashevskiy
@ 2013-04-22  8:02 ` Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 3/3] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
  2013-04-22 18:39 ` [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alex Williamson
  3 siblings, 0 replies; 8+ messages in thread
From: Alexey Kardashevskiy @ 2013-04-22  8:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

The existing code assumes that containers are capable of
multiple groups support which is not the case for POWERPC64
where groups are defined by the hardware configuration and
cannot share one container.

The earlier proposal for PPC64 support was:
- spapr_phb_vfio_init(): create a spapr-pci-vfio-host-bridge,
	IOMMU ID is already known at this stage;
- spapr_pci_vfio_scan(): start scanning the group and adding devices;
- vfio_initfn(): the VFIO device init function allocates a group struct
	(if not yet);
- vfio_get_group():  a group struct allocator allocates a container
	(if not yet);
- vfio_connect_container(): connects a group to a container, requests
	POWERPC64 specific properties and calls a callback to spapr_pci.c
	passing those properties.

As the group ID is knows from the very beginning (as it is supplied via
the command line), it makes more sense to create group and container
structs first and then add devices from this group.

In order to achieve this, this patch adds a vfio_container_spapr_alloc()
function which allocates sPAPR container and returns its properties which
are a DMA window size and start.

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/misc/vfio.c         |   79 ++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/misc/vfio.h |    3 ++
 2 files changed, 82 insertions(+)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 503125e..adcbb80 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2684,6 +2684,85 @@ static int vfio_connect_container(VFIOGroup *group)
     return 0;
 }
 
+VFIOContainer *vfio_container_spapr_alloc(VFIOGroup *group,
+                                          struct vfio_iommu_spapr_tce_info *info)
+{
+    VFIOContainer *container;
+    int ret, fd;
+
+    if (group->container) {
+        return NULL;
+    }
+
+    fd = qemu_open("/dev/vfio/vfio", O_RDWR);
+    if (fd < 0) {
+        error_report("vfio: failed to open /dev/vfio/vfio: %m");
+        return NULL;
+    }
+
+    ret = ioctl(fd, VFIO_GET_API_VERSION);
+    if (ret != VFIO_API_VERSION) {
+        error_report("vfio: supported vfio version: %d, "
+                     "reported version: %d", VFIO_API_VERSION, ret);
+        close(fd);
+        return NULL;
+    }
+
+    container = g_malloc0(sizeof(*container));
+    container->fd = fd;
+
+    if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return NULL;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return NULL;
+        }
+
+        ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
+        if (ret) {
+            error_report("vfio: failed to get iommu info for container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return NULL;
+        }
+
+        ret = ioctl(container->fd, VFIO_IOMMU_ENABLE);
+        if (ret) {
+            error_report("vfio: failed to enable container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return NULL;
+        }
+    } else {
+        error_report("vfio: No available IOMMU models");
+        g_free(container);
+        close(fd);
+        return NULL;
+    }
+
+    QLIST_INIT(&container->group_list);
+    QLIST_INSERT_HEAD(&container_list, container, next);
+
+    group->container = container;
+    QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+    return container;
+}
+
 static void vfio_disconnect_container(VFIOGroup *group)
 {
     VFIOContainer *container = group->container;
diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
index f39aef8..d1c7ff1 100644
--- a/include/hw/misc/vfio.h
+++ b/include/hw/misc/vfio.h
@@ -14,4 +14,7 @@ extern int vfio_dma_map(struct VFIOContainer *container, hwaddr iova,
 
 extern struct VFIOGroup *vfio_get_group(int groupid, bool connect);
 
+extern struct VFIOContainer *vfio_container_spapr_alloc(struct VFIOGroup *group,
+                                                        struct vfio_iommu_spapr_tce_info *info);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Qemu-devel] [PATCH 3/3] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
  2013-04-22  8:02 [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 1/3] vfio: make some of VFIO API public Alexey Kardashevskiy
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 2/3] vfio: add support for spapr platform (powerpc64 server) Alexey Kardashevskiy
@ 2013-04-22  8:02 ` Alexey Kardashevskiy
  2013-04-22 18:39 ` [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alex Williamson
  3 siblings, 0 replies; 8+ messages in thread
From: Alexey Kardashevskiy @ 2013-04-22  8:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel, David Gibson

The patch adds a spapr-pci-vfio-host-bridge device type
which is a PCI Host Bridge with VFIO support. The new device
inherits from the spapr-pci-host-bridge device and adds
the following properties:
	iommu - IOMMU group ID which represents a Partitionable
	 	Endpoint, QEMU/ppc64 uses a separate PHB for
		an IOMMU group so the guest kernel has to have
		PCI Domain support enabled.
	forceaddr (optional, 0 by default) - forces QEMU to copy
		device:function from the host address as
		certain guest drivers expect devices to appear in
		particular locations;
	mf (optional, 0 by default) - forces multifunction bit for
		the function #0 of a found device, only makes sense
		for multifunction devices and only with the forceaddr
		property set. It would not be required if there
		was a way to know in advance whether a device is
		multifunctional or not.
	scan (optional, 1 by default) - if non-zero, the new PHB walks
		through all non-bridge devices in the group and tries
		adding them to the PHB; if zero, all devices in the group
		have to be configured manually via the QEMU command line.

The patch also adds a VFIO IOMMU type support to the existing
sPAPR TCE list in spapr_iommu.c.

Examples:
1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6:
	-device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6

2) Configure and Add 3 functions of a multifunctional device to QEMU:
(the NEC PCI USB card is used as an example here):
	-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \
	-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
	-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
	-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB

Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/ppc/spapr_iommu.c        |  151 +++++++++++++++++++++++++++--------
 hw/ppc/spapr_pci.c          |  186 +++++++++++++++++++++++++++++++++++++++++--
 include/hw/pci-host/spapr.h |   12 +++
 include/hw/ppc/spapr.h      |    3 +
 trace-events                |    4 +
 5 files changed, 315 insertions(+), 41 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 1bc8cb6..eabe56d 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -16,14 +16,17 @@
  * You should have received a copy of the GNU Lesser General Public
  * License along with this library; if not, see <http://www.gnu.org/licenses/>.
  */
+
 #include "hw/hw.h"
 #include "sysemu/kvm.h"
 #include "hw/qdev.h"
 #include "kvm_ppc.h"
 #include "sysemu/dma.h"
 #include "exec/address-spaces.h"
+#include "trace.h"
 
 #include "hw/ppc/spapr.h"
+#include "hw/misc/vfio.h"
 
 #include <libfdt.h>
 
@@ -41,14 +44,21 @@ typedef struct sPAPRTCETable sPAPRTCETable;
 struct sPAPRTCETable {
     DMAContext dma;
     uint32_t liobn;
-    uint32_t window_size;
-    sPAPRTCE *table;
-    bool bypass;
     int fd;
+    enum { TCETYPE_EMULATED,TCETYPE_VFIO } type;
+    union {
+        struct {
+            uint32_t window_size;
+            sPAPRTCE *table;
+            bool bypass;
+        };
+        struct {
+            struct VFIOContainer *container;
+        };
+    };
     QLIST_ENTRY(sPAPRTCETable) list;
 };
 
-
 QLIST_HEAD(spapr_tce_tables, sPAPRTCETable) spapr_tce_tables;
 
 static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
@@ -159,6 +169,7 @@ DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
     tcet = g_malloc0(sizeof(*tcet));
     dma_context_init(&tcet->dma, &address_space_memory, spapr_tce_translate, NULL, NULL);
 
+    tcet->type = TCETYPE_EMULATED;
     tcet->liobn = liobn;
     tcet->window_size = window_size;
 
@@ -240,6 +251,47 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba,
     return H_SUCCESS;
 }
 
+DMAContext *spapr_vfio_new_dma(uint32_t liobn, int iommu_id,
+                               struct VFIOContainer *container)
+{
+    sPAPRTCETable *tcet;
+
+    tcet = g_malloc0(sizeof(*tcet));
+    tcet->type = TCETYPE_VFIO;
+    tcet->container = container;
+    tcet->liobn = liobn;
+
+    QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list);
+
+    return &tcet->dma;
+}
+
+static int put_tce_vfio(sPAPRTCETable *tcet, target_ulong ioba,
+                        target_ulong tce)
+{
+    int ret;
+
+    if (tce & SPAPR_TCE_PAGE_MASK) {
+        ret = vfio_dma_map(tcet->container, ioba, SPAPR_TCE_PAGE_SIZE,
+                           qemu_get_ram_ptr(tce & ~SPAPR_TCE_PAGE_MASK),
+                           (tce & SPAPR_TCE_PAGE_MASK) == SPAPR_TCE_RO);
+        trace_spapr_iommu("vfio map", tcet->liobn, ioba, tce, ret);
+        if (ret < 0) {
+            perror("spapr_tce map");
+            return H_PARAMETER;
+        }
+    } else {
+        ret = vfio_dma_unmap(tcet->container, ioba, SPAPR_TCE_PAGE_SIZE);
+        trace_spapr_iommu("vfio unmap", tcet->liobn, ioba, 0, ret);
+        if (ret < 0) {
+            perror("spapr_tce unmap");
+            return H_PARAMETER;
+        }
+    }
+
+    return H_SUCCESS;
+}
+
 static target_ulong h_put_tce_indirect(PowerPCCPU *cpu,
                                        sPAPREnvironment *spapr,
                                        target_ulong opcode, target_ulong *args)
@@ -252,24 +304,37 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu,
     target_ulong ret = 0;
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
-    if (tcet) {
-        for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
-            target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) +
-                                        i * sizeof(target_ulong));
-            ret = put_tce_emu(tcet, ioba, tce);
-            if (ret) {
-                break;
-            }
+    if (!tcet) {
+        return H_PARAMETER;
+    }
+
+    if (liobn & 0xFFFFFFFF00000000ULL) {
+        hcall_dprintf("spapr_vio_put_tce on out-of-boundsw LIOBN "
+                      TARGET_FMT_lx "\n", liobn);
+        return H_PARAMETER;
+    }
+
+    for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
+        target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) +
+                                    i * sizeof(target_ulong));
+        switch (tcet->type) {
+        case TCETYPE_EMULATED: ret = put_tce_emu(tcet, ioba, tce); break;
+        case TCETYPE_VFIO: ret = put_tce_vfio(tcet, ioba, tce); break;
+        default: ret = H_PARAMETER; break;
+        }
+        if (ret) {
+            break;
         }
-        return ret;
     }
+
 #ifdef DEBUG_TCE
     fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
-            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx "\n",
-            __func__, liobn, ioba, tce);
+            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx
+            " ret = " TARGET_FMT_ld "\n",
+            __func__, liobn, ioba, tce, ret);
 #endif
 
-    return H_PARAMETER;
+    return ret;
 }
 
 static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
@@ -283,24 +348,36 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     target_ulong ret = 0;
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
 
+    if (!tcet) {
+        return H_PARAMETER;
+    }
+
+    if (liobn & 0xFFFFFFFF00000000ULL) {
+        hcall_dprintf("spapr_vio_put_tce on out-of-boundsw LIOBN "
+                      TARGET_FMT_lx "\n", liobn);
+        return H_PARAMETER;
+    }
+
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
-    if (tcet) {
-        for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
-            ret = put_tce_emu(tcet, ioba, tce_value);
-            if (ret) {
-                break;
-            }
+    for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) {
+        switch (tcet->type) {
+        case TCETYPE_EMULATED: ret = put_tce_emu(tcet, ioba, tce_value); break;
+        case TCETYPE_VFIO: ret = put_tce_vfio(tcet, ioba, tce_value); break;
+        default: ret = H_PARAMETER; break;
+        }
+        if (ret) {
+            break;
         }
-        return ret;
     }
 #ifdef DEBUG_TCE
-    fprintf(stderr, "%s on liobn=" TARGET_FMT_lx /*%s*/
-            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx "\n",
-            __func__, liobn, /*dev->qdev.id, */ioba, tce);
+    fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
+            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx
+            " ret = " TARGET_FMT_ld "\n",
+            __func__, liobn, ioba, tce, ret);
 #endif
 
-    return H_PARAMETER;
+    return ret;
 }
 
 static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
@@ -310,19 +387,27 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr,
     target_ulong ioba = args[1];
     target_ulong tce = args[2];
     sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn);
+    int ret;
+
+    if (!tcet) {
+        return H_PARAMETER;
+    }
 
     ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1);
 
-    if (tcet) {
-        return put_tce_emu(tcet, ioba, tce);
+    switch (tcet->type) {
+    case TCETYPE_EMULATED: ret = put_tce_emu(tcet, ioba, tce); break;
+    case TCETYPE_VFIO: ret = put_tce_vfio(tcet, ioba, tce); break;
+    default: ret = H_PARAMETER; break;
     }
 #ifdef DEBUG_TCE
-    fprintf(stderr, "%s on liobn=" TARGET_FMT_lx /*%s*/
-            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx "\n",
-            __func__, liobn, /*dev->qdev.id, */ioba, tce);
+    fprintf(stderr, "%s on liobn=" TARGET_FMT_lx
+            "  ioba 0x" TARGET_FMT_lx "  TCE 0x" TARGET_FMT_lx
+            " ret = " TARGET_FMT_ld "\n",
+            __func__, liobn, ioba, tce, ret);
 #endif
 
-    return H_PARAMETER;
+    return ret;
 }
 
 void spapr_iommu_init(void)
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 1da02ae..06b491e 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -22,6 +22,9 @@
  * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
  * THE SOFTWARE.
  */
+#include <sys/types.h>
+#include <dirent.h>
+
 #include "hw/hw.h"
 #include "hw/pci/pci.h"
 #include "hw/pci/msi.h"
@@ -34,6 +37,7 @@
 #include "trace.h"
 
 #include "hw/pci/pci_bus.h"
+#include "hw/misc/vfio.h"
 
 /* Copied from the kernel arch/powerpc/platforms/pseries/msi.c */
 #define RTAS_QUERY_FN           0
@@ -514,7 +518,7 @@ static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
     return phb->dma;
 }
 
-static int spapr_phb_init(SysBusDevice *s)
+static int _spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
     PCIHostState *phb = PCI_HOST_BRIDGE(s);
@@ -644,13 +648,6 @@ static int spapr_phb_init(SysBusDevice *s)
                            PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
     phb->bus = bus;
 
-    sphb->dma_window_start = 0;
-    sphb->dma_window_size = 0x40000000;
-    sphb->dma = spapr_tce_new_dma_context(sphb->dma_liobn, sphb->dma_window_size);
-    if (!sphb->dma) {
-        fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
-        return -1;
-    }
     pci_setup_iommu(bus, spapr_pci_dma_context_fn, sphb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
@@ -670,6 +667,27 @@ static int spapr_phb_init(SysBusDevice *s)
     return 0;
 }
 
+static int spapr_phb_init(SysBusDevice *s)
+{
+    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
+    int ret;
+
+    ret = _spapr_phb_init(s);
+    if (ret)
+        return ret;
+
+    sphb->dma_window_start = 0;
+    sphb->dma_window_size = 0x40000000;
+    sphb->dma = spapr_tce_new_dma_context(sphb->dma_liobn,
+                                          sphb->dma_window_size);
+    if (!sphb->dma) {
+        fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
+        return -1;
+    }
+
+    return 0;
+}
+
 static void spapr_phb_reset(DeviceState *qdev)
 {
     SysBusDevice *s = SYS_BUS_DEVICE(qdev);
@@ -770,6 +788,153 @@ PCIHostState *spapr_create_phb(sPAPREnvironment *spapr, int index)
     return PCI_HOST_BRIDGE(dev);
 }
 
+/* sPAPR VFIO */
+static Property spapr_phb_vfio_properties[] = {
+    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
+    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
+    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static int spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb)
+{
+    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
+    char *iommupath;
+    DIR *dirp;
+    struct dirent *entry;
+
+    if (!svphb->scan) {
+        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
+        return 0;
+    }
+
+    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
+                                svphb->iommugroupid);
+    if (!iommupath) {
+        return -ENOMEM;
+    }
+
+    dirp = opendir(iommupath);
+    if (!dirp) {
+        fprintf(stderr, "failed to scan group=%d\n", svphb->iommugroupid);
+        g_free(iommupath);
+        return -1;
+    }
+
+    while ((entry = readdir(dirp)) != NULL) {
+        char *tmp;
+        FILE *deviceclassfile;
+        unsigned deviceclass = 0, domainid, busid, devid, fnid;
+        char addr[32];
+        DeviceState *dev;
+
+        if (sscanf(entry->d_name, "%X:%X:%X.%x",
+                   &domainid, &busid, &devid, &fnid) != 4) {
+            continue;
+        }
+
+        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
+        trace_spapr_pci("Reading device class from ", tmp);
+
+        deviceclassfile = fopen(tmp, "r");
+        if (deviceclassfile) {
+            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
+            fclose(deviceclassfile);
+            if (ret != 1) {
+                continue;
+            }
+        }
+        g_free(tmp);
+
+        if (!deviceclass) {
+            continue;
+        }
+        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
+            /* Skip bridges */
+            continue;
+        }
+        trace_spapr_pci("Creating device from ", entry->d_name);
+
+        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
+        if (!dev) {
+            fprintf(stderr, "failed to create vfio-pci\n");
+            continue;
+        }
+        qdev_prop_parse(dev, "host", entry->d_name);
+        if (svphb->force_addr) {
+            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
+            qdev_prop_parse(dev, "addr", addr);
+        }
+        if (svphb->enable_multifunction) {
+            qdev_prop_set_bit(dev, "multifunction", 1);
+        }
+        qdev_init_nofail(dev);
+    }
+    closedir(dirp);
+    g_free(iommupath);
+
+    return 0;
+}
+
+static int spapr_phb_vfio_init(SysBusDevice *s)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(s);
+    struct VFIOGroup *group;
+    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
+    int ret;
+
+    if (svphb->iommugroupid == -1) {
+        fprintf(stderr, "Wrong IOMMU group ID %d\n", svphb->iommugroupid);
+        return -1;
+    }
+
+    ret = _spapr_phb_init(s);
+    if (ret) {
+        return ret;
+    }
+
+    group = vfio_get_group(svphb->iommugroupid, false);
+    if (!group) {
+        return -EFAULT;
+    }
+
+    svphb->container = vfio_container_spapr_alloc(group, &info);
+    if (!svphb->container) {
+        return -EFAULT;
+    }
+
+    svphb->phb.dma_window_start = info.dma32_window_start;
+    svphb->phb.dma_window_size = info.dma32_window_size;
+    svphb->phb.dma = spapr_vfio_new_dma(svphb->phb.dma_liobn,
+                                        svphb->iommugroupid,
+                                        svphb->container);
+
+    ret = spapr_pci_vfio_scan(svphb);
+    /* dma_window_xxxx will be initialized from
+       spapr_register_vfio_container() when VFIO will create the very first
+       device in the group */
+
+    return ret;
+}
+
+static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
+{
+    SysBusDeviceClass *sdc = SYS_BUS_DEVICE_CLASS(klass);
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    sdc->init = spapr_phb_vfio_init;
+    dc->props = spapr_phb_vfio_properties;
+    dc->vmsd = &vmstate_spapr_pci;
+}
+
+static const TypeInfo spapr_phb_vfio_info = {
+    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
+    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .class_init    = spapr_phb_vfio_class_init,
+};
+
 /* Macros to operate with address in OF binding to PCI */
 #define b_x(x, p, l)    (((x) & ((1<<(l))-1)) << (p))
 #define b_n(x)          b_x((x), 31, 1) /* 0 if relocatable */
@@ -860,6 +1025,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
     _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
                      sizeof(interrupt_map)));
 
+    if (!phb->dma_window_size) {
+        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
+        exit(1);
+    }
     spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
                  phb->dma_liobn, phb->dma_window_start,
                  phb->dma_window_size);
@@ -883,6 +1052,7 @@ void spapr_pci_rtas_init(void)
 static void spapr_pci_register_types(void)
 {
     type_register_static(&spapr_phb_info);
+    type_register_static(&spapr_phb_vfio_info);
 }
 
 type_init(spapr_pci_register_types)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 99ee921..f1eec00 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -30,10 +30,14 @@
 #define SPAPR_MSIX_MAX_DEVS 32
 
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
+#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
+    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
+
 typedef struct sPAPRPHBState {
     PCIHostState parent_obj;
 
@@ -64,6 +68,14 @@ typedef struct sPAPRPHBState {
     QLIST_ENTRY(sPAPRPHBState) list;
 } sPAPRPHBState;
 
+typedef struct sPAPRPHBVFIOState {
+    sPAPRPHBState phb;
+
+    struct VFIOContainer *container;
+    int32_t iommugroupid;
+    uint8_t scan, enable_multifunction, force_addr;
+} sPAPRPHBVFIOState;
+
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
 
 #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 72495b5..b830183 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -360,6 +360,9 @@ void spapr_iommu_init(void);
 void spapr_events_init(sPAPREnvironment *spapr);
 void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
 DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size);
+struct VFIOContainer;
+DMAContext *spapr_vfio_new_dma(uint32_t liobn, int iommu_id,
+                               struct VFIOContainer *container);
 void spapr_tce_free(DMAContext *dma);
 void spapr_tce_reset(DMAContext *dma);
 void spapr_tce_set_bypass(DMAContext *dma, bool bypass);
diff --git a/trace-events b/trace-events
index 581d67a..e4da4aa 100644
--- a/trace-events
+++ b/trace-events
@@ -1073,6 +1073,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
 qxl_render_update_area_done(void *cookie) "%p"
 
 # hw/spapr_pci.c
+spapr_pci(const char *msg1, const char *msg2) "%s%s"
 spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
 spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
@@ -1118,3 +1119,6 @@ virtio_ccw_new_device(int cssid, int ssid, int schid, int devno, const char *dev
 
 # migration.c
 migrate_set_state(int new_state) "new state %d"
+
+# hw/spapr_iommu.c
+spapr_iommu(const char *op, uint32_t liobn, uint64_t ioba, uint64_t tce, int ret) "%s %x ioba=%"PRIx64" tce=%"PRIx64" ret=%d"
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64
  2013-04-22  8:02 [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2013-04-22  8:02 ` [Qemu-devel] [PATCH 3/3] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2013-04-22 18:39 ` Alex Williamson
  2013-04-22 23:44   ` [Qemu-devel] [Qemu-ppc] " David Gibson
  3 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2013-04-22 18:39 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, David Gibson

On Mon, 2013-04-22 at 18:02 +1000, Alexey Kardashevskiy wrote:
> Yes, we are still tuning this stuff for us :)

Yay!

> Changes:
> 
> * new "spapr-pci-vfio-host-bridge" device is introduced
> (inherits from spapr-pci-host-bridge);
> 
> * device#1->group->container->device#2,.. creation scheme is changed to
> group->container->devices scheme (on ppc64);
> 
> * removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
> as they are not really necessary now and it does not look like we will
> need them in the nearest future.
> 
> Comments?

Not happy with this... :(

> Alexey Kardashevskiy (3):
>   vfio: make some of VFIO API public

Exports and conditionalizes vfio_get_group for no good reason other than
to get a pointer to a group that's not connected to a container.  Not
only is the bool parameter ugly, it has a poor interface - call it once
with "false" and get an unconnected group, call it for the same group
with "true", and still get back an unconnected group.  I'd say
vfio_get_group should be split to have group and container connect
separate, but I don't think you even need that...

>   vfio: add support for spapr platform (powerpc64 server)

...because this patch ignores the extensibility of
vfio_connect_container.  There is absolutely no reason for
vfio_container_spapr_alloc to exist.  It duplicates most of
vfio_connect_container, apparently using the attempt to reuse containers
as an excuse, even though this should happily fall through unless the
host code is broken.  vfio supports containers capable of multiple
groups, it does not assume it or require it.

>   spapr vfio: add spapr-pci-vfio-host-bridge to support vfio

And obviously this is the consumer of all that, which really just wants
some parameters from the iommu info ioctl and a way to call map and
unmap for dma.  What if instead of all of the above, we simply:

1) export vfio_dma_map/vfio_dma_unmap
2) add spapr support to vfio_connect_container (skip the info and enable
chunks)
2) add to vfio.c and export:

VFIOContainer *vfio_spapr_php_init(int groupid, struct vfio_iommu_spapr_tce_info *info)
{
    vfio_get_group(groupid); // including connect
    ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
    ioctl(container->fd, VFIO_IOMMU_ENABLE);
    return group->container;
}

Isn't that all you really need?  Thanks,

Alex

>  hw/misc/vfio.c              |   94 ++++++++++++++++++++--
>  hw/ppc/spapr_iommu.c        |  151 +++++++++++++++++++++++++++--------
>  hw/ppc/spapr_pci.c          |  186 +++++++++++++++++++++++++++++++++++++++++--
>  include/hw/misc/vfio.h      |   20 +++++
>  include/hw/pci-host/spapr.h |   12 +++
>  include/hw/ppc/spapr.h      |    3 +
>  trace-events                |    4 +
>  7 files changed, 422 insertions(+), 48 deletions(-)
>  create mode 100644 include/hw/misc/vfio.h
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/3 QEMU v4] vfio on ppc64
  2013-04-22 18:39 ` [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alex Williamson
@ 2013-04-22 23:44   ` David Gibson
  2013-04-23  1:58     ` Alex Williamson
  0 siblings, 1 reply; 8+ messages in thread
From: David Gibson @ 2013-04-22 23:44 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5761 bytes --]

On Mon, Apr 22, 2013 at 12:39:26PM -0600, Alex Williamson wrote:
> On Mon, 2013-04-22 at 18:02 +1000, Alexey Kardashevskiy wrote:
> > Yes, we are still tuning this stuff for us :)
> 
> Yay!
> 
> > Changes:
> > 
> > * new "spapr-pci-vfio-host-bridge" device is introduced
> > (inherits from spapr-pci-host-bridge);
> > 
> > * device#1->group->container->device#2,.. creation scheme is changed to
> > group->container->devices scheme (on ppc64);
> > 
> > * removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
> > as they are not really necessary now and it does not look like we will
> > need them in the nearest future.
> > 
> > Comments?
> 
> Not happy with this... :(

So, part of that will be my fault, not Alexey's, because of how I told
him to do this.

> > Alexey Kardashevskiy (3):
> >   vfio: make some of VFIO API public
> 
> Exports and conditionalizes vfio_get_group for no good reason other than
> to get a pointer to a group that's not connected to a container.  Not
> only is the bool parameter ugly, it has a poor interface - call it once
> with "false" and get an unconnected group, call it for the same group
> with "true", and still get back an unconnected group.  I'd say
> vfio_get_group should be split to have group and container connect
> separate, but I don't think you even need that...

The boolean is ugly, yes.  I was suggesting splitting a group_create()
function from get_group().

> >   vfio: add support for spapr platform (powerpc64 server)
> 
> ...because this patch ignores the extensibility of
> vfio_connect_container.  There is absolutely no reason for
> vfio_container_spapr_alloc to exist.  It duplicates most of
> vfio_connect_container, apparently using the attempt to reuse containers
> as an excuse, even though this should happily fall through unless the
> host code is broken.  vfio supports containers capable of multiple
> groups, it does not assume it or require it.

So, initially Alexey did extend vfio_connect_container(), but I think
this is wrong.  vfio_connect_container() does something different from
what we want on pseries, and not just because of details of the IOMMU
interface.

Apart from the construction of the container object (which could be
done by a shared helper), the guts of vfio_connect_container() does
two things:
	a) Finds a container for the group.  But it assumes any
container is suitable, if it will accept the group.  That works for
x86 because every container has the same mappings - all of RAM.  Even
if we were able to extend the kernel interface to support multiple
groups per container it would be wrong for pseries though: the PAPR
interface requires that each guest host bridge have independent DMA
mappings.  So the group *must* be bound to the container associated
with this device's guest host bridge, no other choice can be correct.

	b) It wires up the container to the MemorListener, so that it
maps all RAM.  Our model doesn't do that.

Note that both these considerations would become relevant for x86,
too, if and when a guest-visible IOMMU interface is implemented: you
wouldn't want to wire up a memory listener, and the (guest) device
would have to be wired to the container associated with its guest
IOMMU domain, not any accepting container.

The fundamental point is we have two very different models of using
the IOMMU, the "map all" and "guest visible" models..  And while it
happens that the PAPR IOMMU has constraints which mean it can't be
used in map all mode, and we never currently use the x86 IOMMUs in
guest visible mode, there's nothing to prevent a future IOMMU being
usable in either mode.  In fact there's nothing preventing using Type1
in either mode right now, except for the exercise of coding the guest
IOMMU interface into qemu.  So the model of IOMMU usage should not be
treated as simply a detail of the IOMMU type.

Now, what I do wonder is if we ought to have two different qemu device
types for a vfio device in map all mode, which will find and prepare
its own container and one in guest visible mode which must be told
which container to use, and fail if it can't add itself to that
container.

> >   spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
> 
> And obviously this is the consumer of all that, which really just wants
> some parameters from the iommu info ioctl and a way to call map and
> unmap for dma.  What if instead of all of the above, we simply:
> 
> 1) export vfio_dma_map/vfio_dma_unmap
> 2) add spapr support to vfio_connect_container (skip the info and enable
> chunks)
> 2) add to vfio.c and export:
> 
> VFIOContainer *vfio_spapr_php_init(int groupid, struct vfio_iommu_spapr_tce_info *info)
> {
>     vfio_get_group(groupid); // including connect
>     ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>     ioctl(container->fd, VFIO_IOMMU_ENABLE);
>     return group->container;
> }
> 
> Isn't that all you really need?  Thanks,
> 
> Alex
> 
> >  hw/misc/vfio.c              |   94 ++++++++++++++++++++--
> >  hw/ppc/spapr_iommu.c        |  151 +++++++++++++++++++++++++++--------
> >  hw/ppc/spapr_pci.c          |  186 +++++++++++++++++++++++++++++++++++++++++--
> >  include/hw/misc/vfio.h      |   20 +++++
> >  include/hw/pci-host/spapr.h |   12 +++
> >  include/hw/ppc/spapr.h      |    3 +
> >  trace-events                |    4 +
> >  7 files changed, 422 insertions(+), 48 deletions(-)
> >  create mode 100644 include/hw/misc/vfio.h
> > 
> 
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/3 QEMU v4] vfio on ppc64
  2013-04-22 23:44   ` [Qemu-devel] [Qemu-ppc] " David Gibson
@ 2013-04-23  1:58     ` Alex Williamson
  2013-04-23  3:14       ` David Gibson
  0 siblings, 1 reply; 8+ messages in thread
From: Alex Williamson @ 2013-04-23  1:58 UTC (permalink / raw)
  To: David Gibson; +Cc: Alexey Kardashevskiy, qemu-ppc, qemu-devel

On Tue, 2013-04-23 at 09:44 +1000, David Gibson wrote:
> On Mon, Apr 22, 2013 at 12:39:26PM -0600, Alex Williamson wrote:
> > On Mon, 2013-04-22 at 18:02 +1000, Alexey Kardashevskiy wrote:
> > > Yes, we are still tuning this stuff for us :)
> > 
> > Yay!
> > 
> > > Changes:
> > > 
> > > * new "spapr-pci-vfio-host-bridge" device is introduced
> > > (inherits from spapr-pci-host-bridge);
> > > 
> > > * device#1->group->container->device#2,.. creation scheme is changed to
> > > group->container->devices scheme (on ppc64);
> > > 
> > > * removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
> > > as they are not really necessary now and it does not look like we will
> > > need them in the nearest future.
> > > 
> > > Comments?
> > 
> > Not happy with this... :(
> 
> So, part of that will be my fault, not Alexey's, because of how I told
> him to do this.
> 
> > > Alexey Kardashevskiy (3):
> > >   vfio: make some of VFIO API public
> > 
> > Exports and conditionalizes vfio_get_group for no good reason other than
> > to get a pointer to a group that's not connected to a container.  Not
> > only is the bool parameter ugly, it has a poor interface - call it once
> > with "false" and get an unconnected group, call it for the same group
> > with "true", and still get back an unconnected group.  I'd say
> > vfio_get_group should be split to have group and container connect
> > separate, but I don't think you even need that...
> 
> The boolean is ugly, yes.  I was suggesting splitting a group_create()
> function from get_group().
> 
> > >   vfio: add support for spapr platform (powerpc64 server)
> > 
> > ...because this patch ignores the extensibility of
> > vfio_connect_container.  There is absolutely no reason for
> > vfio_container_spapr_alloc to exist.  It duplicates most of
> > vfio_connect_container, apparently using the attempt to reuse containers
> > as an excuse, even though this should happily fall through unless the
> > host code is broken.  vfio supports containers capable of multiple
> > groups, it does not assume it or require it.
> 
> So, initially Alexey did extend vfio_connect_container(), but I think
> this is wrong.  vfio_connect_container() does something different from
> what we want on pseries, and not just because of details of the IOMMU
> interface.
> 
> Apart from the construction of the container object (which could be
> done by a shared helper), the guts of vfio_connect_container() does
> two things:
> 	a) Finds a container for the group.  But it assumes any
> container is suitable, if it will accept the group.  That works for
> x86 because every container has the same mappings - all of RAM.  Even
> if we were able to extend the kernel interface to support multiple
> groups per container it would be wrong for pseries though: the PAPR
> interface requires that each guest host bridge have independent DMA
> mappings.  So the group *must* be bound to the container associated
> with this device's guest host bridge, no other choice can be correct.

This is not true.  vfio_connect_container _tests_ whether any of the
other containers are compatible, it makes no assumptions.  The
VFIO_GROUP_SET_CONTAINER must fail if the group is not compatible with
the existing container.  I'm not asking you to change that, I'm asking
that you implement exactly that meaning that this loop:

    QLIST_FOREACH(container, &container_list, next) {
        if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
            group->container = container;
            QLIST_INSERT_HEAD(&container->group_list, group, container_next);
            return 0;
        }
    }

never reaches the inner branch on spapr.  If the spapr vfio iommu driver
doesn't do this and allows, but does not support, multiple groups to be
connected to a container, it's a bug.  A group is only going to do this
once and it's only at setup, so I don't think there's a valid
performance argument here either.

> 	b) It wires up the container to the MemorListener, so that it
> maps all RAM.  Our model doesn't do that.

This is done within the type1 specific setup, ignore it.

> Note that both these considerations would become relevant for x86,
> too, if and when a guest-visible IOMMU interface is implemented: you
> wouldn't want to wire up a memory listener, and the (guest) device
> would have to be wired to the container associated with its guest
> IOMMU domain, not any accepting container.

At that point the problem is actually different because we potentially
have multiple viable iommu backends on the same host that will need to
behave a little differently.  That's not the case for spapr.  There's
only one viable iommu backend and it exactly fits the model that
vfio_connect_container was designed for.

> The fundamental point is we have two very different models of using
> the IOMMU, the "map all" and "guest visible" models..  And while it
> happens that the PAPR IOMMU has constraints which mean it can't be
> used in map all mode, and we never currently use the x86 IOMMUs in
> guest visible mode, there's nothing to prevent a future IOMMU being
> usable in either mode.  In fact there's nothing preventing using Type1
> in either mode right now, except for the exercise of coding the guest
> IOMMU interface into qemu.  So the model of IOMMU usage should not be
> treated as simply a detail of the IOMMU type.

If and when we have multiple viable iommu backends on a single platform,
I agree, we may need to be smarter about when groups get merged into the
same container.  We're not there yet.  spapr doesn't plan to support
multiple groups within a container, so whether we test it or not should
be a non-issue.  The current kernel implementation needs to disallow it.
The memory listener setup is completely contained within type1 setup, so
that should also be a non-issue.

> Now, what I do wonder is if we ought to have two different qemu device
> types for a vfio device in map all mode, which will find and prepare
> its own container and one in guest visible mode which must be told
> which container to use, and fail if it can't add itself to that
> container.

A different vfio- device type is an option, but I'm not seeing a
sufficient arguments above for why the existing model doesn't work.
Thanks,

Alex

> > >   spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
> > 
> > And obviously this is the consumer of all that, which really just wants
> > some parameters from the iommu info ioctl and a way to call map and
> > unmap for dma.  What if instead of all of the above, we simply:
> > 
> > 1) export vfio_dma_map/vfio_dma_unmap
> > 2) add spapr support to vfio_connect_container (skip the info and enable
> > chunks)
> > 2) add to vfio.c and export:
> > 
> > VFIOContainer *vfio_spapr_php_init(int groupid, struct vfio_iommu_spapr_tce_info *info)
> > {
> >     vfio_get_group(groupid); // including connect
> >     ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
> >     ioctl(container->fd, VFIO_IOMMU_ENABLE);
> >     return group->container;
> > }
> > 
> > Isn't that all you really need?  Thanks,
> > 
> > Alex
> > 
> > >  hw/misc/vfio.c              |   94 ++++++++++++++++++++--
> > >  hw/ppc/spapr_iommu.c        |  151 +++++++++++++++++++++++++++--------
> > >  hw/ppc/spapr_pci.c          |  186 +++++++++++++++++++++++++++++++++++++++++--
> > >  include/hw/misc/vfio.h      |   20 +++++
> > >  include/hw/pci-host/spapr.h |   12 +++
> > >  include/hw/ppc/spapr.h      |    3 +
> > >  trace-events                |    4 +
> > >  7 files changed, 422 insertions(+), 48 deletions(-)
> > >  create mode 100644 include/hw/misc/vfio.h
> > > 
> > 
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH 0/3 QEMU v4] vfio on ppc64
  2013-04-23  1:58     ` Alex Williamson
@ 2013-04-23  3:14       ` David Gibson
  0 siblings, 0 replies; 8+ messages in thread
From: David Gibson @ 2013-04-23  3:14 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 8413 bytes --]

On Mon, Apr 22, 2013 at 07:58:12PM -0600, Alex Williamson wrote:
> On Tue, 2013-04-23 at 09:44 +1000, David Gibson wrote:
> > On Mon, Apr 22, 2013 at 12:39:26PM -0600, Alex Williamson wrote:
> > > On Mon, 2013-04-22 at 18:02 +1000, Alexey Kardashevskiy wrote:
> > > > Yes, we are still tuning this stuff for us :)
> > > 
> > > Yay!
> > > 
> > > > Changes:
> > > > 
> > > > * new "spapr-pci-vfio-host-bridge" device is introduced
> > > > (inherits from spapr-pci-host-bridge);
> > > > 
> > > > * device#1->group->container->device#2,.. creation scheme is changed to
> > > > group->container->devices scheme (on ppc64);
> > > > 
> > > > * removed vfio_iommu_spapr_tce_dma_map/vfio_iommu_spapr_tce_dma_unmap
> > > > as they are not really necessary now and it does not look like we will
> > > > need them in the nearest future.
> > > > 
> > > > Comments?
> > > 
> > > Not happy with this... :(
> > 
> > So, part of that will be my fault, not Alexey's, because of how I told
> > him to do this.
> > 
> > > > Alexey Kardashevskiy (3):
> > > >   vfio: make some of VFIO API public
> > > 
> > > Exports and conditionalizes vfio_get_group for no good reason other than
> > > to get a pointer to a group that's not connected to a container.  Not
> > > only is the bool parameter ugly, it has a poor interface - call it once
> > > with "false" and get an unconnected group, call it for the same group
> > > with "true", and still get back an unconnected group.  I'd say
> > > vfio_get_group should be split to have group and container connect
> > > separate, but I don't think you even need that...
> > 
> > The boolean is ugly, yes.  I was suggesting splitting a group_create()
> > function from get_group().
> > 
> > > >   vfio: add support for spapr platform (powerpc64 server)
> > > 
> > > ...because this patch ignores the extensibility of
> > > vfio_connect_container.  There is absolutely no reason for
> > > vfio_container_spapr_alloc to exist.  It duplicates most of
> > > vfio_connect_container, apparently using the attempt to reuse containers
> > > as an excuse, even though this should happily fall through unless the
> > > host code is broken.  vfio supports containers capable of multiple
> > > groups, it does not assume it or require it.
> > 
> > So, initially Alexey did extend vfio_connect_container(), but I think
> > this is wrong.  vfio_connect_container() does something different from
> > what we want on pseries, and not just because of details of the IOMMU
> > interface.
> > 
> > Apart from the construction of the container object (which could be
> > done by a shared helper), the guts of vfio_connect_container() does
> > two things:
> > 	a) Finds a container for the group.  But it assumes any
> > container is suitable, if it will accept the group.  That works for
> > x86 because every container has the same mappings - all of RAM.  Even
> > if we were able to extend the kernel interface to support multiple
> > groups per container it would be wrong for pseries though: the PAPR
> > interface requires that each guest host bridge have independent DMA
> > mappings.  So the group *must* be bound to the container associated
> > with this device's guest host bridge, no other choice can be correct.
> 
> This is not true.

It is absolutely true.

>  vfio_connect_container _tests_ whether any of the
> other containers are compatible, it makes no assumptions.  The
> VFIO_GROUP_SET_CONTAINER must fail if the group is not compatible with
> the existing container.  I'm not asking you to change that, I'm asking
> that you implement exactly that meaning that this loop:

I know what the loop does.  But the loops "compatibility" test is
based on whether the kernel can add the group to the container.  That
only tests, and can only test whether the hardware and driver supports
having both groups in a container.  What it *can't* test is what set
of IO address spaces we wish to present to the guest.  That's a
constraint that only qemu has the information to apply correctly.

Now it happens that because we only support one group per container
this will always create a new container, which is what we need.  But
that's basically just getting the right answer by accident.

>     QLIST_FOREACH(container, &container_list, next) {
>         if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) {
>             group->container = container;
>             QLIST_INSERT_HEAD(&container->group_list, group, container_next);
>             return 0;
>         }
>     }
> 
> never reaches the inner branch on spapr.  If the spapr vfio iommu driver
> doesn't do this and allows, but does not support, multiple groups to be
> connected to a container, it's a bug.  A group is only going to do this
> once and it's only at setup, so I don't think there's a valid
> performance argument here either.

> > 	b) It wires up the container to the MemorListener, so that it
> > maps all RAM.  Our model doesn't do that.
> 
> This is done within the type1 specific setup, ignore it.

Yes, but my point is that connect_container at present has no clear
semantic, other than the type 1 semantic.  What needs to be done
depends on the IOMMU usage model, and only qemu can choose that.

> > Note that both these considerations would become relevant for x86,
> > too, if and when a guest-visible IOMMU interface is implemented: you
> > wouldn't want to wire up a memory listener, and the (guest) device
> > would have to be wired to the container associated with its guest
> > IOMMU domain, not any accepting container.
> 
> At that point the problem is actually different because we potentially
> have multiple viable iommu backends on the same host that will need to
> behave a little differently.  That's not the case for spapr.  There's
> only one viable iommu backend and it exactly fits the model that
> vfio_connect_container was designed for.

We shouldn't design the code around the coincidence that at present we
only support one backend per platform and only one usage model per
backend.

Perhaps you can explain the model that vfio_connect_container() is
supposed to implement.  Because the only model I can figure out for it
is the Type1 one and that absolutely does not fit the spapr model.

> > The fundamental point is we have two very different models of using
> > the IOMMU, the "map all" and "guest visible" models..  And while it
> > happens that the PAPR IOMMU has constraints which mean it can't be
> > used in map all mode, and we never currently use the x86 IOMMUs in
> > guest visible mode, there's nothing to prevent a future IOMMU being
> > usable in either mode.  In fact there's nothing preventing using Type1
> > in either mode right now, except for the exercise of coding the guest
> > IOMMU interface into qemu.  So the model of IOMMU usage should not be
> > treated as simply a detail of the IOMMU type.
> 
> If and when we have multiple viable iommu backends on a single platform,
> I agree, we may need to be smarter about when groups get merged into the
> same container.  We're not there yet.  spapr doesn't plan to support
> multiple groups within a container, so whether we test it or not should
> be a non-issue.  The current kernel implementation needs to disallow it.
> The memory listener setup is completely contained within type1 setup, so
> that should also be a non-issue.
> 
> > Now, what I do wonder is if we ought to have two different qemu device
> > types for a vfio device in map all mode, which will find and prepare
> > its own container and one in guest visible mode which must be told
> > which container to use, and fail if it can't add itself to that
> > container.
> 
> A different vfio- device type is an option, but I'm not seeing a
> sufficient arguments above for why the existing model doesn't work.

It works, but only by accident.  It is conceptually broken for the
guest visible usage model.  Well, or the map everything model, you can
pick either one depending on how you define what the semantics should
be.  But no set of coherent semantics makes it work for both cases.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-04-23  3:14 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-22  8:02 [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alexey Kardashevskiy
2013-04-22  8:02 ` [Qemu-devel] [PATCH 1/3] vfio: make some of VFIO API public Alexey Kardashevskiy
2013-04-22  8:02 ` [Qemu-devel] [PATCH 2/3] vfio: add support for spapr platform (powerpc64 server) Alexey Kardashevskiy
2013-04-22  8:02 ` [Qemu-devel] [PATCH 3/3] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2013-04-22 18:39 ` [Qemu-devel] [PATCH 0/3 QEMU v4] vfio on ppc64 Alex Williamson
2013-04-22 23:44   ` [Qemu-devel] [Qemu-ppc] " David Gibson
2013-04-23  1:58     ` Alex Williamson
2013-04-23  3:14       ` David Gibson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).