[Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64

* [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64
@ 2014-06-05  5:49 Alexey Kardashevskiy
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional Alexey Kardashevskiy
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  5:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	Gavin Shan

Yet another try with VFIO on SPAPR (server PPC64).

This adds VFIO support on SPAPR for the existing VFIO-SPAPR-TCE driver
in the upstream kernel.

Individual patches have more detailed commit logs.

While patch #1 is questionable, the others are pretty much ready, but
I do not know via which maintainer's tree - Alex or Alex? :)

Please comment. Thanks!


Changes:
v7:
* cleaned and rebased on agraf/ppc-next tree (which is on its way to upstream)

v6:
* initial set was split into 3

v5:
* rebase on top of the current upstream

v4:
* addressed all comments from Alex Williamson
* moved spapr-pci-phb-vfio-phb to new file
* split spapr-pci-phb-vfio to many smaller patches



Alexey Kardashevskiy (4):
  spapr_iommu: Make in-kernel TCE table optional
  vfio: Add vfio_container_spapr_get_info()
  spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio
  vfio: Enable for spapr

 hw/misc/vfio.c              | 64 +++++++++++++++++++++++++++++++
 hw/ppc/Makefile.objs        |  3 ++
 hw/ppc/spapr_iommu.c        |  6 ++-
 hw/ppc/spapr_pci.c          |  2 +-
 hw/ppc/spapr_pci_vfio.c     | 93 +++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_vio.c          |  2 +-
 include/hw/misc/vfio.h      | 11 ++++++
 include/hw/pci-host/spapr.h | 11 ++++++
 include/hw/ppc/spapr.h      |  4 +-
 9 files changed, 191 insertions(+), 5 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_vfio.c
 create mode 100644 include/hw/misc/vfio.h

-- 
2.0.0


* [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05  5:49 [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64 Alexey Kardashevskiy
@ 2014-06-05  5:49 ` Alexey Kardashevskiy
  2014-06-05  6:43   ` Alexey Kardashevskiy
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info() Alexey Kardashevskiy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  5:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	Gavin Shan

POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows allocating
TCE tables in host kernel memory and handling H_PUT_TCE requests
targeted at a specific LIOBN (logical bus number) right in the host without
switching to QEMU. At the moment this is used for emulated devices only
and the handler only puts the TCE into the table. If the in-kernel H_PUT_TCE
handler finds a LIOBN and the corresponding table, it puts the TCE into
the table and completes the hypercall. User space is not notified.

Upcoming VFIO support is going to use the same sPAPRTCETable device class,
so KVM_CAP_SPAPR_TCE would be used as well. That means that TCE
tables for VFIO would be allocated in the host as well.
However, VFIO operates on real IOMMU tables, and simply copying
a TCE to the real hardware TCE table will not work because guest physical
to host physical address translation is required.

So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
register VFIO's TCE tables in the host.

This adds a bool @kvm_accel flag to the sPAPRTCETable device. When it is
false, sPAPRTCETable does not try to allocate the TCE table in the host
kernel; the table is created in QEMU instead.

This adds a kvm_accel parameter to spapr_tce_new_table() to let users
choose whether to use acceleration or not. At the moment it is enabled
for VIO and emulated PCI. Upcoming VFIO support will set it to false.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is a workaround, but it lets me have one IOMMU device for VIO, emulated
PCI and VFIO, which is a good thing.

The alternative would be a new KVM_CAP_SPAPR_TCE_VFIO capability, but
this needs a kernel update.
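
For readers skimming the thread, the effect of the flag can be sketched
roughly as below. This is a simplified stand-alone model, not the QEMU code:
DemoTCETable and demo_tce_realize() are hypothetical names, the KVM state is
passed in as a plain bool, and the real fallback when the in-kernel
allocation fails is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for sPAPRTCETable; only the fields needed here. */
typedef struct {
    uint32_t nb_table;   /* number of TCE entries */
    bool kvm_accel;      /* caller asked for the in-kernel table */
    bool in_kernel;      /* result: table lives in the host kernel */
    uint64_t *table;     /* userspace table when not in-kernel */
} DemoTCETable;

/*
 * Models the check added to the realize function: the in-kernel table is
 * used only when the caller requested it AND KVM is enabled. Otherwise
 * the table is allocated (zeroed) in userspace.
 */
static bool demo_tce_realize(DemoTCETable *t, bool kvm_is_enabled)
{
    t->in_kernel = t->kvm_accel && kvm_is_enabled;
    if (t->in_kernel) {
        /* Real code would obtain the table from the kernel at this point. */
        t->table = NULL;
        return true;
    }
    t->table = calloc(t->nb_table, sizeof(uint64_t));
    return t->table != NULL;
}
```

With kvm_accel=false (the VFIO case) the table stays in userspace even on a
KVM host, so every H_PUT_TCE reaches QEMU.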
---
 hw/ppc/spapr_iommu.c   | 6 ++++--
 hw/ppc/spapr_pci.c     | 2 +-
 hw/ppc/spapr_vio.c     | 2 +-
 include/hw/ppc/spapr.h | 4 +++-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
index 3b6e373..bfd3701 100644
--- a/hw/ppc/spapr_iommu.c
+++ b/hw/ppc/spapr_iommu.c
@@ -115,7 +115,7 @@ static int spapr_tce_table_realize(DeviceState *dev)
 {
     sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
 
-    if (kvm_enabled()) {
+    if (tcet->kvm_accel && kvm_enabled()) {
         tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
                                               tcet->nb_table <<
                                               tcet->page_shift,
@@ -143,7 +143,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
 sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
                                    uint64_t bus_offset,
                                    uint32_t page_shift,
-                                   uint32_t nb_table)
+                                   uint32_t nb_table,
+                                   bool kvm_accel)
 {
     sPAPRTCETable *tcet;
 
@@ -162,6 +163,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
     tcet->bus_offset = bus_offset;
     tcet->page_shift = page_shift;
     tcet->nb_table = nb_table;
+    tcet->kvm_accel = kvm_accel;
 
     object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL);
 
diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index ddfd8bb..6021f35 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -658,7 +658,7 @@ static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
     tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
                                0,
                                SPAPR_TCE_PAGE_SHIFT,
-                               0x40000000 >> SPAPR_TCE_PAGE_SHIFT);
+                               0x40000000 >> SPAPR_TCE_PAGE_SHIFT, true);
     if (!tcet) {
         error_setg(errp, "Unable to create TCE table for %s",
                    sphb->dtbusname);
diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
index 48b0125..16385e4 100644
--- a/hw/ppc/spapr_vio.c
+++ b/hw/ppc/spapr_vio.c
@@ -460,7 +460,7 @@ static int spapr_vio_busdev_init(DeviceState *qdev)
                                         0,
                                         SPAPR_TCE_PAGE_SHIFT,
                                         pc->rtce_window_size >>
-                                        SPAPR_TCE_PAGE_SHIFT);
+                                        SPAPR_TCE_PAGE_SHIFT, true);
         address_space_init(&dev->as, spapr_tce_get_iommu(dev->tcet), qdev->id);
     }
 
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 4ffb903..7db34ff 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -402,6 +402,7 @@ struct sPAPRTCETable {
     uint32_t page_shift;
     uint64_t *table;
     bool bypass;
+    bool kvm_accel;
     int fd;
     MemoryRegion iommu;
     QLIST_ENTRY(sPAPRTCETable) list;
@@ -413,7 +414,8 @@ int spapr_h_cas_compose_response(target_ulong addr, target_ulong size);
 sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
                                    uint64_t bus_offset,
                                    uint32_t page_shift,
-                                   uint32_t nb_table);
+                                   uint32_t nb_table,
+                                   bool kvm_accel);
 MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
 void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass);
 int spapr_dma_dt(void *fdt, int node_off, const char *propname,
-- 
2.0.0


* [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info()
  2014-06-05  5:49 [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64 Alexey Kardashevskiy
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional Alexey Kardashevskiy
@ 2014-06-05  5:49 ` Alexey Kardashevskiy
  2014-06-05 19:27   ` Alex Williamson
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr Alexey Kardashevskiy
  3 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  5:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	Gavin Shan

To perform DMA mapping via a TCE table correctly, the guest must
know where the DMA window is located on the PCI bus. A hypervisor is
expected to provide such information. Since QEMU has no control
over this setting, we need a way to obtain the start address and size
from the host VFIO driver.

This adds a helper which returns the default DMA window properties
for a specific IOMMU group. The upstream kernel already implements this
ioctl.
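
A minimal sketch of the underlying query, assuming a Linux host with
<linux/vfio.h> and an already opened container fd. The function name
demo_query_dma32_window() is made up for illustration and is not the helper
added by this patch; only the ioctl and the struct come from the kernel
header.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper: asks the sPAPR TCE IOMMU driver behind
 * container_fd for the default 32-bit DMA window.
 * Returns 0 on success or a negative errno value.
 */
static int demo_query_dma32_window(int container_fd,
                                   uint32_t *start, uint32_t *size)
{
    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };

    if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, &info) < 0) {
        return -errno;   /* read errno before anything can clobber it */
    }
    *start = info.dma32_window_start;
    *size = info.dma32_window_size;
    return 0;
}
```

Returning the negative errno lets a caller report the real failure reason.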

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* do not return a group fd from the helper

v6:
* added dup() to protect group_fd from accidental disposal

v5:
* reworked to reflect the change to vfio_get_group() made by one
of the previous patches

v4:
* fixed possible leaks on error paths
---
 hw/misc/vfio.c         | 36 ++++++++++++++++++++++++++++++++++++
 include/hw/misc/vfio.h | 11 +++++++++++
 2 files changed, 47 insertions(+)
 create mode 100644 include/hw/misc/vfio.h

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 7437c2e..99141f3 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -39,6 +39,7 @@
 #include "qemu/range.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "hw/misc/vfio.h"
 
 /* #define DEBUG_VFIO */
 #ifdef DEBUG_VFIO
@@ -4318,3 +4319,38 @@ static void register_vfio_pci_dev_type(void)
 }
 
 type_init(register_vfio_pci_dev_type)
+
+int vfio_container_spapr_get_info(AddressSpace *as,
+                                  int32_t groupid,
+                                  struct vfio_iommu_spapr_tce_info *info)
+{
+    VFIOGroup *group;
+    VFIOContainer *container;
+    int ret, fd;
+
+    group = vfio_get_group(groupid, as);
+    if (!group) {
+        return -1;
+    }
+    container = group->container;
+    if (!group->container) {
+        goto put_group_exit;
+    }
+    fd = container->fd;
+    if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        goto put_group_exit;
+    }
+    ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
+    if (ret) {
+        error_report("vfio: failed to get iommu info for container: %s",
+                     strerror(errno));
+        goto put_group_exit;
+    }
+
+    return 0;
+
+put_group_exit:
+    vfio_put_group(group);
+
+    return -1;
+}
diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
new file mode 100644
index 0000000..e82f5a3
--- /dev/null
+++ b/include/hw/misc/vfio.h
@@ -0,0 +1,11 @@
+#ifndef VFIO_API_H
+#define VFIO_API_H
+
+#include "qemu/typedefs.h"
+#include <linux/vfio.h>
+
+extern int vfio_container_spapr_get_info(AddressSpace *as,
+                                         int32_t groupid,
+                                         struct vfio_iommu_spapr_tce_info *info);
+
+#endif
-- 
2.0.0


* [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio
  2014-06-05  5:49 [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64 Alexey Kardashevskiy
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional Alexey Kardashevskiy
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info() Alexey Kardashevskiy
@ 2014-06-05  5:50 ` Alexey Kardashevskiy
  2014-06-05 13:34   ` Alexander Graf
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr Alexey Kardashevskiy
  3 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  5:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	Gavin Shan

The patch adds a spapr-pci-vfio-host-bridge device type
which is a PCI Host Bridge with VFIO support. The new device
inherits from the spapr-pci-host-bridge device and adds an "iommu"
property which is an IOMMU ID. This ID represents the minimal entity
for which IOMMU isolation can be guaranteed. In the SPAPR architecture an
IOMMU group is called a Partitionable Endpoint (PE).

The current implementation supports one IOMMU ID per QEMU VFIO PHB. Since
SPAPR allows multiple PHBs at no extra cost, this does not seem to
be a problem. This limitation may change in the future though.

Example of use:
Configure and add 3 functions of a multifunction device to QEMU
(the NEC PCI USB card is used as an example here):
-device spapr-pci-vfio-host-bridge,id=USB,iommu=4,index=7 \
-device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true \
-device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB \
-device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB

where:
* index=7 is a QEMU PHB index (used as source for MMIO/MSI/IO windows
offset);
* iommu=4 is an IOMMU id which can be found in sysfs:
[aik@vpl2 ~]$ cd /sys/bus/pci/devices/0004:00:00.0/
[aik@vpl2 0004:00:00.0]$ ls -l iommu_group
lrwxrwxrwx 1 root root 0 Jun  5 12:49 iommu_group -> ../../../kernel/iommu_groups/4

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* removed a bunch of properties from VFIO PHB such as "scan", "multifunction",
"force_addr" - let management software deal with it
* removed traces used in scan() (which is also removed)
* updated license
* disables in-kernel TCE table ("false" in spapr_tce_new_table())

v5:
* added handling of possible failure of spapr_vfio_new_table()

v4:
* moved IOMMU changes to separate patches
* moved spapr-pci-vfio-host-bridge to new file
---
 hw/ppc/Makefile.objs        |  3 ++
 hw/ppc/spapr_pci_vfio.c     | 93 +++++++++++++++++++++++++++++++++++++++++++++
 include/hw/pci-host/spapr.h | 11 ++++++
 3 files changed, 107 insertions(+)
 create mode 100644 hw/ppc/spapr_pci_vfio.c

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index ea747f0..edd44d0 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -4,6 +4,9 @@ obj-y += ppc.o ppc_booke.o
 obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
 obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
 obj-$(CONFIG_PSERIES) += spapr_pci.o
+ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
+obj-y += spapr_pci_vfio.o
+endif
 # PowerPC 4xx boards
 obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
 obj-y += ppc4xx_pci.o
diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
new file mode 100644
index 0000000..ec52a18
--- /dev/null
+++ b/hw/ppc/spapr_pci_vfio.c
@@ -0,0 +1,93 @@
+/*
+ * QEMU sPAPR PCI host for VFIO
+ *
+ * Copyright (c) 2011-2014 Alexey Kardashevskiy, IBM Corporation.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License,
+ *  or (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "hw/ppc/spapr.h"
+#include "hw/pci-host/spapr.h"
+#include "hw/misc/vfio.h"
+
+static Property spapr_phb_vfio_properties[] = {
+    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
+{
+    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
+    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
+    int ret;
+    sPAPRTCETable *tcet;
+    uint32_t liobn = svphb->phb.dma_liobn;
+
+    if (svphb->iommugroupid == -1) {
+        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
+        return;
+    }
+
+    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
+                                        svphb->iommugroupid,
+                                        &info);
+    if (ret) {
+        error_setg_errno(errp, -ret,
+                         "spapr-vfio: get info from container failed");
+        return;
+    }
+
+    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, info.dma32_window_start,
+                               SPAPR_TCE_PAGE_SHIFT,
+                               info.dma32_window_size >> SPAPR_TCE_PAGE_SHIFT,
+                               false);
+    if (!tcet) {
+        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
+        return;
+    }
+
+    /* Register default 32bit DMA window */
+    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
+                                spapr_tce_get_iommu(tcet));
+}
+
+static void spapr_phb_vfio_reset(DeviceState *qdev)
+{
+    /* Do nothing */
+}
+
+static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
+
+    dc->props = spapr_phb_vfio_properties;
+    dc->reset = spapr_phb_vfio_reset;
+    spc->finish_realize = spapr_phb_vfio_finish_realize;
+}
+
+static const TypeInfo spapr_phb_vfio_info = {
+    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
+    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
+    .instance_size = sizeof(sPAPRPHBVFIOState),
+    .class_init    = spapr_phb_vfio_class_init,
+    .class_size    = sizeof(sPAPRPHBClass),
+};
+
+static void spapr_pci_vfio_register_types(void)
+{
+    type_register_static(&spapr_phb_vfio_info);
+}
+
+type_init(spapr_pci_vfio_register_types)
diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
index 0934518..6808e96 100644
--- a/include/hw/pci-host/spapr.h
+++ b/include/hw/pci-host/spapr.h
@@ -30,10 +30,14 @@
 #define SPAPR_MSIX_MAX_DEVS 32
 
 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
+#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
 
 #define SPAPR_PCI_HOST_BRIDGE(obj) \
     OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
 
+#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
+    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
+
 #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
      OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
 #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
@@ -41,6 +45,7 @@
 
 typedef struct sPAPRPHBClass sPAPRPHBClass;
 typedef struct sPAPRPHBState sPAPRPHBState;
+typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
 
 struct sPAPRPHBClass {
     PCIHostBridgeClass parent_class;
@@ -76,6 +81,12 @@ struct sPAPRPHBState {
     QLIST_ENTRY(sPAPRPHBState) list;
 };
 
+struct sPAPRPHBVFIOState {
+    sPAPRPHBState phb;
+
+    int32_t iommugroupid;
+};
+
 #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
 
 #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
-- 
2.0.0


* [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr
  2014-06-05  5:49 [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64 Alexey Kardashevskiy
                   ` (2 preceding siblings ...)
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2014-06-05  5:50 ` Alexey Kardashevskiy
  2014-06-05 19:31   ` Alex Williamson
  3 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  5:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, Alexander Graf,
	Gavin Shan

This turns the sPAPR support on and enables VFIO container use
in the kernel.

This extends vfio_connect_container to support VFIO_SPAPR_TCE_IOMMU type
in the host kernel.

This registers a memory listener which the sPAPR IOMMU will notify when
executing H_PUT_TCE/etc DMA calls. The listener then notifies the host
kernel of DMA map/unmap operations via the VFIO_IOMMU_MAP_DMA/
VFIO_IOMMU_UNMAP_DMA ioctls.

This executes VFIO_IOMMU_ENABLE ioctl to make sure that the IOMMU is free
of mappings and can be exclusively given to the user. At the moment SPAPR
is the only platform requiring this call to be implemented.
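
The call order described above can be sketched as follows. This is an
illustration under assumptions rather than the patch itself: it needs a
Linux host with <linux/vfio.h>, demo_spapr_container_setup() is a
hypothetical name, and the memory listener registration is omitted.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical helper showing the ioctl sequence for an sPAPR TCE
 * container. Returns 0 on success or a negative errno value.
 */
static int demo_spapr_container_setup(int container_fd, int group_fd)
{
    int ret;

    /* 1. Does the host kernel offer the sPAPR TCE IOMMU model? */
    ret = ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU);
    if (ret < 0) {
        return -errno;
    }
    if (ret == 0) {
        return -ENODEV;   /* extension not supported */
    }

    /* 2. Attach the group to the container. */
    if (ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd) < 0) {
        return -errno;
    }

    /* 3. Select the IOMMU model for the container. */
    if (ioctl(container_fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU) < 0) {
        return -errno;
    }

    /* 4. sPAPR only: take exclusive ownership of the IOMMU. */
    if (ioctl(container_fd, VFIO_IOMMU_ENABLE) < 0) {
        return -errno;
    }

    return 0;
}
```

The sketch reads errno straight after each failing ioctl, before any
logging call gets a chance to overwrite it.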

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v7:
* added more details in commit log

v5:
* multiple returns converted to gotos

v4:
* fixed format string to use %m which is a glibc extension:
"Print output of strerror(errno). No argument is required."
---
 hw/misc/vfio.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 99141f3..c8e3aff 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -3650,6 +3650,34 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
 
         container->iommu_data.type1.initialized = true;
 
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+        if (ret) {
+            error_report("vfio: failed to enable container: %m");
+            ret = -errno;
+            goto free_container_exit;
+        }
+
+        container->iommu_data.type1.listener = vfio_memory_listener;
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.type1.listener,
+                                 container->space->as);
+
     } else {
         error_report("vfio: No available IOMMU models");
         ret = -EINVAL;
-- 
2.0.0


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional Alexey Kardashevskiy
@ 2014-06-05  6:43   ` Alexey Kardashevskiy
  2014-06-05 13:06     ` Alexander Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05  6:43 UTC (permalink / raw)
  To: qemu-devel; +Cc: Alex Williamson, qemu-ppc, Alexander Graf, Gavin Shan

On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows allocating
> TCE tables in the host kernel memory and handle H_PUT_TCE requests
> targeted to specific LIOBN (logical bus number) right in the host without
> switching to QEMU. At the moment this is used for emulated devices only
> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
> handler finds a LIOBN and corresponding table, it will put a TCE to
> the table and complete hypercall execution. The user space will not be
> notified.
> 
> Upcoming VFIO support is going to use the same sPAPRTCETable device class
> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
> tables for VFIO are going to be allocated in the host as well.
> However VFIO operates with real IOMMU tables and simple copying of
> a TCE to the real hardware TCE table will not work as guest physical
> to host physical address translation is required.
> 
> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
> register VFIO's TCE tables in the host.
> 
> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
> that sPAPRTCETable should not try allocating TCE table in the host kernel.
> Instead, the table will be created in QEMU.
> 
> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
> choose whether to use acceleration or not. At the moment it is enabled
> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> 
> This is a workaround but it lets me have one IOMMU device for VIO, emulated
> PCI and VFIO which is a good thing.
> 
> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
> this needs kernel update.


Never mind, I'll make it a capability. I'll post the capability reservation
patch separately.


> ---
>  hw/ppc/spapr_iommu.c   | 6 ++++--
>  hw/ppc/spapr_pci.c     | 2 +-
>  hw/ppc/spapr_vio.c     | 2 +-
>  include/hw/ppc/spapr.h | 4 +++-
>  4 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c
> index 3b6e373..bfd3701 100644
> --- a/hw/ppc/spapr_iommu.c
> +++ b/hw/ppc/spapr_iommu.c
> @@ -115,7 +115,7 @@ static int spapr_tce_table_realize(DeviceState *dev)
>  {
>      sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev);
>  
> -    if (kvm_enabled()) {
> +    if (tcet->kvm_accel && kvm_enabled()) {
>          tcet->table = kvmppc_create_spapr_tce(tcet->liobn,
>                                                tcet->nb_table <<
>                                                tcet->page_shift,
> @@ -143,7 +143,8 @@ static int spapr_tce_table_realize(DeviceState *dev)
>  sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>                                     uint64_t bus_offset,
>                                     uint32_t page_shift,
> -                                   uint32_t nb_table)
> +                                   uint32_t nb_table,
> +                                   bool kvm_accel)
>  {
>      sPAPRTCETable *tcet;
>  
> @@ -162,6 +163,7 @@ sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>      tcet->bus_offset = bus_offset;
>      tcet->page_shift = page_shift;
>      tcet->nb_table = nb_table;
> +    tcet->kvm_accel = kvm_accel;
>  
>      object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL);
>  
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index ddfd8bb..6021f35 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c
> @@ -658,7 +658,7 @@ static void spapr_phb_finish_realize(sPAPRPHBState *sphb, Error **errp)
>      tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
>                                 0,
>                                 SPAPR_TCE_PAGE_SHIFT,
> -                               0x40000000 >> SPAPR_TCE_PAGE_SHIFT);
> +                               0x40000000 >> SPAPR_TCE_PAGE_SHIFT, true);
>      if (!tcet) {
>          error_setg(errp, "Unable to create TCE table for %s",
>                     sphb->dtbusname);
> diff --git a/hw/ppc/spapr_vio.c b/hw/ppc/spapr_vio.c
> index 48b0125..16385e4 100644
> --- a/hw/ppc/spapr_vio.c
> +++ b/hw/ppc/spapr_vio.c
> @@ -460,7 +460,7 @@ static int spapr_vio_busdev_init(DeviceState *qdev)
>                                          0,
>                                          SPAPR_TCE_PAGE_SHIFT,
>                                          pc->rtce_window_size >>
> -                                        SPAPR_TCE_PAGE_SHIFT);
> +                                        SPAPR_TCE_PAGE_SHIFT, true);
>          address_space_init(&dev->as, spapr_tce_get_iommu(dev->tcet), qdev->id);
>      }
>  
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 4ffb903..7db34ff 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -402,6 +402,7 @@ struct sPAPRTCETable {
>      uint32_t page_shift;
>      uint64_t *table;
>      bool bypass;
> +    bool kvm_accel;
>      int fd;
>      MemoryRegion iommu;
>      QLIST_ENTRY(sPAPRTCETable) list;
> @@ -413,7 +414,8 @@ int spapr_h_cas_compose_response(target_ulong addr, target_ulong size);
>  sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>                                     uint64_t bus_offset,
>                                     uint32_t page_shift,
> -                                   uint32_t nb_table);
> +                                   uint32_t nb_table,
> +                                   bool kvm_accel);
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
>  void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass);
>  int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05  6:43   ` Alexey Kardashevskiy
@ 2014-06-05 13:06     ` Alexander Graf
  2014-06-05 13:10       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 13:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 05.06.14 08:43, Alexey Kardashevskiy wrote:
> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows allocating
>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>> targeted to specific LIOBN (logical bus number) right in the host without
>> switching to QEMU. At the moment this is used for emulated devices only
>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>> handler finds a LIOBN and corresponding table, it will put a TCE to
>> the table and complete hypercall execution. The user space will not be
>> notified.
>>
>> Upcoming VFIO support is going to use the same sPAPRTCETable device class
>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>> tables for VFIO are going to be allocated in the host as well.
>> However VFIO operates with real IOMMU tables and simple copying of
>> a TCE to the real hardware TCE table will not work as guest physical
>> to host physical address translation is required.
>>
>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>> register VFIO's TCE tables in the host.
>>
>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>> that sPAPRTCETable should not try allocating TCE table in the host kernel.
>> Instead, the table will be created in QEMU.
>>
>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>> choose whether to use acceleration or not. At the moment it is enabled
>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>>
>> This is a workaround but it lets me have one IOMMU device for VIO, emulated
>> PCI and VFIO which is a good thing.
>>
>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
>> this needs kernel update.
>
> Never mind, I'll make it a capability. I'll post capability reservation
> patch separately.

Just rename the flag from "kvm_accel" to "vfio_accel", set it to true 
for vfio and false for emulated devices. Then the spapr_iommu file can 
check on the capability (and default to false for now, since it doesn't 
exist yet).

That way you don't have to reserve a CAP today.
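
One possible reading of this suggestion, as a stand-alone sketch. All names
here are hypothetical: demo_cap_spapr_tce_vfio() stands for the future
capability check and simply returns false, since no such capability exists
yet, and the KVM state is passed in as a plain bool.

```c
#include <stdbool.h>

/* Placeholder for a future capability probe; always false for now. */
static bool demo_cap_spapr_tce_vfio(void)
{
    return false;
}

/*
 * Decides whether the TCE table should live in the host kernel.
 * vfio_accel is true for VFIO tables and false for emulated devices.
 */
static bool demo_use_in_kernel_table(bool vfio_accel, bool kvm_is_enabled)
{
    if (!kvm_is_enabled) {
        return false;                      /* no KVM, no in-kernel table */
    }
    if (vfio_accel) {
        return demo_cap_spapr_tce_vfio();  /* needs the new capability */
    }
    return true;                           /* emulated devices as before */
}
```

Under this reading VFIO tables stay in QEMU until the placeholder starts
returning true, and emulated devices keep the in-kernel table whenever KVM
is enabled.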


Alex


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 13:06     ` Alexander Graf
@ 2014-06-05 13:10       ` Alexey Kardashevskiy
  2014-06-05 13:15         ` Alexander Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 13:10 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/05/2014 11:06 PM, Alexander Graf wrote:
> 
> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows allocating
>>> TCE tables in the host kernel memory and handling H_PUT_TCE requests
>>> targeted to specific LIOBN (logical bus number) right in the host without
>>> switching to QEMU. At the moment this is used for emulated devices only
>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>> the table and complete hypercall execution. The user space will not be
>>> notified.
>>>
>>> Upcoming VFIO support is going to use the same sPAPRTCETable device class
>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>> tables for VFIO are going to be allocated in the host as well.
>>> However VFIO operates with real IOMMU tables and simple copying of
>>> a TCE to the real hardware TCE table will not work as guest physical
>>> to host physical address translation is required.
>>>
>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>>> register VFIO's TCEs in the host.
>>>
>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>> that sPAPRTCETable should not try allocating TCE table in the host kernel.
>>> Instead, the table will be created in QEMU.
>>>
>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>>> choose whether to use acceleration or not. At the moment it is enabled
>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>>
>>> This is a workaround but it lets me have one IOMMU device for VIO, emulated
>>> PCI and VFIO which is a good thing.
>>>
>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
>>> this needs kernel update.
>>
>> Never mind, I'll make it a capability. I'll post capability reservation
>> patch separately.
> 
> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
> vfio and false for emulated devices. Then the spapr_iommu file can check on
> the capability (and default to false for now, since it doesn't exist yet).

Is that ok if the flag does not have to do anything with VFIO per se? :)


> That way you don't have to reserve a CAP today.

Why exactly cannot we do that today?

How do we proceed with the rest of this patchset? Thanks!


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 13:10       ` Alexey Kardashevskiy
@ 2014-06-05 13:15         ` Alexander Graf
  2014-06-05 13:33           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 13:15 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 05.06.14 15:10, Alexey Kardashevskiy wrote:
> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows allocating
>>>> TCE tables in the host kernel memory and handling H_PUT_TCE requests
>>>> targeted to specific LIOBN (logical bus number) right in the host without
>>>> switching to QEMU. At the moment this is used for emulated devices only
>>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>> the table and complete hypercall execution. The user space will not be
>>>> notified.
>>>>
>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device class
>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>> tables for VFIO are going to be allocated in the host as well.
>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>> to host physical address translation is required.
>>>>
>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>>>> register VFIO's TCEs in the host.
>>>>
>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>> that sPAPRTCETable should not try allocating TCE table in the host kernel.
>>>> Instead, the table will be created in QEMU.
>>>>
>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>>>> choose whether to use acceleration or not. At the moment it is enabled
>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>> ---
>>>>
>>>> This is a workaround but it lets me have one IOMMU device for VIO, emulated
>>>> PCI and VFIO which is a good thing.
>>>>
>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
>>>> this needs kernel update.
>>> Never mind, I'll make it a capability. I'll post capability reservation
>>> patch separately.
>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
>> vfio and false for emulated devices. Then the spapr_iommu file can check on
>> the capability (and default to false for now, since it doesn't exist yet).
> Is that ok if the flag does not have to do anything with VFIO per se? :)

The flag means "use in-kernel acceleration if the vfio coupling 
capability is available", no?

>
>
>> That way you don't have to reserve a CAP today.
> Why exactly cannot we do that today?

Because the CAP namespace isn't a garbage bin we can just throw IDs at. 
Maybe we realize during patch review that we need completely different CAPs.


Alex


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 13:15         ` Alexander Graf
@ 2014-06-05 13:33           ` Alexey Kardashevskiy
  2014-06-05 13:36             ` Alexander Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 13:33 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/05/2014 11:15 PM, Alexander Graf wrote:
> 
> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>> allocating
>>>>> TCE tables in the host kernel memory and handling H_PUT_TCE requests
>>>>> targeted to specific LIOBN (logical bus number) right in the host without
>>>>> switching to QEMU. At the moment this is used for emulated devices only
>>>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>> the table and complete hypercall execution. The user space will not be
>>>>> notified.
>>>>>
>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device class
>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>> to host physical address translation is required.
>>>>>
>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>>>>> register VFIO's TCEs in the host.
>>>>>
>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>> kernel.
>>>>> Instead, the table will be created in QEMU.
>>>>>
>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>>>>> choose whether to use acceleration or not. At the moment it is enabled
>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>> ---
>>>>>
>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>> emulated
>>>>> PCI and VFIO which is a good thing.
>>>>>
>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
>>>>> this needs kernel update.
>>>> Never mind, I'll make it a capability. I'll post capability reservation
>>>> patch separately.
>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
>>> vfio and false for emulated devices. Then the spapr_iommu file can check on
>>> the capability (and default to false for now, since it doesn't exist yet).
>> Is that ok if the flag does not have to do anything with VFIO per se? :)
> 
> The flag means "use in-kernel acceleration if the vfio coupling capability
> is available", no?

It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
all, it is just an IOMMU. But if you are ok with it, I have no reason to be
unhappy either :)



>>> That way you don't have to reserve a CAP today.
>> Why exactly cannot we do that today?
> 
> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
> Maybe we realize during patch review that we need completely different CAPs.

That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to become available in
the kernel.



-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2014-06-05 13:34   ` Alexander Graf
  2014-06-05 14:37     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 13:34 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 05.06.14 07:50, Alexey Kardashevskiy wrote:
> The patch adds a spapr-pci-vfio-host-bridge device type
> which is a PCI Host Bridge with VFIO support. The new device
> inherits from the spapr-pci-host-bridge device and adds an "iommu"
> property which is an IOMMU id. This ID represents a minimal entity
> for which IOMMU isolation can be guaranteed. In SPAPR architecture IOMMU
> group is called a Partitionable Endpoint (PE).
>
> Current implementation supports one IOMMU id per QEMU VFIO PHB. Since
> SPAPR allows multiple PHB for no extra cost, this does not seem to
> be a problem. This limitation may change in the future though.
>
> Example of use:
> Configure and Add 3 functions of a multifunctional device to QEMU:
> (the NEC PCI USB card is used as an example here):
> -device spapr-pci-vfio-host-bridge,id=USB,iommu=4,index=7 \
> -device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
> -device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
> -device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
>
> where:
> * index=7 is a QEMU PHB index (used as source for MMIO/MSI/IO windows
> offset);
> * iommu=4 is an IOMMU id which can be found in sysfs:
> [aik@vpl2 ~]$ cd /sys/bus/pci/devices/0004:00:00.0/
> [aik@vpl2 0004:00:00.0]$ ls -l iommu_group
> lrwxrwxrwx 1 root root 0 Jun  5 12:49 iommu_group -> ../../../kernel/iommu_groups/4
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v7:
> * removed a bunch of properties from VFIO PHB such as "scan", "multifunction",
> "force_addr" - let management software deal with it
> * removed traces used in scan() (which is also removed)
> * updated license
> * disables in-kernel TCE table ("false" in spapr_tce_new_table())
>
> v5:
> * added handling of possible failure of spapr_vfio_new_table()
>
> v4:
> * moved IOMMU changes to separate patches
> * moved spapr-pci-vfio-host-bridge to new file
> ---
>   hw/ppc/Makefile.objs        |  3 ++
>   hw/ppc/spapr_pci_vfio.c     | 93 +++++++++++++++++++++++++++++++++++++++++++++
>   include/hw/pci-host/spapr.h | 11 ++++++
>   3 files changed, 107 insertions(+)
>   create mode 100644 hw/ppc/spapr_pci_vfio.c
>
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index ea747f0..edd44d0 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -4,6 +4,9 @@ obj-y += ppc.o ppc_booke.o
>   obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
>   obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>   obj-$(CONFIG_PSERIES) += spapr_pci.o
> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
> +obj-y += spapr_pci_vfio.o
> +endif
>   # PowerPC 4xx boards
>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>   obj-y += ppc4xx_pci.o
> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
> new file mode 100644
> index 0000000..ec52a18
> --- /dev/null
> +++ b/hw/ppc/spapr_pci_vfio.c
> @@ -0,0 +1,93 @@
> +/*
> + * QEMU sPAPR PCI host for VFIO
> + *
> + * Copyright (c) 2011-2014 Alexey Kardashevskiy, IBM Corporation.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License,
> + *  or (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + *  GNU General Public License for more details.
> + *
> + *  You should have received a copy of the GNU General Public License
> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +#include "hw/ppc/spapr.h"
> +#include "hw/pci-host/spapr.h"
> +#include "hw/misc/vfio.h"
> +
> +static Property spapr_phb_vfio_properties[] = {
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error **errp)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
> +    int ret;
> +    sPAPRTCETable *tcet;
> +    uint32_t liobn = svphb->phb.dma_liobn;
> +
> +    if (svphb->iommugroupid == -1) {
> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
> +        return;
> +    }
> +
> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
> +                                        svphb->iommugroupid,
> +                                        &info);
> +    if (ret) {
> +        error_setg_errno(errp, -ret,
> +                         "spapr-vfio: get info from container failed");
> +        return;
> +    }
> +
> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn, info.dma32_window_start,

So what happens when the device gets a different offset? Will it write 
into other devices or will DMAs just get ignored?


Alex

> +                               SPAPR_TCE_PAGE_SHIFT,
> +                               info.dma32_window_size >> SPAPR_TCE_PAGE_SHIFT,
> +                               false);
> +    if (!tcet) {
> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
> +        return;
> +    }
> +
> +    /* Register default 32bit DMA window */
> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
> +                                spapr_tce_get_iommu(tcet));
> +}
> +
> +static void spapr_phb_vfio_reset(DeviceState *qdev)
> +{
> +    /* Do nothing */
> +}
> +
> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
> +
> +    dc->props = spapr_phb_vfio_properties;
> +    dc->reset = spapr_phb_vfio_reset;
> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
> +}
> +
> +static const TypeInfo spapr_phb_vfio_info = {
> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
> +    .instance_size = sizeof(sPAPRPHBVFIOState),
> +    .class_init    = spapr_phb_vfio_class_init,
> +    .class_size    = sizeof(sPAPRPHBClass),
> +};
> +
> +static void spapr_pci_vfio_register_types(void)
> +{
> +    type_register_static(&spapr_phb_vfio_info);
> +}
> +
> +type_init(spapr_pci_vfio_register_types)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 0934518..6808e96 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -30,10 +30,14 @@
>   #define SPAPR_MSIX_MAX_DEVS 32
>   
>   #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>   
>   #define SPAPR_PCI_HOST_BRIDGE(obj) \
>       OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>   
> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
> +
>   #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
>        OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass), TYPE_SPAPR_PCI_HOST_BRIDGE)
>   #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
> @@ -41,6 +45,7 @@
>   
>   typedef struct sPAPRPHBClass sPAPRPHBClass;
>   typedef struct sPAPRPHBState sPAPRPHBState;
> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>   
>   struct sPAPRPHBClass {
>       PCIHostBridgeClass parent_class;
> @@ -76,6 +81,12 @@ struct sPAPRPHBState {
>       QLIST_ENTRY(sPAPRPHBState) list;
>   };
>   
> +struct sPAPRPHBVFIOState {
> +    sPAPRPHBState phb;
> +
> +    int32_t iommugroupid;
> +};
> +
>   #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>   
>   #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 13:33           ` Alexey Kardashevskiy
@ 2014-06-05 13:36             ` Alexander Graf
  2014-06-05 14:33               ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 13:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 05.06.14 15:33, Alexey Kardashevskiy wrote:
> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>>> allocating
>>>>>> TCE tables in the host kernel memory and handling H_PUT_TCE requests
>>>>>> targeted to specific LIOBN (logical bus number) right in the host without
>>>>>> switching to QEMU. At the moment this is used for emulated devices only
>>>>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>> the table and complete hypercall execution. The user space will not be
>>>>>> notified.
>>>>>>
>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device class
>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>>> to host physical address translation is required.
>>>>>>
>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>>>>>> register VFIO's TCEs in the host.
>>>>>>
>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>> kernel.
>>>>>> Instead, the table will be created in QEMU.
>>>>>>
>>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>>>>>> choose whether to use acceleration or not. At the moment it is enabled
>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>>
>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>> ---
>>>>>>
>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>> emulated
>>>>>> PCI and VFIO which is a good thing.
>>>>>>
>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO capability but
>>>>>> this needs kernel update.
>>>>> Never mind, I'll make it a capability. I'll post capability reservation
>>>>> patch separately.
>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
>>>> vfio and false for emulated devices. Then the spapr_iommu file can check on
>>>> the capability (and default to false for now, since it doesn't exist yet).
>>> Is that ok if the flag does not have to do anything with VFIO per se? :)
>> The flag means "use in-kernel acceleration if the vfio coupling capability
>> is available", no?
> It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
> all, it is just an IOMMU. But if you are ok with it, I have no reason to be
> unhappy either :)
>
>
>
>>>> That way you don't have to reserve a CAP today.
>>> Why exactly cannot we do that today?
>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>> Maybe we realize during patch review that we need completely different CAPs.
> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to become available in
> the kernel.

So all you need are 64bit TCEs with bus_offset? What about the missing 
in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's 
what this is really about.


Alex


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 13:36             ` Alexander Graf
@ 2014-06-05 14:33               ` Alexey Kardashevskiy
  2014-06-05 16:51                 ` Alexander Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 14:33 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/05/2014 11:36 PM, Alexander Graf wrote:
> 
> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>>>> allocating
>>>>>>> TCE tables in the host kernel memory and handling H_PUT_TCE requests
>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>> without
>>>>>>> switching to QEMU. At the moment this is used for emulated devices only
>>>>>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>> the table and complete hypercall execution. The user space will not be
>>>>>>> notified.
>>>>>>>
>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>> class
>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>>>> to host physical address translation is required.
>>>>>>>
>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had better not
>>>>>>> register VFIO's TCEs in the host.
>>>>>>>
>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>>> kernel.
>>>>>>> Instead, the table will be created in QEMU.
>>>>>>>
>>>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let users
>>>>>>> choose whether to use acceleration or not. At the moment it is enabled
>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>>>
>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>> ---
>>>>>>>
>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>> emulated
>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>
>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>> capability but
>>>>>>> this needs kernel update.
>>>>>> Never mind, I'll make it a capability. I'll post capability reservation
>>>>>> patch separately.
>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>> check on
>>>>> the capability (and default to false for now, since it doesn't exist
>>>>> yet).
>>>> Is that ok if the flag does not have to do anything with VFIO per se? :)
>>> The flag means "use in-kernel acceleration if the vfio coupling capability
>>> is available", no?
>> It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
>> all, it is just an IOMMU. But if you are ok with it, I have no reason to be
>> unhappy either :)
>>
>>
>>
>>>>> That way you don't have to reserve a CAP today.
>>>> Why exactly cannot we do that today?
>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>> Maybe we realize during patch review that we need completely different
>>> CAPs.
>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to become available in
>> the kernel.
> 
> So all you need are 64bit TCEs with bus_offset? 


No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
just 1 or 2GB and it is mapped at 0 on the PCI bus.

TCEs are 64 bit already.
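
To put numbers on the default window, here is a rough sketch of the sizing arithmetic. It assumes 4 KiB IOMMU pages (TCE_PAGE_SHIFT of 12 is an assumption for the example) and one 64-bit TCE per page; tce_entries() and tce_table_bytes() are illustrative helpers, not QEMU functions:

```c
#include <assert.h>
#include <stdint.h>

#define TCE_PAGE_SHIFT 12  /* assumption: 4 KiB IOMMU pages */

/* Number of TCE entries needed to cover a DMA window of window_size bytes. */
uint64_t tce_entries(uint64_t window_size)
{
    /* The window is expected to be a whole number of IOMMU pages. */
    assert((window_size & ((1ULL << TCE_PAGE_SHIFT) - 1)) == 0);
    return window_size >> TCE_PAGE_SHIFT;
}

/* Bytes of table storage, given that each TCE is a 64-bit value. */
uint64_t tce_table_bytes(uint64_t window_size)
{
    return tce_entries(window_size) * sizeof(uint64_t);
}
```

Under these assumptions a 1GB window needs 262144 entries (2 MiB of table) and a 2GB window needs 524288 entries (4 MiB). The entry width is not the limit; the range of bus addresses the window covers is.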



> What about the missing
> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
> what this is really about.

This I do not understand :(


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio
  2014-06-05 13:34   ` Alexander Graf
@ 2014-06-05 14:37     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 14:37 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/05/2014 11:34 PM, Alexander Graf wrote:
> 
> On 05.06.14 07:50, Alexey Kardashevskiy wrote:
>> The patch adds a spapr-pci-vfio-host-bridge device type
>> which is a PCI Host Bridge with VFIO support. The new device
>> inherits from the spapr-pci-host-bridge device and adds an "iommu"
>> property which is an IOMMU id. This ID represents a minimal entity
>> for which IOMMU isolation can be guaranteed. In SPAPR architecture IOMMU
>> group is called a Partitionable Endpoint (PE).
>>
>> Current implementation supports one IOMMU id per QEMU VFIO PHB. Since
>> SPAPR allows multiple PHB for no extra cost, this does not seem to
>> be a problem. This limitation may change in the future though.
>>
>> Example of use:
>> Configure and Add 3 functions of a multifunctional device to QEMU:
>> (the NEC PCI USB card is used as an example here):
>> -device spapr-pci-vfio-host-bridge,id=USB,iommu=4,index=7 \
>> -device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true
>> -device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB
>> -device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB
>>
>> where:
>> * index=7 is a QEMU PHB index (used as source for MMIO/MSI/IO windows
>> offset);
>> * iommu=4 is an IOMMU id which can be found in sysfs:
>> [aik@vpl2 ~]$ cd /sys/bus/pci/devices/0004:00:00.0/
>> [aik@vpl2 0004:00:00.0]$ ls -l iommu_group
>> lrwxrwxrwx 1 root root 0 Jun  5 12:49 iommu_group ->
>> ../../../kernel/iommu_groups/4
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v7:
>> * removed a bunch of properties from VFIO PHB such as "scan", "multifunction",
>> "force_addr" - let management software deal with it
>> * removed traces used in scan() (which is also removed)
>> * updated license
>> * disables in-kernel TCE table ("false" in spapr_tce_new_table())
>>
>> v5:
>> * added handling of possible failure of spapr_vfio_new_table()
>>
>> v4:
>> * moved IOMMU changes to separate patches
>> * moved spapr-pci-vfio-host-bridge to new file
>> ---
>>   hw/ppc/Makefile.objs        |  3 ++
>>   hw/ppc/spapr_pci_vfio.c     | 93
>> +++++++++++++++++++++++++++++++++++++++++++++
>>   include/hw/pci-host/spapr.h | 11 ++++++
>>   3 files changed, 107 insertions(+)
>>   create mode 100644 hw/ppc/spapr_pci_vfio.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index ea747f0..edd44d0 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -4,6 +4,9 @@ obj-y += ppc.o ppc_booke.o
>>   obj-$(CONFIG_PSERIES) += spapr.o spapr_vio.o spapr_events.o
>>   obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>   obj-$(CONFIG_PSERIES) += spapr_pci.o
>> +ifeq ($(CONFIG_PCI)$(CONFIG_PSERIES)$(CONFIG_LINUX), yyy)
>> +obj-y += spapr_pci_vfio.o
>> +endif
>>   # PowerPC 4xx boards
>>   obj-y += ppc405_boards.o ppc4xx_devs.o ppc405_uc.o ppc440_bamboo.o
>>   obj-y += ppc4xx_pci.o
>> diff --git a/hw/ppc/spapr_pci_vfio.c b/hw/ppc/spapr_pci_vfio.c
>> new file mode 100644
>> index 0000000..ec52a18
>> --- /dev/null
>> +++ b/hw/ppc/spapr_pci_vfio.c
>> @@ -0,0 +1,93 @@
>> +/*
>> + * QEMU sPAPR PCI host for VFIO
>> + *
>> + * Copyright (c) 2011-2014 Alexey Kardashevskiy, IBM Corporation.
>> + *
>> + *  This program is free software; you can redistribute it and/or modify
>> + *  it under the terms of the GNU General Public License as published by
>> + *  the Free Software Foundation; either version 2 of the License,
>> + *  or (at your option) any later version.
>> + *
>> + *  This program is distributed in the hope that it will be useful,
>> + *  but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + *  GNU General Public License for more details.
>> + *
>> + *  You should have received a copy of the GNU General Public License
>> + *  along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/pci-host/spapr.h"
>> +#include "hw/misc/vfio.h"
>> +
>> +static Property spapr_phb_vfio_properties[] = {
>> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void spapr_phb_vfio_finish_realize(sPAPRPHBState *sphb, Error
>> **errp)
>> +{
>> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(sphb);
>> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
>> +    int ret;
>> +    sPAPRTCETable *tcet;
>> +    uint32_t liobn = svphb->phb.dma_liobn;
>> +
>> +    if (svphb->iommugroupid == -1) {
>> +        error_setg(errp, "Wrong IOMMU group ID %d", svphb->iommugroupid);
>> +        return;
>> +    }
>> +
>> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
>> +                                        svphb->iommugroupid,
>> +                                        &info);
>> +    if (ret) {
>> +        error_setg_errno(errp, -ret,
>> +                         "spapr-vfio: get info from container failed");
>> +        return;
>> +    }
>> +
>> +    tcet = spapr_tce_new_table(DEVICE(sphb), liobn,
>> info.dma32_window_start,
> 
> So what happens when the device gets a different offset? Will it write into
> other devices or will DMAs just get ignored?


The device just gets an address which it will do DMA to/from. "Window"
here means that the PHB (or PCI bridge) is configured to allow accesses
inside that window and to block accesses outside it. The window can only
be configured on the host side, where the real PHB is available for
programming; we do this in the powernv code (IODA, IODA2, P7IOC, P7IOC2 -
these PHBs).
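
To illustrate the filtering described above, here is a rough model of a
window check in C. It is only a sketch: struct dma_window and
dma_window_allows() are made-up names for illustration and are not part
of QEMU, VFIO or the powernv code; the real check is done by the PHB
hardware.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a DMA window on the PCI bus. */
struct dma_window {
    uint64_t start; /* first bus address inside the window */
    uint64_t size;  /* window length in bytes */
};

/*
 * Returns true if a DMA access of 'len' bytes at bus address 'addr' lies
 * entirely inside the window, i.e. the bridge would let it through.
 * Anything else is treated as blocked.
 */
static bool dma_window_allows(const struct dma_window *w,
                              uint64_t addr, uint64_t len)
{
    assert(w != NULL);

    if (len == 0 || addr < w->start) {
        return false;
    }

    /* Compare offsets rather than addr + len to avoid overflow. */
    uint64_t off = addr - w->start;
    return off < w->size && len <= w->size - off;
}
```

With a 1GB window at bus address 0, an access at 0x1000 passes, while an
access starting at or crossing the 1GB boundary does not.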




> 
> 
> Alex
> 
>> +                               SPAPR_TCE_PAGE_SHIFT,
>> +                               info.dma32_window_size >>
>> SPAPR_TCE_PAGE_SHIFT,
>> +                               false);
>> +    if (!tcet) {
>> +        error_setg(errp, "spapr-vfio: failed to create VFIO TCE table");
>> +        return;
>> +    }
>> +
>> +    /* Register default 32bit DMA window */
>> +    memory_region_add_subregion(&sphb->iommu_root, tcet->bus_offset,
>> +                                spapr_tce_get_iommu(tcet));
>> +}
>> +
>> +static void spapr_phb_vfio_reset(DeviceState *qdev)
>> +{
>> +    /* Do nothing */
>> +}
>> +
>> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    sPAPRPHBClass *spc = SPAPR_PCI_HOST_BRIDGE_CLASS(klass);
>> +
>> +    dc->props = spapr_phb_vfio_properties;
>> +    dc->reset = spapr_phb_vfio_reset;
>> +    spc->finish_realize = spapr_phb_vfio_finish_realize;
>> +}
>> +
>> +static const TypeInfo spapr_phb_vfio_info = {
>> +    .name          = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
>> +    .parent        = TYPE_SPAPR_PCI_HOST_BRIDGE,
>> +    .instance_size = sizeof(sPAPRPHBVFIOState),
>> +    .class_init    = spapr_phb_vfio_class_init,
>> +    .class_size    = sizeof(sPAPRPHBClass),
>> +};
>> +
>> +static void spapr_pci_vfio_register_types(void)
>> +{
>> +    type_register_static(&spapr_phb_vfio_info);
>> +}
>> +
>> +type_init(spapr_pci_vfio_register_types)
>> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
>> index 0934518..6808e96 100644
>> --- a/include/hw/pci-host/spapr.h
>> +++ b/include/hw/pci-host/spapr.h
>> @@ -30,10 +30,14 @@
>>   #define SPAPR_MSIX_MAX_DEVS 32
>>     #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
>> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>>     #define SPAPR_PCI_HOST_BRIDGE(obj) \
>>       OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>>   +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
>> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
>> +
>>   #define SPAPR_PCI_HOST_BRIDGE_CLASS(klass) \
>>        OBJECT_CLASS_CHECK(sPAPRPHBClass, (klass),
>> TYPE_SPAPR_PCI_HOST_BRIDGE)
>>   #define SPAPR_PCI_HOST_BRIDGE_GET_CLASS(obj) \
>> @@ -41,6 +45,7 @@
>>     typedef struct sPAPRPHBClass sPAPRPHBClass;
>>   typedef struct sPAPRPHBState sPAPRPHBState;
>> +typedef struct sPAPRPHBVFIOState sPAPRPHBVFIOState;
>>     struct sPAPRPHBClass {
>>       PCIHostBridgeClass parent_class;
>> @@ -76,6 +81,12 @@ struct sPAPRPHBState {
>>       QLIST_ENTRY(sPAPRPHBState) list;
>>   };
>>   +struct sPAPRPHBVFIOState {
>> +    sPAPRPHBState phb;
>> +
>> +    int32_t iommugroupid;
>> +};
>> +
>>   #define SPAPR_PCI_BASE_BUID          0x800000020000000ULL
>>     #define SPAPR_PCI_WINDOW_BASE        0x10000000000ULL
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 14:33               ` Alexey Kardashevskiy
@ 2014-06-05 16:51                 ` Alexander Graf
  2014-06-05 23:17                   ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 16:51 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 05.06.14 16:33, Alexey Kardashevskiy wrote:
> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>> POWER KVM supports an KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>> allocating
>>>>>>>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>>> without
>>>>>>>> switching to QEMU. At the moment this is used for emulated devices only
>>>>>>>> and the handler only puts TCE to the table. If the in-kernel H_PUT_TCE
>>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>>> the table and complete hypercall execution. The user space will not be
>>>>>>>> notified.
>>>>>>>>
>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>> class
>>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>>>>> to host physical address translation is required.
>>>>>>>>
>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we better not
>>>>>>>> to register VFIO's TCE in the host.
>>>>>>>>
>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>>>> kernel.
>>>>>>>> Instead, the table will be created in QEMU.
>>>>>>>>
>>>>>>>> This adds an kvm_accel parameter to spapr_tce_new_table() to let users
>>>>>>>> choose whether to use acceleration or not. At the moment it is enabled
>>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>>>>
>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>>> emulated
>>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>>
>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>> capability but
>>>>>>>> this needs kernel update.
>>>>>>> Never mind, I'll make it a capability. I'll post capability reservation
>>>>>>> patch separately.
>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to true for
>>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>>> check on
>>>>>> the capability (and default to false for now, since it doesn't exist
>>>>>> yet).
>>>>> Is that ok if the flag does not have to do anything with VFIO per se? :)
>>>> The flag means "use in-kernel acceleration if the vfio coupling capability
>>>> is available", no?
>>> It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
>>> all, it is just an IOMMU. But if you are ok with it, I have no reason to be
>>> unhappy either :)
>>>
>>>
>>>
>>>>>> That way you don't have to reserve a CAP today.
>>>>> Why exactly cannot we do that today?
>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>> Maybe we realize during patch review that we need completely different
>>>> CAPs.
>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 be available in
>>> the kernel.
>> So all you need are 64bit TCEs with bus_offset?
>
> No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
> just 1 or 2GB and it is mapped at 0 on PCI bus.
>
> TCEs are 64 bit already.

Ok, so the guest has to tell the PCI device to write to a specific 
window. That's a shame :).

>
>> What about the missing
>> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
>> what this is really about.
> This I do not understand :(

How does real mode H_PUT_TCE emulation know that it needs to notify user 
space to establish the map?


Alex


* Re: [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info()
  2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info() Alexey Kardashevskiy
@ 2014-06-05 19:27   ` Alex Williamson
  2014-06-05 23:40     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2014-06-05 19:27 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Gavin Shan, Alexander Graf

On Thu, 2014-06-05 at 15:49 +1000, Alexey Kardashevskiy wrote:
> To perform DMA mapping via TCE table correctly, the guest must
> know where DMA window is located on the PCI bus. A hypervisor is
> expected to provide such information. Since QEMU has no control
> over this setting, we need a way to obtain a start address and size
> from the host VFIO driver.
> 
> This adds a helper which returns the default DMA window properties
> for the specific IOMMU group. The upstream kernel implements this ioctl
> already.

Couldn't this be done with Gavin's vfio_pci_container_ioctl()?  Thanks,

Alex


> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v7:
> * do not return a group fd from the helper
> 
> v6:
> * added dup() to protect group_fd from accidental disposal
> 
> v5:
> * reworked to reflect change in vfio_get_group() from one
> of previous patches change
> 
> v4:
> * fixed possible leaks on error paths
> ---
>  hw/misc/vfio.c         | 36 ++++++++++++++++++++++++++++++++++++
>  include/hw/misc/vfio.h | 11 +++++++++++
>  2 files changed, 47 insertions(+)
>  create mode 100644 include/hw/misc/vfio.h
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 7437c2e..99141f3 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -39,6 +39,7 @@
>  #include "qemu/range.h"
>  #include "sysemu/kvm.h"
>  #include "sysemu/sysemu.h"
> +#include "hw/misc/vfio.h"
>  
>  /* #define DEBUG_VFIO */
>  #ifdef DEBUG_VFIO
> @@ -4318,3 +4319,38 @@ static void register_vfio_pci_dev_type(void)
>  }
>  
>  type_init(register_vfio_pci_dev_type)
> +
> +int vfio_container_spapr_get_info(AddressSpace *as,
> +                                  int32_t groupid,
> +                                  struct vfio_iommu_spapr_tce_info *info)
> +{
> +    VFIOGroup *group;
> +    VFIOContainer *container;
> +    int ret, fd;
> +
> +    group = vfio_get_group(groupid, as);
> +    if (!group) {
> +        return -1;
> +    }
> +    container = group->container;
> +    if (!group->container) {
> +        goto put_group_exit;
> +    }
> +    fd = container->fd;
> +    if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +        goto put_group_exit;
> +    }
> +    ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
> +    if (ret) {
> +        error_report("vfio: failed to get iommu info for container: %s",
> +                     strerror(errno));
> +        goto put_group_exit;
> +    }
> +
> +    return 0;
> +
> +put_group_exit:
> +    vfio_put_group(group);
> +
> +    return -1;
> +}
> diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
> new file mode 100644
> index 0000000..e82f5a3
> --- /dev/null
> +++ b/include/hw/misc/vfio.h
> @@ -0,0 +1,11 @@
> +#ifndef VFIO_API_H
> +#define VFIO_API_H
> +
> +#include "qemu/typedefs.h"
> +#include <linux/vfio.h>
> +
> +extern int vfio_container_spapr_get_info(AddressSpace *as,
> +                                         int32_t groupid,
> +                                         struct vfio_iommu_spapr_tce_info *info);
> +
> +#endif


* Re: [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr
  2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr Alexey Kardashevskiy
@ 2014-06-05 19:31   ` Alex Williamson
  2014-06-05 23:39     ` Alexey Kardashevskiy
  0 siblings, 1 reply; 24+ messages in thread
From: Alex Williamson @ 2014-06-05 19:31 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-ppc, qemu-devel, Gavin Shan, Alexander Graf

On Thu, 2014-06-05 at 15:50 +1000, Alexey Kardashevskiy wrote:
> This turns the sPAPR support on and enables VFIO container use
> in the kernel.
> 
> This extends vfio_connect_container to support VFIO_SPAPR_TCE_IOMMU type
> in the host kernel.
> 
> This registers a memory listener which sPAPR IOMMU will notify when
> executing H_PUT_TCE/etc DMA calls. The listener then will notify the host
> kernel about DMA map/unmap operation via VFIO_IOMMU_MAP_DMA/
> VFIO_IOMMU_UNMAP_DMA ioctls.
> 
> This executes VFIO_IOMMU_ENABLE ioctl to make sure that the IOMMU is free
> of mappings and can be exclusively given to the user. At the moment SPAPR
> is the only platform requiring this call to be implemented.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> Changes:
> v7:
> * added more details in commit log
> 
> v5:
> * multiple returns converted to gotos
> 
> v4:
> * fixed format string to use %m which is a glibc extension:
> "Print output of strerror(errno). No argument is required."
> ---
>  hw/misc/vfio.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
> index 99141f3..c8e3aff 100644
> --- a/hw/misc/vfio.c
> +++ b/hw/misc/vfio.c
> @@ -3650,6 +3650,34 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>  
>          container->iommu_data.type1.initialized = true;
>  
> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
> +        if (ret) {
> +            error_report("vfio: failed to set group container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
> +        if (ret) {
> +            error_report("vfio: failed to set iommu for container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        ret = ioctl(fd, VFIO_IOMMU_ENABLE);

Where's the matching DISABLE?  Do we need a different release wrapper
that includes a disable?  Thanks,

Alex

> +        if (ret) {
> +            error_report("vfio: failed to enable container: %m");
> +            ret = -errno;
> +            goto free_container_exit;
> +        }
> +
> +        container->iommu_data.type1.listener = vfio_memory_listener;
> +        container->iommu_data.release = vfio_listener_release;
> +
> +        memory_listener_register(&container->iommu_data.type1.listener,
> +                                 container->space->as);
> +
>      } else {
>          error_report("vfio: No available IOMMU models");
>          ret = -EINVAL;


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 16:51                 ` Alexander Graf
@ 2014-06-05 23:17                   ` Alexey Kardashevskiy
  2014-06-05 23:36                     ` Alexander Graf
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 23:17 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/06/2014 02:51 AM, Alexander Graf wrote:
> 
> On 05.06.14 16:33, Alexey Kardashevskiy wrote:
>> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>>> POWER KVM supports an KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>>> allocating
>>>>>>>>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>>>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>>>> without
>>>>>>>>> switching to QEMU. At the moment this is used for emulated devices
>>>>>>>>> only
>>>>>>>>> and the handler only puts TCE to the table. If the in-kernel
>>>>>>>>> H_PUT_TCE
>>>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>>>> the table and complete hypercall execution. The user space will
>>>>>>>>> not be
>>>>>>>>> notified.
>>>>>>>>>
>>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>>> class
>>>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>>>>>> to host physical address translation is required.
>>>>>>>>>
>>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we
>>>>>>>>> better not
>>>>>>>>> to register VFIO's TCE in the host.
>>>>>>>>>
>>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>>>>> kernel.
>>>>>>>>> Instead, the table will be created in QEMU.
>>>>>>>>>
>>>>>>>>> This adds an kvm_accel parameter to spapr_tce_new_table() to let
>>>>>>>>> users
>>>>>>>>> choose whether to use acceleration or not. At the moment it is
>>>>>>>>> enabled
>>>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>>>> emulated
>>>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>>>
>>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>>> capability but
>>>>>>>>> this needs kernel update.
>>>>>>>> Never mind, I'll make it a capability. I'll post capability
>>>>>>>> reservation
>>>>>>>> patch separately.
>>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to
>>>>>>> true for
>>>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>>>> check on
>>>>>>> the capability (and default to false for now, since it doesn't exist
>>>>>>> yet).
>>>>>> Is that ok if the flag does not have to do anything with VFIO per se? :)
>>>>> The flag means "use in-kernel acceleration if the vfio coupling
>>>>> capability
>>>>> is available", no?
>>>> It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
>>>> all, it is just an IOMMU. But if you are ok with it, I have no reason
>>>> to be
>>>> unhappy either :)
>>>>
>>>>
>>>>
>>>>>>> That way you don't have to reserve a CAP today.
>>>>>> Why exactly cannot we do that today?
>>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>>> Maybe we realize during patch review that we need completely different
>>>>> CAPs.
>>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 be available in
>>>> the kernel.
>>> So all you need are 64bit TCEs with bus_offset?
>>
>> No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
>> just 1 or 2GB and it is mapped at 0 on PCI bus.
>>
>> TCEs are 64 bit already.
> 
> Ok, so the guest has to tell the PCI device to write to a specific window.
> That's a shame :).

No. The guest tells the device some address, that's it. The guest
allocates those addresses from a window which the host, the guest and the
PHB know about, but the device does not. What is the shame here?
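
As a rough illustration of how an address from such a window relates to
the TCE table, here is a small sketch in C. It assumes 4K IOMMU pages
(a page shift of 12) and mirrors the way the patch sizes the table as
window_size >> page_shift; struct tce_window, tce_window_init() and
tce_index() are hypothetical names, not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

#define TCE_PAGE_SHIFT 12 /* assumption: 4 KiB IOMMU pages */

/* Hypothetical view of a DMA window and the TCE table covering it. */
struct tce_window {
    uint64_t bus_offset; /* window start on the PCI bus */
    uint32_t page_shift; /* log2 of the IOMMU page size */
    uint32_t nb_table;   /* number of TCE entries */
};

/* Derive the table geometry from the window start and size in bytes. */
static struct tce_window tce_window_init(uint64_t start, uint64_t size)
{
    struct tce_window w;

    assert((size >> TCE_PAGE_SHIFT) <= UINT32_MAX);
    w.bus_offset = start;
    w.page_shift = TCE_PAGE_SHIFT;
    w.nb_table = (uint32_t)(size >> TCE_PAGE_SHIFT);
    return w;
}

/* Returns the TCE index for a bus address, or -1 if outside the window. */
static int64_t tce_index(const struct tce_window *w, uint64_t ioba)
{
    if (ioba < w->bus_offset) {
        return -1;
    }

    uint64_t index = (ioba - w->bus_offset) >> w->page_shift;
    return index < w->nb_table ? (int64_t)index : -1;
}
```

Under these assumptions a 1GB window at bus address 0 needs 262144 TCEs
and a 2GB window needs 524288; bus address 0x3000 lands in entry 3.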


> 
>>
>>> What about the missing
>>> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
>>> what this is really about.
>> This I do not understand :(
> 
> How does real mode H_PUT_TCE emulation know that it needs to notify user
> space to establish the map?

If it wants to pass control to the user space, it returns H_TOO_HARD. This
happens, for example, if LIOBN was not registered in KVM.
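
A simplified model of that fallback, as a sketch only: the structure and
function names below are invented for illustration and this is not the
kernel's handler. The return code values are assumed to match the Linux
hcall definitions, and 4K IOMMU pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return codes; values assumed to match the Linux hcall definitions. */
#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define H_TOO_HARD   9999

#define TCE_PAGE_SHIFT 12 /* assumption: 4 KiB IOMMU pages */

/* Hypothetical in-kernel record of one registered TCE table. */
struct kernel_tce_table {
    uint32_t liobn;      /* logical bus number the table belongs to */
    uint64_t *tces;      /* backing storage, one entry per IOMMU page */
    uint64_t nb_entries; /* number of entries in 'tces' */
};

/*
 * Fast-path H_PUT_TCE model: store the TCE if the LIOBN is registered,
 * otherwise return H_TOO_HARD so the caller exits to user space.
 */
static long fast_h_put_tce(struct kernel_tce_table *tables, size_t ntables,
                           uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    assert(tables != NULL || ntables == 0);

    for (size_t i = 0; i < ntables; i++) {
        struct kernel_tce_table *t = &tables[i];

        if (t->liobn != liobn) {
            continue;
        }

        uint64_t index = ioba >> TCE_PAGE_SHIFT;
        if (index >= t->nb_entries) {
            return H_PARAMETER;
        }

        t->tces[index] = tce;
        return H_SUCCESS;
    }

    /* LIOBN not registered in the kernel: let user space handle it. */
    return H_TOO_HARD;
}
```

In this model a table that was never registered (the VFIO case in this
series) always takes the H_TOO_HARD path and ends up in QEMU.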



-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 23:17                   ` Alexey Kardashevskiy
@ 2014-06-05 23:36                     ` Alexander Graf
  2014-06-05 23:48                       ` Alexey Kardashevskiy
  2014-06-06  3:38                       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 24+ messages in thread
From: Alexander Graf @ 2014-06-05 23:36 UTC (permalink / raw)
  To: Alexey Kardashevskiy, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan


On 06.06.14 01:17, Alexey Kardashevskiy wrote:
> On 06/06/2014 02:51 AM, Alexander Graf wrote:
>> On 05.06.14 16:33, Alexey Kardashevskiy wrote:
>>> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>>>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>> POWER KVM supports an KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>>>> allocating
>>>>>>>>>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>>>>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>>>>> without
>>>>>>>>>> switching to QEMU. At the moment this is used for emulated devices
>>>>>>>>>> only
>>>>>>>>>> and the handler only puts TCE to the table. If the in-kernel
>>>>>>>>>> H_PUT_TCE
>>>>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>>>>> the table and complete hypercall execution. The user space will
>>>>>>>>>> not be
>>>>>>>>>> notified.
>>>>>>>>>>
>>>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>>>> class
>>>>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means that TCE
>>>>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>>>>> a TCE to the real hardware TCE table will not work as guest physical
>>>>>>>>>> to host physical address translation is required.
>>>>>>>>>>
>>>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we
>>>>>>>>>> better not
>>>>>>>>>> to register VFIO's TCE in the host.
>>>>>>>>>>
>>>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device telling
>>>>>>>>>> that sPAPRTCETable should not try allocating TCE table in the host
>>>>>>>>>> kernel.
>>>>>>>>>> Instead, the table will be created in QEMU.
>>>>>>>>>>
>>>>>>>>>> This adds an kvm_accel parameter to spapr_tce_new_table() to let
>>>>>>>>>> users
>>>>>>>>>> choose whether to use acceleration or not. At the moment it is
>>>>>>>>>> enabled
>>>>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to false.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>>>>> emulated
>>>>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>>>>
>>>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>>>> capability but
>>>>>>>>>> this needs kernel update.
>>>>>>>>> Never mind, I'll make it a capability. I'll post capability
>>>>>>>>> reservation
>>>>>>>>> patch separately.
>>>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to
>>>>>>>> true for
>>>>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>>>>> check on
>>>>>>>> the capability (and default to false for now, since it doesn't exist
>>>>>>>> yet).
>>>>>>> Is that ok if the flag does not have to do anything with VFIO per se? :)
>>>>>> The flag means "use in-kernel acceleration if the vfio coupling
>>>>>> capability
>>>>>> is available", no?
>>>>> It is a flag of sPAPRTCETable which is not supposed to know about VFIO at
>>>>> all, it is just an IOMMU. But if you are ok with it, I have no reason
>>>>> to be
>>>>> unhappy either :)
>>>>>
>>>>>
>>>>>
>>>>>>>> That way you don't have to reserve a CAP today.
>>>>>>> Why exactly cannot we do that today?
>>>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>>>> Maybe we realize during patch review that we need completely different
>>>>>> CAPs.
>>>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 be available in
>>>>> the kernel.
>>>> So all you need are 64bit TCEs with bus_offset?
>>> No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
>>> just 1 or 2GB and it is mapped at 0 on PCI bus.
>>>
>>> TCEs are 64 bit already.
>> Ok, so the guest has to tell the PCI device to write to a specific window.
>> That's a shame :).
> No. Guest tells the device some address, that's it.  Guest allocates those
> addresses from some window which host, guest and PHB know about but not the
> device. What is a shame here?

It would be nicer if the guest had full control over the virtual address 
range of a PCI device.

>
>
>>>> What about the missing
>>>> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
>>>> what this is really about.
>>> This I do not understand :(
>> How does real mode H_PUT_TCE emulation know that it needs to notify user
>> space to establish the map?
> If it wants to pass control to the user space, it returns H_TOO_HARD. This
> happens, for example, if LIOBN was not registered in KVM.

So how does KVM_CAP_SPAPR_TCE_64 help here? With KVM_CAP_SPAPR_TCE_64 we 
still cannot map VFIO devices' TCE tables because we're missing all the 
magic to link the virtual TCE table to a physical TCE table.
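
For reference, the flag semantics proposed earlier in the thread ("use
in-kernel acceleration if the vfio coupling capability is available")
might look roughly like the sketch below. struct kvm_caps and
want_kernel_tce_table() are made-up names for illustration, and the
spapr_tce_vfio field stands for the not yet existing
KVM_CAP_SPAPR_TCE_VFIO, so it would always be false today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the relevant KVM capabilities. */
struct kvm_caps {
    bool spapr_tce;      /* KVM_CAP_SPAPR_TCE is available */
    bool spapr_tce_vfio; /* placeholder for a future VFIO coupling cap */
};

/*
 * Decide whether a TCE table should be allocated in the host kernel.
 * Emulated devices only need the plain capability; a VFIO table is only
 * accelerated if the kernel can link it to the real hardware table.
 */
static bool want_kernel_tce_table(bool kvm_enabled, bool vfio_accel,
                                  const struct kvm_caps *caps)
{
    assert(caps != NULL);

    if (!kvm_enabled) {
        return false;
    }
    return vfio_accel ? caps->spapr_tce_vfio : caps->spapr_tce;
}
```

With only the plain capability present, emulated devices get an in-kernel
table and VFIO tables stay in QEMU.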


Alex


* Re: [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr
  2014-06-05 19:31   ` Alex Williamson
@ 2014-06-05 23:39     ` Alexey Kardashevskiy
  0 siblings, 0 replies; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 23:39 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel, Gavin Shan, Alexander Graf

On 06/06/2014 05:31 AM, Alex Williamson wrote:
> On Thu, 2014-06-05 at 15:50 +1000, Alexey Kardashevskiy wrote:
>> This turns the sPAPR support on and enables VFIO container use
>> in the kernel.
>>
>> This extends vfio_connect_container to support VFIO_SPAPR_TCE_IOMMU type
>> in the host kernel.
>>
>> This registers a memory listener which sPAPR IOMMU will notify when
>> executing H_PUT_TCE/etc DMA calls. The listener then will notify the host
>> kernel about DMA map/unmap operation via VFIO_IOMMU_MAP_DMA/
>> VFIO_IOMMU_UNMAP_DMA ioctls.
>>
>> This executes VFIO_IOMMU_ENABLE ioctl to make sure that the IOMMU is free
>> of mappings and can be exclusively given to the user. At the moment SPAPR
>> is the only platform requiring this call to be implemented.
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v7:
>> * added more details in commit log
>>
>> v5:
>> * multiple returns converted to gotos
>>
>> v4:
>> * fixed format string to use %m which is a glibc extension:
>> "Print output of strerror(errno). No argument is required."
>> ---
>>  hw/misc/vfio.c | 28 ++++++++++++++++++++++++++++
>>  1 file changed, 28 insertions(+)
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 99141f3..c8e3aff 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -3650,6 +3650,34 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as)
>>  
>>          container->iommu_data.type1.initialized = true;
>>  
>> +    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
>> +        if (ret) {
>> +            error_report("vfio: failed to set group container: %m");
>> +            ret = -errno;
>> +            goto free_container_exit;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
>> +        if (ret) {
>> +            error_report("vfio: failed to set iommu for container: %m");
>> +            ret = -errno;
>> +            goto free_container_exit;
>> +        }
>> +
>> +        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
> 
> Where's the matching DISABLE?  Do we need a different release wrapper
> that includes a disable?  Thanks,


It happens in the host kernel when container fd is closed.
I added VFIO_IOMMU_DISABLE for symmetry but do not really need it.


> 
> Alex
> 
>> +        if (ret) {
>> +            error_report("vfio: failed to enable container: %m");
>> +            ret = -errno;
>> +            goto free_container_exit;
>> +        }
>> +
>> +        container->iommu_data.type1.listener = vfio_memory_listener;
>> +        container->iommu_data.release = vfio_listener_release;
>> +
>> +        memory_listener_register(&container->iommu_data.type1.listener,
>> +                                 container->space->as);
>> +
>>      } else {
>>          error_report("vfio: No available IOMMU models");
>>          ret = -EINVAL;
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info()
  2014-06-05 19:27   ` Alex Williamson
@ 2014-06-05 23:40     ` Alexey Kardashevskiy
  2014-06-06  1:32       ` Gavin Shan
  0 siblings, 1 reply; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 23:40 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-ppc, qemu-devel, Gavin Shan, Alexander Graf

On 06/06/2014 05:27 AM, Alex Williamson wrote:
> On Thu, 2014-06-05 at 15:49 +1000, Alexey Kardashevskiy wrote:
>> To perform DMA mapping via TCE table correctly, the guest must
>> know where DMA window is located on the PCI bus. A hypervisor is
>> expected to provide such information. Since QEMU has no control
>> over this setting, we need a way to obtain a start address and size
>> from the host VFIO driver.
>>
>> This adds a helper which returns the default DMA window properties
>> for the specific IOMMU group. The upstream kernel implements this ioctl
>> already.
> 
> Couldn't this be done with Gavin's vfio_pci_container_ioctl()?  Thanks,


Good point, I missed that. I'll merge two helpers into one and repost, then
Gavin will use it.


> 
> Alex
> 
> 
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> ---
>> Changes:
>> v7:
>> * do not return a group fd from the helper
>>
>> v6:
>> * added dup() to protect group_fd from accidental disposal
>>
>> v5:
>> * reworked to reflect change in vfio_get_group() from one
>> of previous patches change
>>
>> v4:
>> * fixed possible leaks on error paths
>> ---
>>  hw/misc/vfio.c         | 36 ++++++++++++++++++++++++++++++++++++
>>  include/hw/misc/vfio.h | 11 +++++++++++
>>  2 files changed, 47 insertions(+)
>>  create mode 100644 include/hw/misc/vfio.h
>>
>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>> index 7437c2e..99141f3 100644
>> --- a/hw/misc/vfio.c
>> +++ b/hw/misc/vfio.c
>> @@ -39,6 +39,7 @@
>>  #include "qemu/range.h"
>>  #include "sysemu/kvm.h"
>>  #include "sysemu/sysemu.h"
>> +#include "hw/misc/vfio.h"
>>  
>>  /* #define DEBUG_VFIO */
>>  #ifdef DEBUG_VFIO
>> @@ -4318,3 +4319,38 @@ static void register_vfio_pci_dev_type(void)
>>  }
>>  
>>  type_init(register_vfio_pci_dev_type)
>> +
>> +int vfio_container_spapr_get_info(AddressSpace *as,
>> +                                  int32_t groupid,
>> +                                  struct vfio_iommu_spapr_tce_info *info)
>> +{
>> +    VFIOGroup *group;
>> +    VFIOContainer *container;
>> +    int ret, fd;
>> +
>> +    group = vfio_get_group(groupid, as);
>> +    if (!group) {
>> +        return -1;
>> +    }
>> +    container = group->container;
>> +    if (!group->container) {
>> +        goto put_group_exit;
>> +    }
>> +    fd = container->fd;
>> +    if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>> +        goto put_group_exit;
>> +    }
>> +    ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>> +    if (ret) {
>> +        error_report("vfio: failed to get iommu info for container: %s",
>> +                     strerror(errno));
>> +        goto put_group_exit;
>> +    }
>> +
>> +    return 0;
>> +
>> +put_group_exit:
>> +    vfio_put_group(group);
>> +
>> +    return -1;
>> +}
>> diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
>> new file mode 100644
>> index 0000000..e82f5a3
>> --- /dev/null
>> +++ b/include/hw/misc/vfio.h
>> @@ -0,0 +1,11 @@
>> +#ifndef VFIO_API_H
>> +#define VFIO_API_H
>> +
>> +#include "qemu/typedefs.h"
>> +#include <linux/vfio.h>
>> +
>> +extern int vfio_container_spapr_get_info(AddressSpace *as,
>> +                                         int32_t groupid,
>> +                                         struct vfio_iommu_spapr_tce_info *info);
>> +
>> +#endif
> 
> 
> 


-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 23:36                     ` Alexander Graf
@ 2014-06-05 23:48                       ` Alexey Kardashevskiy
  2014-06-06  3:38                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 24+ messages in thread
From: Alexey Kardashevskiy @ 2014-06-05 23:48 UTC (permalink / raw)
  To: Alexander Graf, qemu-devel; +Cc: Alex Williamson, qemu-ppc, Gavin Shan

On 06/06/2014 09:36 AM, Alexander Graf wrote:
> 
> On 06.06.14 01:17, Alexey Kardashevskiy wrote:
>> On 06/06/2014 02:51 AM, Alexander Graf wrote:
>>> On 05.06.14 16:33, Alexey Kardashevskiy wrote:
>>>> On 06/05/2014 11:36 PM, Alexander Graf wrote:
>>>>> On 05.06.14 15:33, Alexey Kardashevskiy wrote:
>>>>>> On 06/05/2014 11:15 PM, Alexander Graf wrote:
>>>>>>> On 05.06.14 15:10, Alexey Kardashevskiy wrote:
>>>>>>>> On 06/05/2014 11:06 PM, Alexander Graf wrote:
>>>>>>>>> On 05.06.14 08:43, Alexey Kardashevskiy wrote:
>>>>>>>>>> On 06/05/2014 03:49 PM, Alexey Kardashevskiy wrote:
>>>>>>>>>>> POWER KVM supports a KVM_CAP_SPAPR_TCE capability which allows
>>>>>>>>>>> allocating
>>>>>>>>>>> TCE tables in the host kernel memory and handle H_PUT_TCE requests
>>>>>>>>>>> targeted to specific LIOBN (logical bus number) right in the host
>>>>>>>>>>> without
>>>>>>>>>>> switching to QEMU. At the moment this is used for emulated devices
>>>>>>>>>>> only
>>>>>>>>>>> and the handler only puts TCE to the table. If the in-kernel
>>>>>>>>>>> H_PUT_TCE
>>>>>>>>>>> handler finds a LIOBN and corresponding table, it will put a TCE to
>>>>>>>>>>> the table and complete hypercall execution. The user space will
>>>>>>>>>>> not be
>>>>>>>>>>> notified.
>>>>>>>>>>>
>>>>>>>>>>> Upcoming VFIO support is going to use the same sPAPRTCETable device
>>>>>>>>>>> class
>>>>>>>>>>> so KVM_CAP_SPAPR_TCE is going to be used as well. That means
>>>>>>>>>>> that TCE
>>>>>>>>>>> tables for VFIO are going to be allocated in the host as well.
>>>>>>>>>>> However VFIO operates with real IOMMU tables and simple copying of
>>>>>>>>>>> a TCE to the real hardware TCE table will not work as guest
>>>>>>>>>>> physical
>>>>>>>>>>> to host physical address translation is required.
>>>>>>>>>>>
>>>>>>>>>>> So until the host kernel gets VFIO support for H_PUT_TCE, we had
>>>>>>>>>>> better not
>>>>>>>>>>> register VFIO's TCE table in the host.
>>>>>>>>>>>
>>>>>>>>>>> This adds a bool @kvm_accel flag to the sPAPRTCETable device
>>>>>>>>>>> telling
>>>>>>>>>>> whether sPAPRTCETable should try allocating the TCE table in the host
>>>>>>>>>>> kernel.
>>>>>>>>>>> If the flag is false, the table will be created in QEMU instead.
>>>>>>>>>>>
>>>>>>>>>>> This adds a kvm_accel parameter to spapr_tce_new_table() to let
>>>>>>>>>>> users
>>>>>>>>>>> choose whether to use acceleration or not. At the moment it is
>>>>>>>>>>> enabled
>>>>>>>>>>> for VIO and emulated PCI. Upcoming VFIO support will set it to
>>>>>>>>>>> false.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>> This is a workaround but it lets me have one IOMMU device for VIO,
>>>>>>>>>>> emulated
>>>>>>>>>>> PCI and VFIO which is a good thing.
>>>>>>>>>>>
>>>>>>>>>>> The other way around would be a new KVM_CAP_SPAPR_TCE_VFIO
>>>>>>>>>>> capability but
>>>>>>>>>>> this needs kernel update.
>>>>>>>>>> Never mind, I'll make it a capability. I'll post capability
>>>>>>>>>> reservation
>>>>>>>>>> patch separately.
>>>>>>>>> Just rename the flag from "kvm_accel" to "vfio_accel", set it to
>>>>>>>>> true for
>>>>>>>>> vfio and false for emulated devices. Then the spapr_iommu file can
>>>>>>>>> check on
>>>>>>>>> the capability (and default to false for now, since it doesn't exist
>>>>>>>>> yet).
>>>>>>>> Is it OK if the flag has nothing to do with VFIO per
>>>>>>>> se? :)
>>>>>>> The flag means "use in-kernel acceleration if the vfio coupling
>>>>>>> capability
>>>>>>> is available", no?
>>>>>> It is a flag of sPAPRTCETable which is not supposed to know about
>>>>>> VFIO at
>>>>>> all, it is just an IOMMU. But if you are ok with it, I have no reason
>>>>>> to be
>>>>>> unhappy either :)
>>>>>>
>>>>>>
>>>>>>
>>>>>>>>> That way you don't have to reserve a CAP today.
>>>>>>>> Why exactly cannot we do that today?
>>>>>>> Because the CAP namespace isn't a garbage bin we can just throw IDs at.
>>>>>>> Maybe we realize during patch review that we need completely different
>>>>>>> CAPs.
>>>>>> That was my first plan - to wait for KVM_CAP_SPAPR_TCE_64 to be
>>>>>> available in
>>>>>> the kernel.
>>>>> So all you need are 64bit TCEs with bus_offset?
>>>> No. I need 64bit IOBAs a.k.a. PCI bus addresses. The default DMA window is
>>>> just 1 or 2GB and it is mapped at 0 on PCI bus.
>>>>
>>>> TCEs are 64 bit already.
>>> Ok, so the guest has to tell the PCI device to write to a specific window.
>>> That's a shame :).
>> No. Guest tells the device some address, that's it.  Guest allocates those
>> addresses from some window which host, guest and PHB know about but not the
>> device. What is a shame here?
> 
> It would be nicer if the guest had full control over the virtual address
> range of a PCI device.
>
>>>>> What about the missing
>>>>> in-kernel modification of the shadow TCEs on H_PUT_TCE? I thought that's
>>>>> what this is really about.
>>>> This I do not understand :(
>>> How does real mode H_PUT_TCE emulation know that it needs to notify user
>>> space to establish the map?
>> If it wants to pass control to the user space, it returns H_TOO_HARD. This
>> happens, for example, if LIOBN was not registered in KVM.
> 
> So how does KVM_CAP_SPAPR_TCE_64 help here? With KVM_CAP_SPAPR_TCE_64 we
> can still not map VFIO devices' TCE tables because we're missing all the
> magic to link the virtual TCE table to a physical TCE table.


It does not help here indeed, I did not say it would ;) I just wanted to
do the preparations first, and this means I need to reserve capability
numbers (which is normally a very tough process). Since one capability is
straightforward to implement, I included it in the set.
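
For illustration, the H_TOO_HARD fallback quoted above can be sketched as
follows. This is only a rough model, not the kernel's handler: the struct,
the helper names and the flat array of registered tables are hypothetical,
and the return code values are what I recall from the kernel's hvcall.h, so
treat them as assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Return codes; values are assumptions for this sketch. */
#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define H_TOO_HARD   9999   /* "let user space handle this hypercall" */

#define TCE_PAGE_SHIFT 12   /* 4K TCE pages */

/* Hypothetical record of a TCE table registered with the host. */
struct tce_table_sketch {
    uint32_t liobn;         /* logical bus number the table belongs to */
    uint64_t window_size;   /* bytes of bus address space covered */
    uint64_t *tces;         /* one 64-bit TCE per 4K page */
};

static struct tce_table_sketch *find_table(struct tce_table_sketch *tables,
                                           size_t n, uint32_t liobn)
{
    for (size_t i = 0; i < n; i++) {
        if (tables[i].liobn == liobn) {
            return &tables[i];
        }
    }
    return NULL;
}

/* Sketch of the fast path: store the TCE or punt to user space. */
long h_put_tce_sketch(struct tce_table_sketch *tables, size_t n,
                      uint32_t liobn, uint64_t ioba, uint64_t tce)
{
    struct tce_table_sketch *t = find_table(tables, n, liobn);

    if (!t) {
        /* LIOBN was never registered: user space must handle it. */
        return H_TOO_HARD;
    }
    if (ioba >= t->window_size) {
        return H_PARAMETER;
    }
    t->tces[ioba >> TCE_PAGE_SHIFT] = tce;
    return H_SUCCESS;
}
```

With a table created in QEMU only (kvm_accel false in patch 1/4), the LIOBN
is not registered in the host, so the lookup fails and every H_PUT_TCE ends
up in user space.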



-- 
Alexey


* Re: [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info()
  2014-06-05 23:40     ` Alexey Kardashevskiy
@ 2014-06-06  1:32       ` Gavin Shan
  0 siblings, 0 replies; 24+ messages in thread
From: Gavin Shan @ 2014-06-06  1:32 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Alex Williamson, qemu-ppc, qemu-devel, Gavin Shan, Alexander Graf

On Fri, Jun 06, 2014 at 09:40:56AM +1000, Alexey Kardashevskiy wrote:
>On 06/06/2014 05:27 AM, Alex Williamson wrote:
>> On Thu, 2014-06-05 at 15:49 +1000, Alexey Kardashevskiy wrote:
>>> To perform DMA mapping via TCE table correctly, the guest must
>>> know where DMA window is located on the PCI bus. A hypervisor is
>>> expected to provide such information. Since QEMU has no control
>>> over this setting, we need a way to obtain a start address and size
>>> from the host VFIO driver.
>>>
>>> This adds a helper which returns the default DMA window properties
>>> for the specific IOMMU group. The upstream kernel implements this ioctl
>>> already.
>> 
>> Couldn't this be done with Gavin's vfio_pci_container_ioctl()?  Thanks,
>
>
>Good point, I missed that. I'll merge two helpers into one and repost, then
>Gavin will use it.
>

Sure, I'll rebase my code on top of Alexey's. But Alex raised more comments
about the function. I'm just copying and pasting them here so that we don't
miss them:

- "fd == 0" is valid
- In addition to fd 0 being valid, there are some whitespace issues here.

  Passing an integer option is not very extensible, maybe a void* that
  gets cast to an int* for VFIO_EEH_PE_OP would be better.  It's a qemu
  internal API though, so I'm not going to sweat saving that problem for
  the next user.  Thanks,
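
A minimal sketch of what those two comments could look like in code. This is
not the actual vfio_pci_container_ioctl(); the struct and function names are
hypothetical stand-ins, and only the fd check and the void* parameter are
the point here.

```c
#include <errno.h>
#include <sys/ioctl.h>

/* Hypothetical stand-in for the container; only the fd matters here. */
struct container_sketch {
    int fd;
};

/* 0 is a valid descriptor, so only negative values mean "no fd". */
static inline int fd_is_valid(int fd)
{
    return fd >= 0;
}

/*
 * Generic pass-through: the argument is a void* so each request can bring
 * its own structure (or an int* for an op code) without changing the
 * signature. Returns the ioctl result, or a negative errno on failure.
 */
int container_ioctl_sketch(const struct container_sketch *c,
                           unsigned long req, void *param)
{
    int ret;

    if (!c || !fd_is_valid(c->fd)) {
        return -EBADF;
    }
    ret = ioctl(c->fd, req, param);
    return ret < 0 ? -errno : ret;
}
```

Testing with "if (!fd)" would wrongly reject a container that happens to own
descriptor 0, which is why the check above is "fd < 0".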

Thanks,
Gavin

>> 
>> Alex
>> 
>> 
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>> Changes:
>>> v7:
>>> * do not return a group fd from the helper
>>>
>>> v6:
>>> * added dup() to protect group_fd from accidental disposal
>>>
>>> v5:
>>> * reworked to reflect change in vfio_get_group() from one
>>> of previous patches change
>>>
>>> v4:
>>> * fixed possible leaks on error paths
>>> ---
>>>  hw/misc/vfio.c         | 36 ++++++++++++++++++++++++++++++++++++
>>>  include/hw/misc/vfio.h | 11 +++++++++++
>>>  2 files changed, 47 insertions(+)
>>>  create mode 100644 include/hw/misc/vfio.h
>>>
>>> diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
>>> index 7437c2e..99141f3 100644
>>> --- a/hw/misc/vfio.c
>>> +++ b/hw/misc/vfio.c
>>> @@ -39,6 +39,7 @@
>>>  #include "qemu/range.h"
>>>  #include "sysemu/kvm.h"
>>>  #include "sysemu/sysemu.h"
>>> +#include "hw/misc/vfio.h"
>>>  
>>>  /* #define DEBUG_VFIO */
>>>  #ifdef DEBUG_VFIO
>>> @@ -4318,3 +4319,38 @@ static void register_vfio_pci_dev_type(void)
>>>  }
>>>  
>>>  type_init(register_vfio_pci_dev_type)
>>> +
>>> +int vfio_container_spapr_get_info(AddressSpace *as,
>>> +                                  int32_t groupid,
>>> +                                  struct vfio_iommu_spapr_tce_info *info)
>>> +{
>>> +    VFIOGroup *group;
>>> +    VFIOContainer *container;
>>> +    int ret, fd;
>>> +
>>> +    group = vfio_get_group(groupid, as);
>>> +    if (!group) {
>>> +        return -1;
>>> +    }
>>> +    container = group->container;
>>> +    if (!group->container) {
>>> +        goto put_group_exit;
>>> +    }
>>> +    fd = container->fd;
>>> +    if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
>>> +        goto put_group_exit;
>>> +    }
>>> +    ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info);
>>> +    if (ret) {
>>> +        error_report("vfio: failed to get iommu info for container: %s",
>>> +                     strerror(errno));
>>> +        goto put_group_exit;
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +put_group_exit:
>>> +    vfio_put_group(group);
>>> +
>>> +    return -1;
>>> +}
>>> diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h
>>> new file mode 100644
>>> index 0000000..e82f5a3
>>> --- /dev/null
>>> +++ b/include/hw/misc/vfio.h
>>> @@ -0,0 +1,11 @@
>>> +#ifndef VFIO_API_H
>>> +#define VFIO_API_H
>>> +
>>> +#include "qemu/typedefs.h"
>>> +#include <linux/vfio.h>
>>> +
>>> +extern int vfio_container_spapr_get_info(AddressSpace *as,
>>> +                                         int32_t groupid,
>>> +                                         struct vfio_iommu_spapr_tce_info *info);
>>> +
>>> +#endif
>> 
>> 
>> 
>
>
>-- 
>Alexey
>


* Re: [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional
  2014-06-05 23:36                     ` Alexander Graf
  2014-06-05 23:48                       ` Alexey Kardashevskiy
@ 2014-06-06  3:38                       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 24+ messages in thread
From: Benjamin Herrenschmidt @ 2014-06-06  3:38 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Alexey Kardashevskiy, Alex Williamson, qemu-ppc, qemu-devel,
	Gavin Shan

On Fri, 2014-06-06 at 01:36 +0200, Alexander Graf wrote:
> 
> It would be nicer if the guest had full control over the virtual
> address range of a PCI device.

It does ... within a HW window which can be different between P7 and
P8. 

On P7 all PEs on a PHB share a single DMA address space that gets sliced
up, I won't get into details on what kind of slices are available,
suffice to say we provide a single smallish window in 32-bit space for
each PE and the guest controls the 4k TCEs in there.

On P8, each PE has its own DMA address space which has 2 windows, one at
0 and one at 0x0800_0000_0000_0000.

By default we configure the 0 window for 32-bit/4K TCE remapping (we set
it to 2G window) for compatibility with existing PAPR expectations.

The high window is used in the host as a bypass. We disable TCEs and use
a direct mapping to physical memory through it instead to allow the host
drivers that are 64-bit DMA capable to have the fastest possible access
to memory.

When we pass-through a device today we disable that second window for
obvious reasons.

With Alexey's patches, we'll be able to control it, which will in turn
allow us to implement the PAPR "DDW" extension which allows the guest to
populate that second window. Typically the guest will use it to create a
full mapping of its entire address space in 64-bit space using the
largest possible TCE size (whose size is constrained by the page size
used to back the guest memory).

Here too, within those windows, the guest has control of the mappings.
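
To put rough numbers on the windows described above, here is a small sketch
of the arithmetic. The helper names are made up for illustration; the 8-byte
entry size follows from TCEs being 64 bit, and the 16M page size in the
usage note is only an example, not something stated in this thread.

```c
#include <stdint.h>

/* Each TCE is a 64-bit entry. */
#define TCE_ENTRY_BYTES 8u

/* P8 window bases as described above. */
#define P8_LOW_WINDOW_BASE   0x0ULL
#define P8_HIGH_WINDOW_BASE  0x0800000000000000ULL

/* Number of TCEs needed to cover a window with a given TCE page size. */
uint64_t tce_count(uint64_t window_bytes, unsigned page_shift)
{
    return window_bytes >> page_shift;
}

/* Memory taken by the TCE table itself. */
uint64_t tce_table_bytes(uint64_t window_bytes, unsigned page_shift)
{
    return tce_count(window_bytes, page_shift) * TCE_ENTRY_BYTES;
}

/* Index of the TCE that translates a given bus address in a window. */
uint64_t tce_index(uint64_t window_base, uint64_t bus_addr,
                   unsigned page_shift)
{
    return (bus_addr - window_base) >> page_shift;
}
```

For the default 2G window with 4K TCEs this gives 524288 entries, i.e. a
4 MiB table. Mapping, say, a 64 GiB guest through the high window with 16M
TCEs would need only 4096 entries, which is why the largest possible TCE
size is attractive there.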

Cheers,
Ben.



Thread overview: 24+ messages
2014-06-05  5:49 [Qemu-devel] [PATCH v7 0/4] vfio on spapr-ppc64 Alexey Kardashevskiy
2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 1/4] spapr_iommu: Make in-kernel TCE table optional Alexey Kardashevskiy
2014-06-05  6:43   ` Alexey Kardashevskiy
2014-06-05 13:06     ` Alexander Graf
2014-06-05 13:10       ` Alexey Kardashevskiy
2014-06-05 13:15         ` Alexander Graf
2014-06-05 13:33           ` Alexey Kardashevskiy
2014-06-05 13:36             ` Alexander Graf
2014-06-05 14:33               ` Alexey Kardashevskiy
2014-06-05 16:51                 ` Alexander Graf
2014-06-05 23:17                   ` Alexey Kardashevskiy
2014-06-05 23:36                     ` Alexander Graf
2014-06-05 23:48                       ` Alexey Kardashevskiy
2014-06-06  3:38                       ` Benjamin Herrenschmidt
2014-06-05  5:49 ` [Qemu-devel] [PATCH v7 2/4] vfio: Add vfio_container_spapr_get_info() Alexey Kardashevskiy
2014-06-05 19:27   ` Alex Williamson
2014-06-05 23:40     ` Alexey Kardashevskiy
2014-06-06  1:32       ` Gavin Shan
2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 3/4] spapr_pci_vfio: Add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2014-06-05 13:34   ` Alexander Graf
2014-06-05 14:37     ` Alexey Kardashevskiy
2014-06-05  5:50 ` [Qemu-devel] [PATCH v7 4/4] vfio: Enable for spapr Alexey Kardashevskiy
2014-06-05 19:31   ` Alex Williamson
2014-06-05 23:39     ` Alexey Kardashevskiy
