* [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself @ 2013-08-07 8:21 Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space Alexey Kardashevskiy ` (7 more replies) 0 siblings, 8 replies; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson This patch series represents a third attempt at better integration of the VFIO code with QEMU's handling of guest IOMMUs. David posted the two previous versions of this series; this posting is his rework after the last round of review. It also contains a working VFIO driver for QEMU which depends on the MSIX rework and IRQFD patches (not yet upstream), so it won't work as is, but I am pretty sure there is still enough to discuss :) Thanks! Alexey Kardashevskiy (4): vfio: Add guest side IOMMU support spapr vfio: add vfio_container_spapr_get_info() spapr vfio: add spapr-pci-vfio-host-bridge to support vfio spapr vfio: enable for spapr David Gibson (4): pci: Introduce helper to retrieve a PCI device's DMA address space memory: Sanity check that no listeners remain on a destroyed AddressSpace vfio: Introduce VFIO address spaces vfio: Create VFIOAddressSpace objects as needed hw/misc/vfio.c | 300 +++++++++++++++++++++++++++++++++++++++----- hw/pci/pci.c | 9 +- hw/ppc/spapr_iommu.c | 176 ++++++++++++++++++++++---- hw/ppc/spapr_pci.c | 209 +++++++++++++++++++++++++++--- include/hw/misc/vfio.h | 11 ++ include/hw/pci-host/spapr.h | 12 ++ include/hw/pci/pci.h | 1 + include/hw/ppc/spapr.h | 19 +++ memory.c | 7 ++ target-ppc/kvm.c | 33 +++++ target-ppc/kvm_ppc.h | 12 ++ trace-events | 4 + 12 files changed, 724 insertions(+), 69 deletions(-) create mode 100644 include/hw/misc/vfio.h -- 1.8.3.2 ^ permalink raw reply [flat|nested] 21+ messages in
thread
* [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-07 10:27 ` Michael S. Tsirkin 2013-08-07 8:21 ` [Qemu-devel] [PATCH 2/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy ` (6 subsequent siblings) 7 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, aik, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson From: David Gibson <david@gibson.dropbear.id.au> A PCI device's DMA address space (possibly an IOMMU) is returned by a method on the PCIBus. At the moment that only has one caller, so the method is simply open coded. We'll need another caller for VFIO, so this patch introduces a helper/wrapper function. 
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- hw/pci/pci.c | 9 ++++++++- include/hw/pci/pci.h | 1 + 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/hw/pci/pci.c b/hw/pci/pci.c index 4bce3e7..3cea25f 100644 --- a/hw/pci/pci.c +++ b/hw/pci/pci.c @@ -813,7 +813,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, pci_dev->bus = bus; if (bus->iommu_fn) { - dma_as = bus->iommu_fn(bus, bus->iommu_opaque, devfn); + dma_as = pci_iommu_as(pci_dev); } else { /* FIXME: inherit memory region from bus creator */ dma_as = &address_space_memory; @@ -2248,6 +2248,13 @@ static void pci_device_class_init(ObjectClass *klass, void *data) k->props = pci_props; } +AddressSpace *pci_iommu_as(PCIDevice *dev) +{ + PCIBus *bus = PCI_BUS(dev->bus); + + return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn); +} + void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque) { bus->iommu_fn = fn; diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h index b6ad9e4..614f809 100644 --- a/include/hw/pci/pci.h +++ b/include/hw/pci/pci.h @@ -407,6 +407,7 @@ void pci_device_deassert_intx(PCIDevice *dev); typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int); +AddressSpace *pci_iommu_as(PCIDevice *dev); void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque); static inline void -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
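The pattern this patch introduces is easy to model outside QEMU. Below is a simplified, self-contained sketch (the types are illustrative stand-ins, not QEMU's real structures): the bus stores an IOMMU callback plus an opaque pointer, and a helper resolves a device's DMA address space. Note one deliberate simplification: this sketch folds in the system-memory fallback, whereas the patch's actual pci_iommu_as() calls bus->iommu_fn unconditionally and leaves the fallback in do_pci_register_device().

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for QEMU's types, not the real definitions. */
typedef struct AddressSpace { const char *name; } AddressSpace;

static AddressSpace address_space_memory = { "memory" };

typedef struct PCIBus PCIBus;
typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int);

struct PCIBus {
    PCIIOMMUFunc iommu_fn;   /* set by pci_setup_iommu() in the real code */
    void *iommu_opaque;
};

typedef struct PCIDevice {
    PCIBus *bus;
    int devfn;
} PCIDevice;

/* Resolve the DMA address space for a device: ask the bus's IOMMU
 * callback if one is registered, otherwise fall back to system memory. */
AddressSpace *pci_iommu_as(PCIDevice *dev)
{
    PCIBus *bus = dev->bus;

    if (bus->iommu_fn) {
        return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn);
    }
    return &address_space_memory;   /* no IOMMU on this bus */
}

/* Example bus with a trivial IOMMU callback returning a per-bus space. */
static AddressSpace iommu_as = { "per-bus-iommu" };

static AddressSpace *example_iommu_fn(PCIBus *bus, void *opaque, int devfn)
{
    (void)bus; (void)opaque; (void)devfn;
    return &iommu_as;
}

static PCIBus plain_bus = { NULL, NULL };
static PCIBus iommu_bus = { example_iommu_fn, NULL };
static PCIDevice dev_plain = { &plain_bus, 0 };
static PCIDevice dev_iommu = { &iommu_bus, 8 };
```

The point of the wrapper is exactly what the commit message says: a second caller (VFIO) is coming, and open-coding the callback invocation twice would duplicate the fallback logic.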
* Re: [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space 2013-08-07 8:21 ` [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space Alexey Kardashevskiy @ 2013-08-07 10:27 ` Michael S. Tsirkin 0 siblings, 0 replies; 21+ messages in thread From: Michael S. Tsirkin @ 2013-08-07 10:27 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Alexander Graf, qemu-devel, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson On Wed, Aug 07, 2013 at 06:21:25PM +1000, Alexey Kardashevskiy wrote: > From: David Gibson <david@gibson.dropbear.id.au> > > A PCI device's DMA address space (possibly an IOMMU) is returned by a > method on the PCIBus. At the moment that only has one caller, so the > method is simply open coded. We'll need another caller for VFIO, so > this patch introduces a helper/wrapper function. > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > hw/pci/pci.c | 9 ++++++++- > include/hw/pci/pci.h | 1 + > 2 files changed, 9 insertions(+), 1 deletion(-) > > diff --git a/hw/pci/pci.c b/hw/pci/pci.c > index 4bce3e7..3cea25f 100644 > --- a/hw/pci/pci.c > +++ b/hw/pci/pci.c > @@ -813,7 +813,7 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus, > > pci_dev->bus = bus; > if (bus->iommu_fn) { > - dma_as = bus->iommu_fn(bus, bus->iommu_opaque, devfn); > + dma_as = pci_iommu_as(pci_dev); > } else { > /* FIXME: inherit memory region from bus creator */ > dma_as = &address_space_memory; > @@ -2248,6 +2248,13 @@ static void pci_device_class_init(ObjectClass *klass, void *data) > k->props = pci_props; > } > > +AddressSpace *pci_iommu_as(PCIDevice *dev) > +{ > + PCIBus *bus = PCI_BUS(dev->bus); > + > + return bus->iommu_fn(bus, bus->iommu_opaque, dev->devfn); > +} > + > void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque) > { > bus->iommu_fn = fn; > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h > index 
b6ad9e4..614f809 100644 > --- a/include/hw/pci/pci.h > +++ b/include/hw/pci/pci.h > @@ -407,6 +407,7 @@ void pci_device_deassert_intx(PCIDevice *dev); > > typedef AddressSpace *(*PCIIOMMUFunc)(PCIBus *, void *, int); > > +AddressSpace *pci_iommu_as(PCIDevice *dev); I prefer full names, and it needs to mention "device". How about: AddressSpace *pci_device_iommu_address_space(PCIDevice *dev); > void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, void *opaque); > > static inline void > -- > 1.8.3.2 ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Qemu-devel] [PATCH 2/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces Alexey Kardashevskiy ` (5 subsequent siblings) 7 siblings, 0 replies; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, aik, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson From: David Gibson <david@gibson.dropbear.id.au> At the moment, most AddressSpace objects last as long as the guest system in practice, but that could well change in future. In addition, for VFIO we will be introducing some private per-AddressSpace information, which must be disposed of before the AddressSpace itself is destroyed. To reduce the chances of subtle bugs in this area, this patch adds assertions to ensure that when an AddressSpace is destroyed, there are no remaining MemoryListeners using that AS as a filter.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- memory.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/memory.c b/memory.c index 886f838..ffedde7 100644 --- a/memory.c +++ b/memory.c @@ -1726,11 +1726,18 @@ void address_space_init(AddressSpace *as, MemoryRegion *root, const char *name) void address_space_destroy(AddressSpace *as) { + MemoryListener *listener; + /* Flush out anything from MemoryListeners listening in on this */ memory_region_transaction_begin(); as->root = NULL; memory_region_transaction_commit(); QTAILQ_REMOVE(&address_spaces, as, address_spaces_link); + + QTAILQ_FOREACH(listener, &memory_listeners, link) { + assert(listener->address_space_filter != as); + } + address_space_destroy_dispatch(as); flatview_unref(as->current_map); g_free(as->name); -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
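The check the patch adds can be modeled in isolation: keep a global list of listeners and, at destroy time, verify that none still filters on the dying address space. A minimal sketch with stand-in types (a hand-rolled singly linked list instead of QEMU's QTAILQ):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins, not QEMU's real structures. */
typedef struct AddressSpace { const char *name; } AddressSpace;

typedef struct MemoryListener {
    AddressSpace *address_space_filter;
    struct MemoryListener *next;
} MemoryListener;

static MemoryListener *memory_listeners;   /* global listener list */

void listener_register(MemoryListener *l, AddressSpace *filter)
{
    l->address_space_filter = filter;
    l->next = memory_listeners;
    memory_listeners = l;
}

void listener_unregister(MemoryListener *l)
{
    MemoryListener **p = &memory_listeners;

    while (*p && *p != l) {
        p = &(*p)->next;
    }
    if (*p) {
        *p = l->next;
    }
}

/* Count listeners still filtering on 'as'; the real patch asserts this
 * is zero for every listener inside address_space_destroy(). */
int listeners_filtering(AddressSpace *as)
{
    int n = 0;
    MemoryListener *l;

    for (l = memory_listeners; l; l = l->next) {
        n += (l->address_space_filter == as);
    }
    return n;
}

static AddressSpace test_as = { "being-destroyed" };
static MemoryListener test_listener;
```

The invariant is that whoever registered a filtered listener (VFIO's per-container listener, after this series) must unregister it before the address space goes away; the assertion turns a silent use-after-free into an immediate, debuggable failure.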
* [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 2/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-12 22:07 ` Alex Williamson 2013-08-07 8:21 ` [Qemu-devel] [PATCH 4/8] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy ` (4 subsequent siblings) 7 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, aik, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson From: David Gibson <david@gibson.dropbear.id.au> The only model so far supported for VFIO passthrough devices is the model usually used on x86, where all of the guest's RAM is mapped into the (host) IOMMU and there is no IOMMU visible in the guest. This patch begins to relax this model, introducing the notion of a VFIOAddressSpace. This represents a logical DMA address space which will be visible to one or more VFIO devices by appropriate mapping in the (host) IOMMU. Thus the currently global list of containers becomes local to a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO group to multiple address spaces. For now, only one VFIOAddressSpace is created and used, corresponding to main system memory, that will change in future patches. 
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- hw/misc/vfio.c | 57 +++++++++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 45 insertions(+), 12 deletions(-) diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index adcd23d..bcd67a0 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -130,9 +130,17 @@ enum { VFIO_INT_MSIX = 3, }; +typedef struct VFIOAddressSpace { + AddressSpace *as; + QLIST_HEAD(, VFIOContainer) containers; +} VFIOAddressSpace; + +static VFIOAddressSpace vfio_address_space_memory; + struct VFIOGroup; typedef struct VFIOContainer { + VFIOAddressSpace *space; int fd; /* /dev/vfio/vfio, empowered by the attached groups */ struct { /* enable abstraction to support various iommu backends */ @@ -197,9 +205,6 @@ typedef struct VFIOGroup { #define MSIX_CAP_LENGTH 12 -static QLIST_HEAD(, VFIOContainer) - container_list = QLIST_HEAD_INITIALIZER(container_list); - static QLIST_HEAD(, VFIOGroup) group_list = QLIST_HEAD_INITIALIZER(group_list); @@ -2604,16 +2609,28 @@ static int vfio_load_rom(VFIODevice *vdev) return 0; } -static int vfio_connect_container(VFIOGroup *group) +static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as) +{ + space->as = as; + QLIST_INIT(&space->containers); +} + +static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space) { VFIOContainer *container; int ret, fd; if (group->container) { - return 0; + if (group->container->space == space) { + return 0; + } else { + error_report("vfio: group %d used in multiple address spaces", + group->groupid); + return -EBUSY; + } } - QLIST_FOREACH(container, &container_list, next) { + QLIST_FOREACH(container, &space->containers, next) { if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); @@ -2636,6 +2653,7 @@ static int vfio_connect_container(VFIOGroup *group) } container = g_malloc0(sizeof(*container)); + container->space = 
space; container->fd = fd; if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) { @@ -2658,7 +2676,8 @@ static int vfio_connect_container(VFIOGroup *group) container->iommu_data.listener = vfio_memory_listener; container->iommu_data.release = vfio_listener_release; - memory_listener_register(&container->iommu_data.listener, &address_space_memory); + memory_listener_register(&container->iommu_data.listener, + container->space->as); } else { error_report("vfio: No available IOMMU models"); g_free(container); @@ -2667,7 +2686,7 @@ static int vfio_connect_container(VFIOGroup *group) } QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&container_list, container, next); + QLIST_INSERT_HEAD(&space->containers, container, next); group->container = container; QLIST_INSERT_HEAD(&container->group_list, group, container_next); @@ -2698,7 +2717,7 @@ static void vfio_disconnect_container(VFIOGroup *group) } } -static VFIOGroup *vfio_get_group(int groupid) +static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *space) { VFIOGroup *group; char path[32]; @@ -2706,7 +2725,15 @@ static VFIOGroup *vfio_get_group(int groupid) QLIST_FOREACH(group, &group_list, next) { if (group->groupid == groupid) { - return group; + /* Found it. Now is it already in the right context? 
*/ + assert(group->container); + if (group->container->space == space) { + return group; + } else { + error_report("vfio: group %d used in multiple address spaces", + group->groupid); + return NULL; + } } } @@ -2739,7 +2766,7 @@ static VFIOGroup *vfio_get_group(int groupid) group->groupid = groupid; QLIST_INIT(&group->device_list); - if (vfio_connect_container(group)) { + if (vfio_connect_container(group, space)) { error_report("vfio: failed to setup container for group %d", groupid); close(group->fd); g_free(group); @@ -3093,7 +3120,12 @@ static int vfio_initfn(PCIDevice *pdev) DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, groupid); - group = vfio_get_group(groupid); + if (pci_iommu_as(pdev) != &address_space_memory) { + error_report("vfio: DMA address space must be system memory"); + return -ENXIO; + } + + group = vfio_get_group(groupid, &vfio_address_space_memory); if (!group) { error_report("vfio: failed to get group %d", groupid); return -ENOENT; @@ -3316,6 +3348,7 @@ static const TypeInfo vfio_pci_dev_info = { static void register_vfio_pci_dev_type(void) { + vfio_address_space_init(&vfio_address_space_memory, &address_space_memory); type_register_static(&vfio_pci_dev_info); } -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
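The invariant this patch enforces, a VFIO group may sit in at most one address space, boils down to a simple check on the group's container. A stand-alone sketch with simplified types (the real code additionally probes the kernel with the VFIO_GROUP_SET_CONTAINER ioctl before reusing a container):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative stand-ins for the QEMU structures. */
typedef struct VFIOAddressSpace { const char *name; } VFIOAddressSpace;

typedef struct VFIOContainer {
    VFIOAddressSpace *space;   /* the one address space this container serves */
} VFIOContainer;

typedef struct VFIOGroup {
    int groupid;
    VFIOContainer *container;
} VFIOGroup;

/* Connect a group to a container within 'space'.  Reconnecting to the
 * same space is a no-op; connecting to a different space fails, which
 * mirrors the -EBUSY path added by this patch. */
int vfio_connect_container(VFIOGroup *group, VFIOContainer *container,
                           VFIOAddressSpace *space)
{
    if (group->container) {
        return group->container->space == space ? 0 : -EBUSY;
    }
    container->space = space;
    group->container = container;
    return 0;
}

static VFIOAddressSpace as_mem = { "memory" };
static VFIOAddressSpace as_iommu = { "guest-iommu" };
static VFIOContainer c1 = { NULL };
static VFIOGroup g1 = { 1, NULL };
```

The restriction matters because the kernel maps DMA for a whole container at once: one group visible through two different guest DMA address spaces would need two incompatible sets of host IOMMU mappings.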
* Re: [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces 2013-08-07 8:21 ` [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces Alexey Kardashevskiy @ 2013-08-12 22:07 ` Alex Williamson 2013-08-19 13:15 ` Paolo Bonzini 0 siblings, 1 reply; 21+ messages in thread From: Alex Williamson @ 2013-08-12 22:07 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, Alexander Graf, qemu-devel, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson On Wed, 2013-08-07 at 18:21 +1000, Alexey Kardashevskiy wrote: > From: David Gibson <david@gibson.dropbear.id.au> > > The only model so far supported for VFIO passthrough devices is the model > usually used on x86, where all of the guest's RAM is mapped into the > (host) IOMMU and there is no IOMMU visible in the guest. > > This patch begins to relax this model, introducing the notion of a > VFIOAddressSpace. This represents a logical DMA address space which will > be visible to one or more VFIO devices by appropriate mapping in the (host) > IOMMU. Thus the currently global list of containers becomes local to > a VFIOAddressSpace, and we verify that we don't attempt to add a VFIO > group to multiple address spaces. > > For now, only one VFIOAddressSpace is created and used, corresponding to > main system memory, that will change in future patches. 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > hw/misc/vfio.c | 57 +++++++++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 45 insertions(+), 12 deletions(-) > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c > index adcd23d..bcd67a0 100644 > --- a/hw/misc/vfio.c > +++ b/hw/misc/vfio.c > @@ -130,9 +130,17 @@ enum { > VFIO_INT_MSIX = 3, > }; > > +typedef struct VFIOAddressSpace { > + AddressSpace *as; > + QLIST_HEAD(, VFIOContainer) containers; > +} VFIOAddressSpace; > + > +static VFIOAddressSpace vfio_address_space_memory; > + > struct VFIOGroup; > > typedef struct VFIOContainer { > + VFIOAddressSpace *space; > int fd; /* /dev/vfio/vfio, empowered by the attached groups */ > struct { > /* enable abstraction to support various iommu backends */ > @@ -197,9 +205,6 @@ typedef struct VFIOGroup { > > #define MSIX_CAP_LENGTH 12 > > -static QLIST_HEAD(, VFIOContainer) > - container_list = QLIST_HEAD_INITIALIZER(container_list); > - > static QLIST_HEAD(, VFIOGroup) > group_list = QLIST_HEAD_INITIALIZER(group_list); > > @@ -2604,16 +2609,28 @@ static int vfio_load_rom(VFIODevice *vdev) > return 0; > } > > -static int vfio_connect_container(VFIOGroup *group) > +static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as) > +{ > + space->as = as; > + QLIST_INIT(&space->containers); > +} > + > +static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space) > { > VFIOContainer *container; > int ret, fd; > > if (group->container) { > - return 0; > + if (group->container->space == space) { > + return 0; > + } else { > + error_report("vfio: group %d used in multiple address spaces", > + group->groupid); > + return -EBUSY; > + } > } This original group->container test seems bogus to me, I don't think we can get here with a container already attached to a group. 
> > - QLIST_FOREACH(container, &container_list, next) { > + QLIST_FOREACH(container, &space->containers, next) { > if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { > group->container = container; > QLIST_INSERT_HEAD(&container->group_list, group, container_next); > @@ -2636,6 +2653,7 @@ static int vfio_connect_container(VFIOGroup *group) > } > > container = g_malloc0(sizeof(*container)); > + container->space = space; > container->fd = fd; > > if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) { > @@ -2658,7 +2676,8 @@ static int vfio_connect_container(VFIOGroup *group) > container->iommu_data.listener = vfio_memory_listener; > container->iommu_data.release = vfio_listener_release; > > - memory_listener_register(&container->iommu_data.listener, &address_space_memory); > + memory_listener_register(&container->iommu_data.listener, > + container->space->as); > } else { > error_report("vfio: No available IOMMU models"); > g_free(container); > @@ -2667,7 +2686,7 @@ static int vfio_connect_container(VFIOGroup *group) > } > > QLIST_INIT(&container->group_list); > - QLIST_INSERT_HEAD(&container_list, container, next); > + QLIST_INSERT_HEAD(&space->containers, container, next); > > group->container = container; > QLIST_INSERT_HEAD(&container->group_list, group, container_next); > @@ -2698,7 +2717,7 @@ static void vfio_disconnect_container(VFIOGroup *group) > } > } > > -static VFIOGroup *vfio_get_group(int groupid) > +static VFIOGroup *vfio_get_group(int groupid, VFIOAddressSpace *space) > { > VFIOGroup *group; > char path[32]; > @@ -2706,7 +2725,15 @@ static VFIOGroup *vfio_get_group(int groupid) > > QLIST_FOREACH(group, &group_list, next) { > if (group->groupid == groupid) { > - return group; > + /* Found it. Now is it already in the right context? */ > + assert(group->container); How would a group w/o a container exist? (I really don't like asserts - note the only assert in vfio was added by Avi). 
> + if (group->container->space == space) { > + return group; > + } else { > + error_report("vfio: group %d used in multiple address spaces", > + group->groupid); > + return NULL; > + } > } > } > > @@ -2739,7 +2766,7 @@ static VFIOGroup *vfio_get_group(int groupid) > group->groupid = groupid; > QLIST_INIT(&group->device_list); > > - if (vfio_connect_container(group)) { > + if (vfio_connect_container(group, space)) { > error_report("vfio: failed to setup container for group %d", groupid); > close(group->fd); > g_free(group); > @@ -3093,7 +3120,12 @@ static int vfio_initfn(PCIDevice *pdev) > DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain, > vdev->host.bus, vdev->host.slot, vdev->host.function, groupid); > > - group = vfio_get_group(groupid); > + if (pci_iommu_as(pdev) != &address_space_memory) { > + error_report("vfio: DMA address space must be system memory"); > + return -ENXIO; -EFAULT? It's a bad address of sorts. > + } > + > + group = vfio_get_group(groupid, &vfio_address_space_memory); > if (!group) { > error_report("vfio: failed to get group %d", groupid); > return -ENOENT; > @@ -3316,6 +3348,7 @@ static const TypeInfo vfio_pci_dev_info = { > > static void register_vfio_pci_dev_type(void) > { > + vfio_address_space_init(&vfio_address_space_memory, &address_space_memory); Not a fan of this here, but it's short lived, so ok. > type_register_static(&vfio_pci_dev_info); > } > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces 2013-08-12 22:07 ` Alex Williamson @ 2013-08-19 13:15 ` Paolo Bonzini 0 siblings, 0 replies; 21+ messages in thread From: Paolo Bonzini @ 2013-08-19 13:15 UTC (permalink / raw) To: Alex Williamson Cc: Anthony Liguori, Michael S . Tsirkin, Alexey Kardashevskiy, qemu-devel, Alexander Graf, qemu-ppc, Paul Mackerras, David Gibson On 13/08/2013 00:07, Alex Williamson wrote: >> > + if (pci_iommu_as(pdev) != &address_space_memory) { >> > + error_report("vfio: DMA address space must be system memory"); >> > + return -ENXIO; > -EFAULT? It's a bad address of sorts. > Accessing it would SIGSEGV, so it is not really EFAULT. I would just use EINVAL; the numeric error code will go away as soon as initfn is changed to use "Error *" (which is needed to propagate sensible error messages to the QMP client). Paolo ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Qemu-devel] [PATCH 4/8] vfio: Create VFIOAddressSpace objects as needed 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy ` (2 preceding siblings ...) 2013-08-07 8:21 ` [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support Alexey Kardashevskiy ` (3 subsequent siblings) 7 siblings, 0 replies; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, aik, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson From: David Gibson <david@gibson.dropbear.id.au> So far, VFIO has a notion of different logical DMA address spaces, but only ever uses one (system memory). This patch extends this, creating new VFIOAddressSpace objects as necessary, according to the AddressSpace reported by the PCI subsystem for this device's DMAs. This isn't enough yet to support guest side IOMMUs with VFIO, but it does mean we could now support VFIO devices on, for example, a guest side PCI host bridge which maps system memory at somewhere other than 0 in PCI space. 
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- hw/misc/vfio.c | 42 ++++++++++++++++++++++++++++++++++-------- 1 file changed, 34 insertions(+), 8 deletions(-) diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index bcd67a0..e1ee56e 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -133,9 +133,10 @@ enum { typedef struct VFIOAddressSpace { AddressSpace *as; QLIST_HEAD(, VFIOContainer) containers; + QLIST_ENTRY(VFIOAddressSpace) list; } VFIOAddressSpace; -static VFIOAddressSpace vfio_address_space_memory; +QLIST_HEAD(, VFIOAddressSpace) vfio_address_spaces; struct VFIOGroup; @@ -2609,10 +2610,33 @@ static int vfio_load_rom(VFIODevice *vdev) return 0; } -static void vfio_address_space_init(VFIOAddressSpace *space, AddressSpace *as) +static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as) { + VFIOAddressSpace *space; + + QLIST_FOREACH(space, &vfio_address_spaces, list) { + if (space->as == as) + return space; + } + + /* No suitable VFIOAddressSpace, create a new one */ + space = g_malloc0(sizeof(*space)); space->as = as; QLIST_INIT(&space->containers); + + QLIST_INSERT_HEAD(&vfio_address_spaces, space, list); + + return space; +} + +static void vfio_put_address_space(VFIOAddressSpace *space) +{ + if (!QLIST_EMPTY(&space->containers)) { + return; + } + + QLIST_REMOVE(space, list); + g_free(space); } static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space) @@ -2707,6 +2731,8 @@ static void vfio_disconnect_container(VFIOGroup *group) group->container = NULL; if (QLIST_EMPTY(&container->group_list)) { + VFIOAddressSpace *space = container->space; + if (container->iommu_data.release) { container->iommu_data.release(container); } @@ -2714,6 +2740,8 @@ static void vfio_disconnect_container(VFIOGroup *group) DPRINTF("vfio_disconnect_container: close container->fd\n"); close(container->fd); g_free(container); + + vfio_put_address_space(space); } } @@ -3085,6 +3113,7 @@ static int vfio_initfn(PCIDevice *pdev) { VFIODevice *pvdev, *vdev = 
DO_UPCAST(VFIODevice, pdev, pdev); VFIOGroup *group; + VFIOAddressSpace *space; char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name; ssize_t len; struct stat st; @@ -3120,14 +3149,12 @@ static int vfio_initfn(PCIDevice *pdev) DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, groupid); - if (pci_iommu_as(pdev) != &address_space_memory) { - error_report("vfio: DMA address space must be system memory"); - return -ENXIO; - } + space = vfio_get_address_space(pci_iommu_as(pdev)); - group = vfio_get_group(groupid, &vfio_address_space_memory); + group = vfio_get_group(groupid, space); if (!group) { error_report("vfio: failed to get group %d", groupid); + vfio_put_address_space(space); return -ENOENT; } @@ -3348,7 +3375,6 @@ static const TypeInfo vfio_pci_dev_info = { static void register_vfio_pci_dev_type(void) { - vfio_address_space_init(&vfio_address_space_memory, &address_space_memory); type_register_static(&vfio_pci_dev_info); } -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
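The get/put lifetime scheme from this patch can be sketched on its own: VFIOAddressSpace wrappers are looked up (or created) on demand and freed once their container list empties. A simplified model follows, with a plain container count standing in for the QLIST of containers (names and types are illustrative, not QEMU's):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Illustrative stand-ins for the QEMU structures. */
typedef struct AddressSpace { int id; } AddressSpace;

typedef struct VFIOAddressSpace {
    AddressSpace *as;
    int ncontainers;                 /* stand-in for the container QLIST */
    struct VFIOAddressSpace *next;
} VFIOAddressSpace;

static VFIOAddressSpace *vfio_address_spaces;   /* global list */

/* Look up an existing wrapper for 'as', or create one on demand. */
VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
    VFIOAddressSpace *space;

    for (space = vfio_address_spaces; space; space = space->next) {
        if (space->as == as) {
            return space;
        }
    }
    space = calloc(1, sizeof(*space));
    space->as = as;
    space->next = vfio_address_spaces;
    vfio_address_spaces = space;
    return space;
}

/* Free the wrapper, but only once no containers reference it. */
void vfio_put_address_space(VFIOAddressSpace *space)
{
    VFIOAddressSpace **p;

    if (space->ncontainers > 0) {
        return;                      /* still in use: keep it */
    }
    for (p = &vfio_address_spaces; *p && *p != space; p = &(*p)->next) {
    }
    if (*p) {
        *p = space->next;
    }
    free(space);
}

static AddressSpace guest_as = { 42 };
```

With this in place the initfn no longer needs the hard "must be system memory" check: whatever AddressSpace pci_iommu_as() reports simply gets its own VFIOAddressSpace, and the error path only has to put the space back on failure.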
* [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy ` (3 preceding siblings ...) 2013-08-07 8:21 ` [Qemu-devel] [PATCH 4/8] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-12 22:07 ` Alex Williamson 2013-08-07 8:21 ` [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy ` (2 subsequent siblings) 7 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson This patch uses the new IOMMU notifiers to allow VFIO pass through devices to work with guest side IOMMUs, as long as the host-side VFIO IOMMU has sufficient capability and granularity to match the guest side. This works by tracking all map and unmap operations on the guest IOMMU using the notifiers, and mirroring them into VFIO. There are a number of FIXMEs, and the scheme involves rather more notifier structures than I'd like, but it should make for a reasonable proof of concept.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au> --- hw/misc/vfio.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 129 insertions(+), 20 deletions(-) diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index e1ee56e..3855efe 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -150,10 +150,18 @@ typedef struct VFIOContainer { }; void (*release)(struct VFIOContainer *); } iommu_data; + QLIST_HEAD(, VFIOGuestIOMMU) guest_iommus; QLIST_HEAD(, VFIOGroup) group_list; QLIST_ENTRY(VFIOContainer) next; } VFIOContainer; +typedef struct VFIOGuestIOMMU { + VFIOContainer *container; + MemoryRegion *iommu; + Notifier n; + QLIST_ENTRY(VFIOGuestIOMMU) list; +} VFIOGuestIOMMU; + /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */ typedef struct VFIOMSIXInfo { uint8_t table_bar; @@ -1917,19 +1925,70 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, static bool vfio_listener_skipped_section(MemoryRegionSection *section) { - return !memory_region_is_ram(section->mr); + return !memory_region_is_ram(section->mr) && + !memory_region_is_iommu(section->mr); } -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) +static void vfio_iommu_map_notify(Notifier *n, void *data) { - VFIOContainer *container = container_of(listener, VFIOContainer, - iommu_data.listener); - hwaddr iova, end; + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); + VFIOContainer *container = giommu->container; + IOMMUTLBEntry *iotlb = data; + MemoryRegion *mr; + hwaddr xlat; + hwaddr len = iotlb->addr_mask + 1; void *vaddr; int ret; - assert(!memory_region_is_iommu(section->mr)); + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", + iotlb->iova, iotlb->iova + iotlb->address_mask); + + /* The IOMMU TLB entry we have just covers translation through + * this IOMMU to its immediate target. We need to translate + * it the rest of the way through to memory. 
*/ + mr = address_space_translate(&address_space_memory, + iotlb->translated_addr, + &xlat, &len, iotlb->perm & IOMMU_WO); + if (!memory_region_is_ram(mr)) { + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n", + xlat); + return; + } + if (len & iotlb->addr_mask) { + DPRINTF("iommu has granularity incompatible with target AS\n"); + return; + } + + vaddr = memory_region_get_ram_ptr(mr) + xlat; + + if (iotlb->perm != IOMMU_NONE) { + ret = vfio_dma_map(container, iotlb->iova, + iotlb->addr_mask + 1, vaddr, + !(iotlb->perm & IOMMU_WO) || mr->readonly); + if (ret) { + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iotlb->iova, + iotlb->addr_mask + 1, vaddr, ret); + } + } else { + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iotlb->iova, + iotlb->addr_mask + 1, ret); + } + } +} + +static void vfio_listener_region_add(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, + iommu_data.listener); + hwaddr iova, end; + int ret; if (vfio_listener_skipped_section(section)) { DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", @@ -1952,19 +2011,51 @@ static void vfio_listener_region_add(MemoryListener *listener, return; } - vaddr = memory_region_get_ram_ptr(section->mr) + - section->offset_within_region + - (iova - section->offset_within_address_space); - - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", - iova, end - 1, vaddr); - - memory_region_ref(section->mr); - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); - if (ret) { - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, end - iova, vaddr, ret); + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + DPRINTF("region_add [iommu] 
%"HWADDR_PRIx" - %"HWADDR_PRIx"\n", + iova, end - 1); + + memory_region_ref(section->mr); + /* + * FIXME: We should do some checking to see if the + * capabilities of the host VFIO IOMMU are adequate to model + * the guest IOMMU + * + * FIXME: This assumes that the guest IOMMU is empty of + * mappings at this point - we should either enforce this, or + * loop through existing mappings to map them into VFIO. + * + * FIXME: For VFIO iommu types which have KVM acceleration to + * avoid bouncing all map/unmaps through qemu this way, this + * would be the right place to wire that up (tell the KVM + * device emulation the VFIO iommu handles to use). + */ + giommu = g_malloc0(sizeof(*giommu)); + giommu->iommu = section->mr; + giommu->container = container; + giommu->n.notify = vfio_iommu_map_notify; + QLIST_INSERT_HEAD(&container->guest_iommus, giommu, list); + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); + + } else if (memory_region_is_ram(section->mr)) { + void *vaddr; + + vaddr = memory_region_get_ram_ptr(section->mr) + + section->offset_within_region + + (iova - section->offset_within_address_space); + + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", + iova, end - 1, vaddr); + + memory_region_ref(section->mr); + ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); + if (ret) { + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, end - iova, vaddr, ret); + } } } @@ -1989,6 +2080,24 @@ static void vfio_listener_region_del(MemoryListener *listener, return; } + if (memory_region_is_iommu(section->mr)) { + VFIOGuestIOMMU *giommu; + + QLIST_FOREACH(giommu, &container->guest_iommus, list) { + if (giommu->iommu == section->mr) { + memory_region_unregister_iommu_notifier(&giommu->n); + QLIST_REMOVE(giommu, list); + g_free(giommu); + break; + } + } + + /* FIXME: We assume the one big unmap below is adequate to + * remove any individual page mappings in the 
IOMMU which + * might have been copied into VFIO. That may not be true for + * all IOMMU types */ + } + iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); end = (section->offset_within_address_space + int128_get64(section->size)) & TARGET_PAGE_MASK; -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-07 8:21 ` [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support Alexey Kardashevskiy @ 2013-08-12 22:07 ` Alex Williamson 2013-08-15 6:02 ` Alexey Kardashevskiy 0 siblings, 1 reply; 21+ messages in thread From: Alex Williamson @ 2013-08-12 22:07 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, Alexander Graf, qemu-devel, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson On Wed, 2013-08-07 at 18:21 +1000, Alexey Kardashevskiy wrote: > This patch uses the new IOMMU notifiers to allow VFIO pass through devices > to work with guest side IOMMUs, as long as the host-side VFIO iommu has > sufficient capability and granularity to match the guest side. This works > by tracking all map and unmap operations on the guest IOMMU using the > notifiers, and mirroring them into VFIO. > > There are a number of FIXMEs, and the scheme involves rather more notifier > structures than I'd like, but it should make for a reasonable proof of > concept. > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au> > --- > hw/misc/vfio.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++-------- > 1 file changed, 129 insertions(+), 20 deletions(-) > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c > index e1ee56e..3855efe 100644 > --- a/hw/misc/vfio.c > +++ b/hw/misc/vfio.c > @@ -150,10 +150,18 @@ typedef struct VFIOContainer { > }; > void (*release)(struct VFIOContainer *); > } iommu_data; > + QLIST_HEAD(, VFIOGuestIOMMU) guest_iommus; > QLIST_HEAD(, VFIOGroup) group_list; > QLIST_ENTRY(VFIOContainer) next; > } VFIOContainer; > > +typedef struct VFIOGuestIOMMU { > + VFIOContainer *container; > + MemoryRegion *iommu; > + Notifier n; > + QLIST_ENTRY(VFIOGuestIOMMU) list; Within vfio the convention is that an ENTRY is next or foo_next and a HEAD is foo_list. So perhaps giommu_list and next?
> +} VFIOGuestIOMMU; > + > /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */ > typedef struct VFIOMSIXInfo { > uint8_t table_bar; > @@ -1917,19 +1925,70 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova, > > static bool vfio_listener_skipped_section(MemoryRegionSection *section) > { > - return !memory_region_is_ram(section->mr); > + return !memory_region_is_ram(section->mr) && > + !memory_region_is_iommu(section->mr); > } > > -static void vfio_listener_region_add(MemoryListener *listener, > - MemoryRegionSection *section) > +static void vfio_iommu_map_notify(Notifier *n, void *data) > { > - VFIOContainer *container = container_of(listener, VFIOContainer, > - iommu_data.listener); > - hwaddr iova, end; > + VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n); > + VFIOContainer *container = giommu->container; > + IOMMUTLBEntry *iotlb = data; > + MemoryRegion *mr; > + hwaddr xlat; > + hwaddr len = iotlb->addr_mask + 1; > void *vaddr; > int ret; > > - assert(!memory_region_is_iommu(section->mr)); Yay! > + DPRINTF("iommu map @ %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", > + iotlb->iova, iotlb->iova + iotlb->address_mask); > + > + /* The IOMMU TLB entry we have just covers translation through > + * this IOMMU to its immediate target. We need to translate > + * it the rest of the way through to memory. 
*/ /* * Multi-line comment this way please */ > + mr = address_space_translate(&address_space_memory, > + iotlb->translated_addr, > + &xlat, &len, iotlb->perm & IOMMU_WO); > + if (!memory_region_is_ram(mr)) { > + DPRINTF("iommu map to non memory area %"HWADDR_PRIx"\n", > + xlat); > + return; > + } > + if (len & iotlb->addr_mask) { > + DPRINTF("iommu has granularity incompatible with target AS\n"); > + return; > + } > + > + vaddr = memory_region_get_ram_ptr(mr) + xlat; > + > + if (iotlb->perm != IOMMU_NONE) { > + ret = vfio_dma_map(container, iotlb->iova, > + iotlb->addr_mask + 1, vaddr, > + !(iotlb->perm & IOMMU_WO) || mr->readonly); > + if (ret) { > + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " > + "0x%"HWADDR_PRIx", %p) = %d (%m)", > + container, iotlb->iova, > + iotlb->addr_mask + 1, vaddr, ret); > + } > + } else { > + ret = vfio_dma_unmap(container, iotlb->iova, iotlb->addr_mask + 1); > + if (ret) { > + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " > + "0x%"HWADDR_PRIx") = %d (%m)", > + container, iotlb->iova, > + iotlb->addr_mask + 1, ret); > + } > + } > +} > + > +static void vfio_listener_region_add(MemoryListener *listener, > + MemoryRegionSection *section) > +{ > + VFIOContainer *container = container_of(listener, VFIOContainer, > + iommu_data.listener); > + hwaddr iova, end; > + int ret; > > if (vfio_listener_skipped_section(section)) { > DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", > @@ -1952,19 +2011,51 @@ static void vfio_listener_region_add(MemoryListener *listener, > return; > } > > - vaddr = memory_region_get_ram_ptr(section->mr) + > - section->offset_within_region + > - (iova - section->offset_within_address_space); > - > - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", > - iova, end - 1, vaddr); > - > - memory_region_ref(section->mr); > - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); > - if (ret) { > - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " > - "0x%"HWADDR_PRIx", 
%p) = %d (%m)", > - container, iova, end - iova, vaddr, ret); > + if (memory_region_is_iommu(section->mr)) { > + VFIOGuestIOMMU *giommu; > + > + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", > + iova, end - 1); > + > + memory_region_ref(section->mr); > + /* > + * FIXME: We should do some checking to see if the > + * capabilities of the host VFIO IOMMU are adequate to model > + * the guest IOMMU > + * > + * FIXME: This assumes that the guest IOMMU is empty of > + * mappings at this point - we should either enforce this, or > + * loop through existing mappings to map them into VFIO. > + * > + * FIXME: For VFIO iommu types which have KVM acceleration to > + * avoid bouncing all map/unmaps through qemu this way, this > + * would be the right place to wire that up (tell the KVM > + * device emulation the VFIO iommu handles to use). > + */ > + giommu = g_malloc0(sizeof(*giommu)); > + giommu->iommu = section->mr; > + giommu->container = container; > + giommu->n.notify = vfio_iommu_map_notify; > + QLIST_INSERT_HEAD(&container->guest_iommus, giommu, list); > + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); > + > + } else if (memory_region_is_ram(section->mr)) { > + void *vaddr; > + > + vaddr = memory_region_get_ram_ptr(section->mr) + > + section->offset_within_region + > + (iova - section->offset_within_address_space); > + > + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", > + iova, end - 1, vaddr); > + > + memory_region_ref(section->mr); Where are these unref'd? 
> + ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); > + if (ret) { > + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " > + "0x%"HWADDR_PRIx", %p) = %d (%m)", > + container, iova, end - iova, vaddr, ret); > + } > } > } > > @@ -1989,6 +2080,24 @@ static void vfio_listener_region_del(MemoryListener *listener, > return; > } > > + if (memory_region_is_iommu(section->mr)) { > + VFIOGuestIOMMU *giommu; > + > + QLIST_FOREACH(giommu, &container->guest_iommus, list) { > + if (giommu->iommu == section->mr) { > + memory_region_unregister_iommu_notifier(&giommu->n); > + QLIST_REMOVE(giommu, list); > + g_free(giommu); > + break; > + } > + } > + > + /* FIXME: We assume the one big unmap below is adequate to > + * remove any individual page mappings in the IOMMU which > + * might have been copied into VFIO. That may not be true for > + * all IOMMU types */ > + } > + > iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); > end = (section->offset_within_address_space + int128_get64(section->size)) & > TARGET_PAGE_MASK; ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-12 22:07 ` Alex Williamson @ 2013-08-15 6:02 ` Alexey Kardashevskiy 2013-08-19 13:12 ` Paolo Bonzini 0 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-15 6:02 UTC (permalink / raw) To: Alex Williamson Cc: Anthony Liguori, Michael S . Tsirkin, Alexander Graf, qemu-devel, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson On 08/13/2013 08:07 AM, Alex Williamson wrote: >> +static void vfio_listener_region_add(MemoryListener *listener, >> + MemoryRegionSection *section) >> +{ >> + VFIOContainer *container = container_of(listener, VFIOContainer, >> + iommu_data.listener); >> + hwaddr iova, end; >> + int ret; >> >> if (vfio_listener_skipped_section(section)) { >> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", >> @@ -1952,19 +2011,51 @@ static void vfio_listener_region_add(MemoryListener *listener, >> return; >> } >> >> - vaddr = memory_region_get_ram_ptr(section->mr) + >> - section->offset_within_region + >> - (iova - section->offset_within_address_space); >> - >> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >> - iova, end - 1, vaddr); >> - >> - memory_region_ref(section->mr); >> - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); >> - if (ret) { >> - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >> - "0x%"HWADDR_PRIx", %p) = %d (%m)", >> - container, iova, end - iova, vaddr, ret); >> + if (memory_region_is_iommu(section->mr)) { >> + VFIOGuestIOMMU *giommu; >> + >> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", >> + iova, end - 1); >> + >> + memory_region_ref(section->mr); >> + /* >> + * FIXME: We should do some checking to see if the >> + * capabilities of the host VFIO IOMMU are adequate to model >> + * the guest IOMMU >> + * >> + * FIXME: This assumes that the guest IOMMU is empty of >> + * mappings at this point - we should either enforce this, or >> + * loop through existing 
mappings to map them into VFIO. >> + * >> + * FIXME: For VFIO iommu types which have KVM acceleration to >> + * avoid bouncing all map/unmaps through qemu this way, this >> + * would be the right place to wire that up (tell the KVM >> + * device emulation the VFIO iommu handles to use). >> + */ >> + giommu = g_malloc0(sizeof(*giommu)); >> + giommu->iommu = section->mr; >> + giommu->container = container; >> + giommu->n.notify = vfio_iommu_map_notify; >> + QLIST_INSERT_HEAD(&container->guest_iommus, giommu, list); >> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); >> + >> + } else if (memory_region_is_ram(section->mr)) { >> + void *vaddr; >> + >> + vaddr = memory_region_get_ram_ptr(section->mr) + >> + section->offset_within_region + >> + (iova - section->offset_within_address_space); >> + >> + DPRINTF("region_add [ram] %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >> + iova, end - 1, vaddr); >> + >> + memory_region_ref(section->mr); > > > Where are these unref'd? It is already vfio_listener_region_del(), as the original ref() which this patch just moves around. -- Alexey ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-15 6:02 ` Alexey Kardashevskiy @ 2013-08-19 13:12 ` Paolo Bonzini 2013-08-21 5:31 ` Alexey Kardashevskiy 0 siblings, 1 reply; 21+ messages in thread From: Paolo Bonzini @ 2013-08-19 13:12 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc, Paul Mackerras, David Gibson Il 15/08/2013 08:02, Alexey Kardashevskiy ha scritto: > On 08/13/2013 08:07 AM, Alex Williamson wrote: >>> +static void vfio_listener_region_add(MemoryListener *listener, >>> + MemoryRegionSection *section) >>> +{ >>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>> + iommu_data.listener); >>> + hwaddr iova, end; >>> + int ret; >>> >>> if (vfio_listener_skipped_section(section)) { >>> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", >>> @@ -1952,19 +2011,51 @@ static void vfio_listener_region_add(MemoryListener *listener, >>> return; >>> } >>> >>> - vaddr = memory_region_get_ram_ptr(section->mr) + >>> - section->offset_within_region + >>> - (iova - section->offset_within_address_space); >>> - >>> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >>> - iova, end - 1, vaddr); >>> - >>> - memory_region_ref(section->mr); >>> - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); >>> - if (ret) { >>> - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>> - "0x%"HWADDR_PRIx", %p) = %d (%m)", >>> - container, iova, end - iova, vaddr, ret); >>> + if (memory_region_is_iommu(section->mr)) { >>> + VFIOGuestIOMMU *giommu; >>> + >>> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", >>> + iova, end - 1); >>> + >>> + memory_region_ref(section->mr); >>> + /* >>> + * FIXME: We should do some checking to see if the >>> + * capabilities of the host VFIO IOMMU are adequate to model >>> + * the guest IOMMU >>> + * >>> + * FIXME: This assumes that the guest IOMMU is empty of 
>>> + * mappings at this point - we should either enforce this, or >>> + * loop through existing mappings to map them into VFIO. >>> + * >>> + * FIXME: For VFIO iommu types which have KVM acceleration to >>> + * avoid bouncing all map/unmaps through qemu this way, this >>> + * would be the right place to wire that up (tell the KVM >>> + * device emulation the VFIO iommu handles to use). >>> + */ >>> + giommu = g_malloc0(sizeof(*giommu)); >>> + giommu->iommu = section->mr; >>> + giommu->container = container; >>> + giommu->n.notify = vfio_iommu_map_notify; >>> + QLIST_INSERT_HEAD(&container->guest_iommus, giommu, list); >>> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); >>> + >>> + } else if (memory_region_is_ram(section->mr)) { Please change this to an "else", and leave the refs outside the if, just after checking for vfio_listener_skipped_section. This way it's clearer that vfio_listener_region_add matches vfio_listener_region_del. Paolo ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-19 13:12 ` Paolo Bonzini @ 2013-08-21 5:31 ` Alexey Kardashevskiy 2013-08-21 8:55 ` Paolo Bonzini 0 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-21 5:31 UTC (permalink / raw) To: Paolo Bonzini Cc: Anthony Liguori, Michael S . Tsirkin, qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc, Paul Mackerras, David Gibson On 08/19/2013 11:12 PM, Paolo Bonzini wrote: > Il 15/08/2013 08:02, Alexey Kardashevskiy ha scritto: >> On 08/13/2013 08:07 AM, Alex Williamson wrote: >>>> +static void vfio_listener_region_add(MemoryListener *listener, >>>> + MemoryRegionSection *section) >>>> +{ >>>> + VFIOContainer *container = container_of(listener, VFIOContainer, >>>> + iommu_data.listener); >>>> + hwaddr iova, end; >>>> + int ret; >>>> >>>> if (vfio_listener_skipped_section(section)) { >>>> DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", >>>> @@ -1952,19 +2011,51 @@ static void vfio_listener_region_add(MemoryListener *listener, >>>> return; >>>> } >>>> >>>> - vaddr = memory_region_get_ram_ptr(section->mr) + >>>> - section->offset_within_region + >>>> - (iova - section->offset_within_address_space); >>>> - >>>> - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", >>>> - iova, end - 1, vaddr); >>>> - >>>> - memory_region_ref(section->mr); >>>> - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); >>>> - if (ret) { >>>> - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " >>>> - "0x%"HWADDR_PRIx", %p) = %d (%m)", >>>> - container, iova, end - iova, vaddr, ret); >>>> + if (memory_region_is_iommu(section->mr)) { >>>> + VFIOGuestIOMMU *giommu; >>>> + >>>> + DPRINTF("region_add [iommu] %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", >>>> + iova, end - 1); >>>> + >>>> + memory_region_ref(section->mr); >>>> + /* >>>> + * FIXME: We should do some checking to see if the >>>> + * capabilities of the host VFIO IOMMU are adequate to model >>>> + * the 
guest IOMMU >>>> + * >>>> + * FIXME: This assumes that the guest IOMMU is empty of >>>> + * mappings at this point - we should either enforce this, or >>>> + * loop through existing mappings to map them into VFIO. >>>> + * >>>> + * FIXME: For VFIO iommu types which have KVM acceleration to >>>> + * avoid bouncing all map/unmaps through qemu this way, this >>>> + * would be the right place to wire that up (tell the KVM >>>> + * device emulation the VFIO iommu handles to use). >>>> + */ >>>> + giommu = g_malloc0(sizeof(*giommu)); >>>> + giommu->iommu = section->mr; >>>> + giommu->container = container; >>>> + giommu->n.notify = vfio_iommu_map_notify; >>>> + QLIST_INSERT_HEAD(&container->guest_iommus, giommu, list); >>>> + memory_region_register_iommu_notifier(giommu->iommu, &giommu->n); >>>> + >>>> + } else if (memory_region_is_ram(section->mr)) { > > Please change this to an "else", and leave the refs outside the if, just > after checking for vfio_listener_skipped_section. Why right after vfio_listener_skipped_section? 
The whole block looks now as: === static void vfio_listener_region_add(MemoryListener *listener, MemoryRegionSection *section) { VFIOContainer *container = container_of(listener, VFIOContainer, iommu_data.listener); hwaddr iova, end; void *vaddr; int ret; if (vfio_listener_skipped_section(section)) { DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", section->offset_within_address_space, section->offset_within_address_space + section->size - 1); return; } if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != (section->offset_within_region & ~TARGET_PAGE_MASK))) { error_report("%s received unaligned region", __func__); return; } iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); end = (section->offset_within_address_space + int128_get64(section->size)) & TARGET_PAGE_MASK; if (iova >= end) { return; } memory_region_ref(section->mr); === You want me to move memory_region_ref() earlier to add a reference even if two last checks fail? > This way it's clearer > that vfio_listener_region_add matches vfio_listener_region_del. Yes, true, reworked. Thanks for comments! -- Alexey ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support 2013-08-21 5:31 ` Alexey Kardashevskiy @ 2013-08-21 8:55 ` Paolo Bonzini 0 siblings, 0 replies; 21+ messages in thread From: Paolo Bonzini @ 2013-08-21 8:55 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, qemu-devel, Alexander Graf, Alex Williamson, qemu-ppc, Paul Mackerras, David Gibson Il 21/08/2013 07:31, Alexey Kardashevskiy ha scritto: > memory_region_ref(section->mr); > === > > You want me to move memory_region_ref() earlier to add a reference even if > two last checks fail? No, of course not---I was looking at _del source code not _add. Paolo ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy ` (4 preceding siblings ...) 2013-08-07 8:21 ` [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-12 22:07 ` Alex Williamson 2013-08-07 8:21 ` [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy 2013-08-07 8:21 ` [Qemu-devel] [PATCH 8/8] spapr vfio: enable for spapr Alexey Kardashevskiy 7 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson As sPAPR platform supports DMA windows on a PCI bus, the information about their location and size should be passed into the guest via the device tree. The patch adds a helper to read this info from the container fd. 
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- hw/misc/vfio.c | 37 +++++++++++++++++++++++++++++++++++++ include/hw/misc/vfio.h | 11 +++++++++++ 2 files changed, 48 insertions(+) create mode 100644 include/hw/misc/vfio.h diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c index 3855efe..4571f64 100644 --- a/hw/misc/vfio.c +++ b/hw/misc/vfio.c @@ -39,6 +39,7 @@ #include "qemu/range.h" #include "sysemu/kvm.h" #include "sysemu/sysemu.h" +#include "hw/misc/vfio.h" /* #define DEBUG_VFIO */ #ifdef DEBUG_VFIO @@ -3488,3 +3489,39 @@ static void register_vfio_pci_dev_type(void) } type_init(register_vfio_pci_dev_type) + +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid, + struct vfio_iommu_spapr_tce_info *info, + int *group_fd) +{ + VFIOAddressSpace *space; + VFIOGroup *group; + VFIOContainer *container; + int ret, fd; + + space = vfio_get_address_space(as); + if (!space) { + return -1; + } + group = vfio_get_group(groupid, space); + if (!group) { + return -1; + } + container = group->container; + if (!group->container) { + return -1; + } + fd = container->fd; + if (!ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) { + return -1; + } + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info); + if (ret) { + error_report("vfio: failed to get iommu info for container: %s", + strerror(errno)); + return -1; + } + *group_fd = group->fd; + + return 0; +} diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h new file mode 100644 index 0000000..ac9a971 --- /dev/null +++ b/include/hw/misc/vfio.h @@ -0,0 +1,11 @@ +#ifndef VFIO_API_H +#define VFIO_API_H + +#include "qemu/typedefs.h" +#include <linux/vfio.h> + +extern int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid, + struct vfio_iommu_spapr_tce_info *info, + int *group_fd); + +#endif -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() 2013-08-07 8:21 ` [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy @ 2013-08-12 22:07 ` Alex Williamson 0 siblings, 0 replies; 21+ messages in thread From: Alex Williamson @ 2013-08-12 22:07 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, Alexander Graf, qemu-devel, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson On Wed, 2013-08-07 at 18:21 +1000, Alexey Kardashevskiy wrote: > As sPAPR platform supports DMA windows on a PCI bus, the information > about their location and size should be passed into the guest via > the device tree. > > The patch adds a helper to read this info from the container fd. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > hw/misc/vfio.c | 37 +++++++++++++++++++++++++++++++++++++ > include/hw/misc/vfio.h | 11 +++++++++++ > 2 files changed, 48 insertions(+) > create mode 100644 include/hw/misc/vfio.h > > diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c > index 3855efe..4571f64 100644 > --- a/hw/misc/vfio.c > +++ b/hw/misc/vfio.c > @@ -39,6 +39,7 @@ > #include "qemu/range.h" > #include "sysemu/kvm.h" > #include "sysemu/sysemu.h" > +#include "hw/misc/vfio.h" > > /* #define DEBUG_VFIO */ > #ifdef DEBUG_VFIO > @@ -3488,3 +3489,39 @@ static void register_vfio_pci_dev_type(void) > } > > type_init(register_vfio_pci_dev_type) > + > +int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid, > + struct vfio_iommu_spapr_tce_info *info, > + int *group_fd) > +{ > + VFIOAddressSpace *space; > + VFIOGroup *group; > + VFIOContainer *container; > + int ret, fd; > + > + space = vfio_get_address_space(as); > + if (!space) { > + return -1; > + } > + group = vfio_get_group(groupid, space); > + if (!group) { > + return -1; > + } > + container = group->container; > + if (!group->container) { > + return -1; > + } > + fd = container->fd; > + if (!ioctl(fd, VFIO_CHECK_EXTENSION, 
VFIO_SPAPR_TCE_IOMMU)) { > + return -1; > + } > + ret = ioctl(fd, VFIO_IOMMU_SPAPR_TCE_GET_INFO, info); > + if (ret) { > + error_report("vfio: failed to get iommu info for container: %s", > + strerror(errno)); > + return -1; > + } Doesn't this leak on all the exit paths if the address space and group don't exist yet? > + *group_fd = group->fd; > + > + return 0; > +} > diff --git a/include/hw/misc/vfio.h b/include/hw/misc/vfio.h > new file mode 100644 > index 0000000..ac9a971 > --- /dev/null > +++ b/include/hw/misc/vfio.h > @@ -0,0 +1,11 @@ > +#ifndef VFIO_API_H > +#define VFIO_API_H > + > +#include "qemu/typedefs.h" > +#include <linux/vfio.h> > + > +extern int vfio_container_spapr_get_info(AddressSpace *as, int32_t groupid, > + struct vfio_iommu_spapr_tce_info *info, > + int *group_fd); > + > +#endif ^ permalink raw reply [flat|nested] 21+ messages in thread
* [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio 2013-08-07 8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy ` (5 preceding siblings ...) 2013-08-07 8:21 ` [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy @ 2013-08-07 8:21 ` Alexey Kardashevskiy 2013-08-27 11:08 ` Alexander Graf 2013-08-07 8:21 ` [Qemu-devel] [PATCH 8/8] spapr vfio: enable for spapr Alexey Kardashevskiy 7 siblings, 1 reply; 21+ messages in thread From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw) To: qemu-devel Cc: Anthony Liguori, Michael S . Tsirkin, Alexey Kardashevskiy, Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson The patch adds a spapr-pci-vfio-host-bridge device type which is a PCI Host Bridge with VFIO support. The new device inherits from the spapr-pci-host-bridge device and adds the following properties: iommu - IOMMU group ID which represents a Partitionable Endpoint, QEMU/ppc64 uses a separate PHB for an IOMMU group so the guest kernel has to have PCI Domain support enabled. forceaddr (optional, 0 by default) - forces QEMU to copy device:function from the host address as certain guest drivers expect devices to appear in particular locations; mf (optional, 0 by default) - forces multifunction bit for the function #0 of a found device, only makes sense for multifunction devices and only with the forceaddr property set. It would not be required if there was a way to know in advance whether a device is multifunctional or not. scan (optional, 1 by default) - if non-zero, the new PHB walks through all non-bridge devices in the group and tries adding them to the PHB; if zero, all devices in the group have to be configured manually via the QEMU command line. The patch also adds a VFIO IOMMU type support to the existing sPAPR TCE list in spapr_iommu.c. 
The patch also uses the host kernel support of a new KVM_CAP_SPAPR_TCE_IOMMU capability and KVM_CREATE_SPAPR_TCE_IOMMU ioctl which let QEMU tell the host what LIOBN is used for an IOMMU group. This ioctl turns real mode TCE requests handling on which accelerates actual throughput in 2.5-5 times. Examples: 1) Scan and add all devices from IOMMU group with ID=1 to QEMU's PHB #6: -device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6 2) Configure and Add 3 functions of a multifunctional device to QEMU: (the NEC PCI USB card is used as an example here): -device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \ -device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true -device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB -device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- hw/ppc/spapr_iommu.c | 176 ++++++++++++++++++++++++++++++++----- hw/ppc/spapr_pci.c | 209 +++++++++++++++++++++++++++++++++++++++++--- include/hw/pci-host/spapr.h | 12 +++ include/hw/ppc/spapr.h | 19 ++++ target-ppc/kvm.c | 33 +++++++ target-ppc/kvm_ppc.h | 12 +++ trace-events | 4 + 7 files changed, 429 insertions(+), 36 deletions(-) diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c index 22b09be..096b6a9 100644 --- a/hw/ppc/spapr_iommu.c +++ b/hw/ppc/spapr_iommu.c @@ -16,12 +16,14 @@ * You should have received a copy of the GNU Lesser General Public * License along with this library; if not, see <http://www.gnu.org/licenses/>. 
*/ + #include "hw/hw.h" #include "sysemu/kvm.h" #include "hw/qdev.h" #include "kvm_ppc.h" #include "sysemu/dma.h" #include "exec/address-spaces.h" +#include "trace.h" #include "hw/ppc/spapr.h" @@ -244,6 +246,74 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba, return H_SUCCESS; } +static IOMMUTLBEntry spapr_vfio_translate_iommu(MemoryRegion *iommu, hwaddr addr) +{ + IOMMUTLBEntry entry; + /* Must never be called */ + assert(0); + return entry; +} + +static MemoryRegionIOMMUOps spapr_vfio_iommu_ops = { + .translate = spapr_vfio_translate_iommu, +}; + +static int spapr_tce_table_vfio_realize(DeviceState *dev) +{ + sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev); + + memory_region_init_iommu(&tcet->iommu, NULL, &spapr_vfio_iommu_ops, + "iommu-vfio-spapr", (uint64_t)INT64_MAX+1); + + QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list); + + return 0; +} + +sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn, + int group_fd) +{ + sPAPRTCETable *tcet; + int fd; + + if (spapr_tce_find_by_liobn(liobn)) { + fprintf(stderr, "Attempted to create TCE table with duplicate" + " LIOBN 0x%x\n", liobn); + return NULL; + } + + fd = kvmppc_create_spapr_tce_iommu(liobn, group_fd); + + tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE_VFIO)); + tcet->liobn = liobn; + tcet->fd = fd; + object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL); + + qdev_init_nofail(DEVICE(tcet)); + + return tcet; +} + +static target_ulong put_tce_vfio(sPAPRTCETable *tcet, target_ulong ioba, + target_ulong tce) +{ + IOMMUTLBEntry entry; + + entry.iova = ioba & ~SPAPR_TCE_PAGE_MASK; + entry.translated_addr = tce & ~SPAPR_TCE_PAGE_MASK; + entry.addr_mask = SPAPR_TCE_PAGE_MASK; + entry.perm = 0; + if ((tce & SPAPR_TCE_RO) == SPAPR_TCE_RO) { + entry.perm |= IOMMU_RO; + } + if ((tce & SPAPR_TCE_WO) == SPAPR_TCE_WO) { + entry.perm |= IOMMU_WO; + } + memory_region_notify_iommu(&tcet->iommu, entry); + + return H_SUCCESS; +} + static target_ulong 
h_put_tce_indirect(PowerPCCPU *cpu, sPAPREnvironment *spapr, target_ulong opcode, target_ulong *args) @@ -255,18 +325,36 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu, target_ulong npages = args[3]; target_ulong ret = 0; sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); + sPAPRTCETableClass *info; - if (tcet) { - for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { - target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) + - i * sizeof(target_ulong)); - ret = put_tce_emu(tcet, ioba, tce); - if (ret) { - break; - } + if (!tcet) { + return H_PARAMETER; + } + + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); + if (!info || !info->put_tce) { + return H_PARAMETER; + } + + if ((tce_list & SPAPR_TCE_PAGE_MASK) || (npages > 512)) { + return H_PARAMETER; + } + + if (liobn & 0xFFFFFFFF00000000ULL) { + hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN " + TARGET_FMT_lx "\n", liobn); + return H_PARAMETER; + } + + for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { + target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) + + i * sizeof(target_ulong)); + ret = info->put_tce(tcet, ioba, tce); + if (ret) { + break; } - return ret; } + #ifdef DEBUG_TCE fprintf(stderr, "%s on liobn=" TARGET_FMT_lx " ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx @@ -274,7 +362,7 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu, __func__, liobn, ioba, tce_list, ret); #endif - return H_PARAMETER; + return ret; } static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, @@ -287,17 +375,30 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, target_ulong npages = args[3]; target_ulong ret = 0; sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); + sPAPRTCETableClass *info; + + if (!tcet) { + return H_PARAMETER; + } + + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); + if (!info || !info->put_tce) { + return H_PARAMETER; + } + + if (liobn & 0xFFFFFFFF00000000ULL) { + hcall_dprintf("spapr_vio_put_tce on out-of-bounds
LIOBN " + TARGET_FMT_lx "\n", liobn); + return H_PARAMETER; + } ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1); - if (tcet) { - for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { - ret = put_tce_emu(tcet, ioba, tce_value); - if (ret) { - break; - } + for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { + ret = info->put_tce(tcet, ioba, tce_value); + if (ret) { + break; } - return ret; } #ifdef DEBUG_TCE fprintf(stderr, "%s on liobn=" TARGET_FMT_lx @@ -306,7 +407,7 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, __func__, liobn, ioba, tce_value, ret); #endif - return H_PARAMETER; + return ret; } static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, @@ -316,12 +417,21 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, target_ulong ioba = args[1]; target_ulong tce = args[2]; sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); + target_ulong ret; + sPAPRTCETableClass *info; + + if (!tcet) { + return H_PARAMETER; + } + + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); + if (!info || !info->put_tce) { + return H_PARAMETER; + } ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1); - if (tcet) { - return put_tce_emu(tcet, ioba, tce); - } + ret = info->put_tce(tcet, ioba, tce); #ifdef DEBUG_TCE fprintf(stderr, "%s on liobn=" TARGET_FMT_lx " ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx @@ -329,7 +439,7 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, __func__, liobn, ioba, tce, ret); #endif - return H_PARAMETER; + return ret; } int spapr_dma_dt(void *fdt, int node_off, const char *propname, @@ -376,9 +486,12 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname, static void spapr_tce_table_class_init(ObjectClass *klass, void *data) { DeviceClass *dc = DEVICE_CLASS(klass); + sPAPRTCETableClass *k = SPAPR_TCE_TABLE_CLASS(klass); + dc->vmsd = &vmstate_spapr_tce_table; dc->init = spapr_tce_table_realize; dc->reset = spapr_tce_reset; + k->put_tce = put_tce_emu; 
QLIST_INIT(&spapr_tce_tables); @@ -393,12 +506,31 @@ static TypeInfo spapr_tce_table_info = { .parent = TYPE_DEVICE, .instance_size = sizeof(sPAPRTCETable), .class_init = spapr_tce_table_class_init, + .class_size = sizeof(sPAPRTCETableClass), .instance_finalize = spapr_tce_table_finalize, }; +static void spapr_tce_table_vfio_class_init(ObjectClass *klass, void *data) +{ + DeviceClass *dc = DEVICE_CLASS(klass); + sPAPRTCETableClass *k = SPAPR_TCE_TABLE_CLASS(klass); + + dc->init = spapr_tce_table_vfio_realize; + k->put_tce = put_tce_vfio; +} + +static TypeInfo spapr_tce_table_vfio_info = { + .name = TYPE_SPAPR_TCE_TABLE_VFIO, + .parent = TYPE_SPAPR_TCE_TABLE, + .instance_size = sizeof(sPAPRTCETable), + .class_init = spapr_tce_table_vfio_class_init, + .class_size = sizeof(sPAPRTCETableClass), +}; + static void register_types(void) { type_register_static(&spapr_tce_table_info); + type_register_static(&spapr_tce_table_vfio_info); } type_init(register_types); diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c index 869ca43..3f37cac 100644 --- a/hw/ppc/spapr_pci.c +++ b/hw/ppc/spapr_pci.c @@ -22,6 +22,9 @@ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN * THE SOFTWARE. */ +#include <sys/types.h> +#include <dirent.h> + #include "hw/hw.h" #include "hw/pci/pci.h" #include "hw/pci/msi.h" @@ -32,6 +35,7 @@ #include "exec/address-spaces.h" #include <libfdt.h> #include "trace.h" +#include "hw/misc/vfio.h" #include "hw/pci/pci_bus.h" @@ -496,7 +500,11 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn) return &phb->iommu_as; } -static int spapr_phb_init(SysBusDevice *s) +/* + * This is the common initialization part for both emulated and VFIO PHBs + * which includes everything but DMA and device scan (optional, VFIO only). 
+ */ +static int _spapr_phb_init(SysBusDevice *s) { DeviceState *dev = DEVICE(s); sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s); @@ -610,19 +618,6 @@ static int spapr_phb_init(SysBusDevice *s) PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS); phb->bus = bus; - sphb->dma_window_start = 0; - sphb->dma_window_size = 0x40000000; - sphb->tcet = spapr_tce_new_table(dev, sphb->dma_liobn, - sphb->dma_window_size); - if (!sphb->tcet) { - fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname); - return -1; - } - address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet), - sphb->dtbusname); - - pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb); - QLIST_INSERT_HEAD(&spapr->phbs, sphb, list); /* Initialize the LSI table */ @@ -641,6 +636,30 @@ static int spapr_phb_init(SysBusDevice *s) return 0; } +static int spapr_phb_init(SysBusDevice *s) +{ + sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s); + int ret; + + ret = _spapr_phb_init(s); + if (ret) + return ret; + + sphb->dma_window_start = 0; + sphb->dma_window_size = 0x40000000; + sphb->tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn, + sphb->dma_window_size); + if (!sphb->tcet) { + fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname); + return -1; + } + address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet), + sphb->dtbusname); + pci_setup_iommu(sphb->parent_obj.bus, spapr_pci_dma_iommu, sphb); + + return 0; +} + static void spapr_phb_reset(DeviceState *qdev) { SysBusDevice *s = SYS_BUS_DEVICE(qdev); @@ -749,6 +768,163 @@ PCIHostState *spapr_create_phb(sPAPREnvironment *spapr, int index) return PCI_HOST_BRIDGE(dev); } +/* sPAPR VFIO */ +static Property spapr_phb_vfio_properties[] = { + DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1), + DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1), + DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0), + DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0), + 
DEFINE_PROP_END_OF_LIST(), +}; + +static int spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb) +{ + PCIHostState *phb = PCI_HOST_BRIDGE(svphb); + char *iommupath; + DIR *dirp; + struct dirent *entry; + + if (!svphb->scan) { + trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname); + return 0; + } + + iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/", + svphb->iommugroupid); + if (!iommupath) { + return -ENOMEM; + } + + dirp = opendir(iommupath); + if (!dirp) { + fprintf(stderr, "failed to scan group=%d\n", svphb->iommugroupid); + g_free(iommupath); + return -1; + } + + while ((entry = readdir(dirp)) != NULL) { + Error *err = NULL; + char *tmp; + FILE *deviceclassfile; + unsigned deviceclass = 0, domainid, busid, devid, fnid; + char addr[32]; + DeviceState *dev; + + if (sscanf(entry->d_name, "%X:%X:%X.%x", + &domainid, &busid, &devid, &fnid) != 4) { + continue; + } + + tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name); + trace_spapr_pci("Reading device class from ", tmp); + + deviceclassfile = fopen(tmp, "r"); + if (deviceclassfile) { + int ret = fscanf(deviceclassfile, "%x", &deviceclass); + fclose(deviceclassfile); + if (ret != 1) { + continue; + } + } + g_free(tmp); + + if (!deviceclass) { + continue; + } + if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) { + /* Skip bridges */ + continue; + } + trace_spapr_pci("Creating device from ", entry->d_name); + + dev = qdev_create(&phb->bus->qbus, "vfio-pci"); + if (!dev) { + fprintf(stderr, "failed to create vfio-pci\n"); + continue; + } + qdev_prop_parse(dev, "host", entry->d_name, &err); + if (err != NULL) { + continue; + } + if (svphb->force_addr) { + snprintf(addr, sizeof(addr), "%x.%x", devid, fnid); + err = NULL; + qdev_prop_parse(dev, "addr", addr, &err); + if (err != NULL) { + continue; + } + } + if (svphb->enable_multifunction) { + qdev_prop_set_bit(dev, "multifunction", 1); + } + qdev_init_nofail(dev); + } + closedir(dirp); + g_free(iommupath); + + return 0; +} + 
+static int spapr_phb_vfio_init(SysBusDevice *s) +{ + sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(s); + sPAPRPHBState *sphb = &svphb->phb; + struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) }; + int ret, group_fd; + + if (svphb->iommugroupid == -1) { + fprintf(stderr, "Wrong IOMMU group ID %d\n", svphb->iommugroupid); + return -1; + } + + ret = _spapr_phb_init(s); + if (ret) { + return ret; + } + + ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as, + svphb->iommugroupid, + &info, &group_fd); + if (ret) + return ret; + + svphb->phb.dma_window_start = info.dma32_window_start; + svphb->phb.dma_window_size = info.dma32_window_size; + svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn, + group_fd); + + address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet), + sphb->dtbusname); + pci_setup_iommu(sphb->parent_obj.bus, spapr_pci_dma_iommu, sphb); + + ret = spapr_pci_vfio_scan(svphb); + + return ret; +} + +static void spapr_phb_vfio_reset(DeviceState *qdev) +{ + /* Do nothing */ +} + +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data) +{ + SysBusDeviceClass *sdc = SYS_BUS_DEVICE_CLASS(klass); + DeviceClass *dc = DEVICE_CLASS(klass); + + sdc->init = spapr_phb_vfio_init; + dc->props = spapr_phb_vfio_properties; + dc->reset = spapr_phb_vfio_reset; + dc->vmsd = &vmstate_spapr_pci; +} + +static const TypeInfo spapr_phb_vfio_info = { + .name = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE, + .parent = TYPE_SPAPR_PCI_HOST_BRIDGE, + .instance_size = sizeof(sPAPRPHBVFIOState), + .class_init = spapr_phb_vfio_class_init, +}; + /* Macros to operate with address in OF binding to PCI */ #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p)) #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */ @@ -839,6 +1015,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb, _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map, sizeof(interrupt_map))); + if (!phb->dma_window_size) { + fprintf(stderr, "Unexpected error: DMA 
window is zero, exiting\n"); + exit(1); + } spapr_dma_dt(fdt, bus_off, "ibm,dma-window", phb->dma_liobn, phb->dma_window_start, phb->dma_window_size); @@ -862,6 +1042,7 @@ void spapr_pci_rtas_init(void) static void spapr_pci_register_types(void) { type_register_static(&spapr_phb_info); + type_register_static(&spapr_phb_vfio_info); } type_init(spapr_pci_register_types) diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h index 970b4a9..fab18e5 100644 --- a/include/hw/pci-host/spapr.h +++ b/include/hw/pci-host/spapr.h @@ -30,10 +30,14 @@ #define SPAPR_MSIX_MAX_DEVS 32 #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge" +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge" #define SPAPR_PCI_HOST_BRIDGE(obj) \ OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE) +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \ + OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE) + typedef struct sPAPRPHBState { PCIHostState parent_obj; @@ -64,6 +68,14 @@ typedef struct sPAPRPHBState { QLIST_ENTRY(sPAPRPHBState) list; } sPAPRPHBState; +typedef struct sPAPRPHBVFIOState { + sPAPRPHBState phb; + + struct VFIOContainer *container; + int32_t iommugroupid; + uint8_t scan, enable_multifunction, force_addr; +} sPAPRPHBVFIOState; + #define SPAPR_PCI_BASE_BUID 0x800000020000000ULL #define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h index 2dc3d06..a64e58a 100644 --- a/include/hw/ppc/spapr.h +++ b/include/hw/ppc/spapr.h @@ -353,12 +353,29 @@ int spapr_rtas_device_tree_setup(void *fdt, hwaddr rtas_addr, #define RTAS_ERROR_LOG_MAX 2048 +typedef struct sPAPRTCETableClass sPAPRTCETableClass; typedef struct sPAPRTCETable sPAPRTCETable; #define TYPE_SPAPR_TCE_TABLE "spapr-tce-table" #define SPAPR_TCE_TABLE(obj) \ OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE) +#define TYPE_SPAPR_TCE_TABLE_VFIO "spapr-tce-table-vfio" +#define SPAPR_TCE_TABLE_VFIO(obj) \ + 
OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE_VFIO) + +#define SPAPR_TCE_TABLE_CLASS(klass) \ + OBJECT_CLASS_CHECK(sPAPRTCETableClass, (klass), TYPE_SPAPR_TCE_TABLE) +#define SPAPR_TCE_TABLE_GET_CLASS(obj) \ + OBJECT_GET_CLASS(sPAPRTCETableClass, (obj), TYPE_SPAPR_TCE_TABLE) + +struct sPAPRTCETableClass { + DeviceClass parent_class; + + target_ulong (*put_tce)(sPAPRTCETable *tcet, target_ulong ioba, + target_ulong tce); +}; + struct sPAPRTCETable { DeviceState parent; uint32_t liobn; @@ -375,6 +392,8 @@ void spapr_events_init(sPAPREnvironment *spapr); void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq); sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn, size_t window_size); +sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn, + int group_fd); MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet); void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass); int spapr_dma_dt(void *fdt, int node_off, const char *propname, diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c index 3d0e398..eb59d7d 100644 --- a/target-ppc/kvm.c +++ b/target-ppc/kvm.c @@ -61,6 +61,7 @@ static int cap_ppc_smt; static int cap_ppc_rma; static int cap_spapr_tce; static int cap_spapr_multitce; +static int cap_spapr_tce_iommu; static int cap_hior; static int cap_one_reg; static int cap_epr; @@ -98,6 +99,7 @@ int kvm_arch_init(KVMState *s) cap_ppc_rma = kvm_check_extension(s, KVM_CAP_PPC_RMA); cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE); cap_spapr_multitce = kvm_check_extension(s, KVM_CAP_SPAPR_MULTITCE); + cap_spapr_tce_iommu = kvm_check_extension(s, KVM_CAP_SPAPR_TCE_IOMMU); cap_one_reg = kvm_check_extension(s, KVM_CAP_ONE_REG); cap_hior = kvm_check_extension(s, KVM_CAP_PPC_HIOR); cap_epr = kvm_check_extension(s, KVM_CAP_PPC_EPR); @@ -1669,6 +1671,37 @@ int kvmppc_remove_spapr_tce(void *table, int fd, uint32_t window_size) return 0; } +int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd) +{ + int fd = 0; + struct 
kvm_create_spapr_tce_iommu args = { + .liobn = liobn, + .fd = group_fd + }; + + if (!kvm_enabled() || !cap_spapr_tce_iommu) { + fprintf(stderr, "KVM VFIO: TCE IOMMU capability is not present, DMA may be slow\n"); + return -1; + } + + fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_SPAPR_TCE_IOMMU, &args); + if (fd < 0) { + fprintf(stderr, "KVM VFIO: Failed to create TCE table for liobn 0x%x, ret = %d, DMA may be slow\n", + liobn, fd); + } + + return fd; +} + +int kvmppc_remove_spapr_tce_iommu(int fd) +{ + if (fd < 0) { + return -1; + } + + return close(fd); +} + int kvmppc_reset_htab(int shift_hint) { uint32_t shift = shift_hint; diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h index a2a903f..a223e63 100644 --- a/target-ppc/kvm_ppc.h +++ b/target-ppc/kvm_ppc.h @@ -34,6 +34,8 @@ off_t kvmppc_alloc_rma(const char *name, MemoryRegion *sysmem); bool kvmppc_spapr_use_multitce(void); void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t window_size, int *pfd); int kvmppc_remove_spapr_tce(void *table, int pfd, uint32_t window_size); +int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd); +int kvmppc_remove_spapr_tce_iommu(int fd); int kvmppc_reset_htab(int shift_hint); uint64_t kvmppc_rma_size(uint64_t current_size, unsigned int hash_shift); #endif /* !CONFIG_USER_ONLY */ @@ -144,6 +146,16 @@ static inline int kvmppc_remove_spapr_tce(void *table, int pfd, return -1; } +static inline int kvmppc_create_spapr_tce_iommu(uint32_t liobn, uint32_t iommu_id) +{ + return -1; +} + +static inline int kvmppc_remove_spapr_tce_iommu(int fd) +{ + return -1; +} + static inline int kvmppc_reset_htab(int shift_hint) { return -1; diff --git a/trace-events b/trace-events index 3856b5c..d1e54ad 100644 --- a/trace-events +++ b/trace-events @@ -1113,6 +1113,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride, qxl_render_update_area_done(void *cookie) "%p" # hw/ppc/spapr_pci.c +spapr_pci(const char *msg1, const char *msg2) "%s%s" spapr_pci_msi(const 
char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)" spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64 spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u" @@ -1133,6 +1134,9 @@ xics_ics_write_xive(int nr, int srcno, int server, uint8_t priority) "ics_write_ xics_ics_reject(int nr, int srcno) "reject irq %#x [src %d]" xics_ics_eoi(int nr) "ics_eoi: irq %#x" +# hw/ppc/spapr_iommu.c +spapr_iommu(const char *op, uint32_t liobn, uint64_t ioba, uint64_t tce, int ret) "%s %x ioba=%"PRIx64" tce=%"PRIx64" ret=%d" + # util/hbitmap.c hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long cur) "hb %p hbi %p pos %"PRId64" cur 0x%lx" hbitmap_reset(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64 -- 1.8.3.2 ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio 2013-08-07 8:21 ` [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy @ 2013-08-27 11:08 ` Alexander Graf 2013-08-30 7:43 ` Alexey Kardashevskiy 0 siblings, 1 reply; 21+ messages in thread From: Alexander Graf @ 2013-08-27 11:08 UTC (permalink / raw) To: Alexey Kardashevskiy Cc: Anthony Liguori, Michael S . Tsirkin, qemu-devel@nongnu.org qemu-devel, Alex Williamson, qemu-ppc@nongnu.org list:PowerPC, Paolo Bonzini, Paul Mackerras, Andreas Färber, David Gibson On 07.08.2013, at 10:21, Alexey Kardashevskiy wrote: > The patch adds a spapr-pci-vfio-host-bridge device type > which is a PCI Host Bridge with VFIO support. The new device > inherits from the spapr-pci-host-bridge device and adds > the following properties: > iommu - IOMMU group ID which represents a Partitionable > Endpoint, QEMU/ppc64 uses a separate PHB for > an IOMMU group so the guest kernel has to have > PCI Domain support enabled. > forceaddr (optional, 0 by default) - forces QEMU to copy > device:function from the host address as > certain guest drivers expect devices to appear in > particular locations; > mf (optional, 0 by default) - forces multifunction bit for > the function #0 of a found device, only makes sense > for multifunction devices and only with the forceaddr > property set. It would not be required if there > was a way to know in advance whether a device is > multifunctional or not. > scan (optional, 1 by default) - if non-zero, the new PHB walks > through all non-bridge devices in the group and tries > adding them to the PHB; if zero, all devices in the group > have to be configured manually via the QEMU command line. > > The patch also adds a VFIO IOMMU type support to the existing > sPAPR TCE list in spapr_iommu.c. 
> > The patch also uses the host kernel's support for the new KVM_CAP_SPAPR_TCE_IOMMU > capability and KVM_CREATE_SPAPR_TCE_IOMMU ioctl, which let QEMU tell > the host which LIOBN is used for an IOMMU group. This ioctl enables real-mode > handling of TCE requests, which improves actual throughput by 2.5-5 times. > > Examples: > 1) Scan and add all devices from the IOMMU group with ID=1 to QEMU's PHB #6: > -device spapr-pci-vfio-host-bridge,id=DEVICENAME,iommu=1,index=6 > > 2) Configure and add 3 functions of a multifunction device to QEMU > (an NEC PCI USB card is used as an example here): > -device spapr-pci-vfio-host-bridge,id=USB,iommu=4,scan=0,index=7 \ > -device vfio-pci,host=4:0:1.0,addr=1.0,bus=USB,multifunction=true > -device vfio-pci,host=4:0:1.1,addr=1.1,bus=USB > -device vfio-pci,host=4:0:1.2,addr=1.2,bus=USB > > Cc: David Gibson <david@gibson.dropbear.id.au> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- > hw/ppc/spapr_iommu.c | 176 ++++++++++++++++++++++++++++++++----- > hw/ppc/spapr_pci.c | 209 +++++++++++++++++++++++++++++++++++++++++--- > include/hw/pci-host/spapr.h | 12 +++ > include/hw/ppc/spapr.h | 19 ++++ > target-ppc/kvm.c | 33 +++++++ > target-ppc/kvm_ppc.h | 12 +++ > trace-events | 4 + > 7 files changed, 429 insertions(+), 36 deletions(-) > > diff --git a/hw/ppc/spapr_iommu.c b/hw/ppc/spapr_iommu.c > index 22b09be..096b6a9 100644 > --- a/hw/ppc/spapr_iommu.c > +++ b/hw/ppc/spapr_iommu.c > @@ -16,12 +16,14 @@ > * You should have received a copy of the GNU Lesser General Public > * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> */ > + > #include "hw/hw.h" > #include "sysemu/kvm.h" > #include "hw/qdev.h" > #include "kvm_ppc.h" > #include "sysemu/dma.h" > #include "exec/address-spaces.h" > +#include "trace.h" > > #include "hw/ppc/spapr.h" > > @@ -244,6 +246,74 @@ static target_ulong put_tce_emu(sPAPRTCETable *tcet, target_ulong ioba, > return H_SUCCESS; > } > > +static IOMMUTLBEntry spapr_vfio_translate_iommu(MemoryRegion *iommu, hwaddr addr) > +{ > + IOMMUTLBEntry entry; > + /* Must never be called */ > + assert(0); > + return entry; > +} > + > +static MemoryRegionIOMMUOps spapr_vfio_iommu_ops = { > + .translate = spapr_vfio_translate_iommu, > +}; > + > +static int spapr_tce_table_vfio_realize(DeviceState *dev) > +{ > + sPAPRTCETable *tcet = SPAPR_TCE_TABLE(dev); > + > + memory_region_init_iommu(&tcet->iommu, NULL, &spapr_vfio_iommu_ops, > + "iommu-vfio-spapr", (uint64_t)INT64_MAX+1); > + > + QLIST_INSERT_HEAD(&spapr_tce_tables, tcet, list); > + > + return 0; > +} > + > +sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn, > + int group_fd) > +{ > + sPAPRTCETable *tcet; > + int fd; > + > + if (spapr_tce_find_by_liobn(liobn)) { > + fprintf(stderr, "Attempted to create TCE table with duplicate" > + " LIOBN 0x%x\n", liobn); > + return NULL; > + } > + > + fd = kvmppc_create_spapr_tce_iommu(liobn, group_fd); > + > + tcet = SPAPR_TCE_TABLE(object_new(TYPE_SPAPR_TCE_TABLE_VFIO)); > + tcet->liobn = liobn; > + tcet->fd = fd; > + object_property_add_child(OBJECT(owner), "tce-table", OBJECT(tcet), NULL); > + > + qdev_init_nofail(DEVICE(tcet)); > + > + return tcet; > +} > + > +static target_ulong put_tce_vfio(sPAPRTCETable *tcet, target_ulong ioba, > + target_ulong tce) > +{ > + IOMMUTLBEntry entry; > + > + entry.iova = ioba & ~SPAPR_TCE_PAGE_MASK; > + entry.translated_addr = tce & ~SPAPR_TCE_PAGE_MASK; > + entry.addr_mask = SPAPR_TCE_PAGE_MASK; > + entry.perm = 0; > + if ((tce & SPAPR_TCE_RO) == SPAPR_TCE_RO) { > + entry.perm |= IOMMU_RO; > + } > + if ((tce & SPAPR_TCE_WO) == 
SPAPR_TCE_WO) { > + entry.perm |= IOMMU_WO; > + } > + memory_region_notify_iommu(&tcet->iommu, entry); > + > + return H_SUCCESS; > +} > + > static target_ulong h_put_tce_indirect(PowerPCCPU *cpu, > sPAPREnvironment *spapr, > target_ulong opcode, target_ulong *args) > @@ -255,18 +325,36 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu, > target_ulong npages = args[3]; > target_ulong ret = 0; > sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); > + sPAPRTCETableClass *info; > > - if (tcet) { > - for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { > - target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) + > - i * sizeof(target_ulong)); > - ret = put_tce_emu(tcet, ioba, tce); > - if (ret) { > - break; > - } > + if (!tcet) { > + return H_PARAMETER; > + } > + > + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); > + if (!info || !info->put_tce) { > + return H_PARAMETER; > + } > + > + if ((tce_list & SPAPR_TCE_PAGE_MASK) || (npages > 512)) { > + return H_PARAMETER; > + } > + > + if (liobn & 0xFFFFFFFF00000000ULL) { > + hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN " > + TARGET_FMT_lx "\n", liobn); > + return H_PARAMETER; > + } > + > + for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { > + target_ulong tce = ldq_phys((tce_list & ~SPAPR_TCE_PAGE_MASK) + > + i * sizeof(target_ulong)); > + ret = info->put_tce(tcet, ioba, tce); > + if (ret) { > + break; > } > - return ret; > } > + > #ifdef DEBUG_TCE > fprintf(stderr, "%s on liobn=" TARGET_FMT_lx > " ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx > @@ -274,7 +362,7 @@ static target_ulong h_put_tce_indirect(PowerPCCPU *cpu, > __func__, liobn, ioba, tce_list, ret); > #endif > > - return H_PARAMETER; > + return ret; > } > > static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > @@ -287,17 +375,30 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > target_ulong npages = args[3]; > target_ulong ret = 0; > sPAPRTCETable *tcet =
spapr_tce_find_by_liobn(liobn); > + sPAPRTCETableClass *info; > + > + if (!tcet) { > + return H_PARAMETER; > + } > + > + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); > + if (!info || !info->put_tce) { > + return H_PARAMETER; > + } > + > + if (liobn & 0xFFFFFFFF00000000ULL) { > + hcall_dprintf("spapr_vio_put_tce on out-of-bounds LIOBN " > + TARGET_FMT_lx "\n", liobn); > + return H_PARAMETER; > + } > > ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1); > > - if (tcet) { > - for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { > - ret = put_tce_emu(tcet, ioba, tce_value); > - if (ret) { > - break; > - } > + for (i = 0; i < npages; ++i, ioba += SPAPR_TCE_PAGE_SIZE) { > + ret = info->put_tce(tcet, ioba, tce_value); > + if (ret) { > + break; > } > - return ret; > } > #ifdef DEBUG_TCE > fprintf(stderr, "%s on liobn=" TARGET_FMT_lx > @@ -306,7 +407,7 @@ static target_ulong h_stuff_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > __func__, liobn, ioba, tce_value, ret); > #endif > > - return H_PARAMETER; > + return ret; > } > > static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > @@ -316,12 +417,21 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > target_ulong ioba = args[1]; > target_ulong tce = args[2]; > sPAPRTCETable *tcet = spapr_tce_find_by_liobn(liobn); > + target_ulong ret; > + sPAPRTCETableClass *info; > + > + if (!tcet) { > + return H_PARAMETER; > + } > + > + info = SPAPR_TCE_TABLE_GET_CLASS(tcet); > + if (!info || !info->put_tce) { > + return H_PARAMETER; > + } > > ioba &= ~(SPAPR_TCE_PAGE_SIZE - 1); > > - if (tcet) { > - return put_tce_emu(tcet, ioba, tce); > - } > + ret = info->put_tce(tcet, ioba, tce); > #ifdef DEBUG_TCE > fprintf(stderr, "%s on liobn=" TARGET_FMT_lx > " ioba 0x" TARGET_FMT_lx " TCE 0x" TARGET_FMT_lx > @@ -329,7 +439,7 @@ static target_ulong h_put_tce(PowerPCCPU *cpu, sPAPREnvironment *spapr, > __func__, liobn, ioba, tce, ret); > #endif > > - return H_PARAMETER; > + return ret; > } > > int spapr_dma_dt(void
*fdt, int node_off, const char *propname,
> @@ -376,9 +486,12 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>  static void spapr_tce_table_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRTCETableClass *k = SPAPR_TCE_TABLE_CLASS(klass);
> +
>      dc->vmsd = &vmstate_spapr_tce_table;
>      dc->init = spapr_tce_table_realize;
>      dc->reset = spapr_tce_reset;
> +    k->put_tce = put_tce_emu;
>
>      QLIST_INIT(&spapr_tce_tables);
>
> @@ -393,12 +506,31 @@ static TypeInfo spapr_tce_table_info = {
>      .parent = TYPE_DEVICE,
>      .instance_size = sizeof(sPAPRTCETable),
>      .class_init = spapr_tce_table_class_init,
> +    .class_size = sizeof(sPAPRTCETableClass),
>      .instance_finalize = spapr_tce_table_finalize,
>  };
>
> +static void spapr_tce_table_vfio_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRTCETableClass *k = SPAPR_TCE_TABLE_CLASS(klass);
> +
> +    dc->init = spapr_tce_table_vfio_realize;
> +    k->put_tce = put_tce_vfio;
> +}
> +
> +static TypeInfo spapr_tce_table_vfio_info = {
> +    .name = TYPE_SPAPR_TCE_TABLE_VFIO,
> +    .parent = TYPE_SPAPR_TCE_TABLE,
> +    .instance_size = sizeof(sPAPRTCETable),
> +    .class_init = spapr_tce_table_vfio_class_init,
> +    .class_size = sizeof(sPAPRTCETableClass),
> +};
> +
>  static void register_types(void)
>  {
>      type_register_static(&spapr_tce_table_info);
> +    type_register_static(&spapr_tce_table_vfio_info);
>  }
>
>  type_init(register_types);
> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
> index 869ca43..3f37cac 100644
> --- a/hw/ppc/spapr_pci.c
> +++ b/hw/ppc/spapr_pci.c

I think we should move the vfio phb into a separate file and make it be a
proper subclass without even the chance to randomly call normal spapr pci
functions ;).

Andreas, could you please check through this and see if you can spot a way
to isolate it out?


Alex

> @@ -22,6 +22,9 @@
>   * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>   * THE SOFTWARE.
>   */
> +#include <sys/types.h>
> +#include <dirent.h>
> +
>  #include "hw/hw.h"
>  #include "hw/pci/pci.h"
>  #include "hw/pci/msi.h"
> @@ -32,6 +35,7 @@
>  #include "exec/address-spaces.h"
>  #include <libfdt.h>
>  #include "trace.h"
> +#include "hw/misc/vfio.h"
>
>  #include "hw/pci/pci_bus.h"
>
> @@ -496,7 +500,11 @@ static AddressSpace *spapr_pci_dma_iommu(PCIBus *bus, void *opaque, int devfn)
>      return &phb->iommu_as;
>  }
>
> -static int spapr_phb_init(SysBusDevice *s)
> +/*
> + * This is the common initialization part for both emulated and VFIO PHBs
> + * which includes everything but DMA and device scan (optional, VFIO only).
> + */
> +static int _spapr_phb_init(SysBusDevice *s)
>  {
>      DeviceState *dev = DEVICE(s);
>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
> @@ -610,19 +618,6 @@ static int spapr_phb_init(SysBusDevice *s)
>                               PCI_DEVFN(0, 0), PCI_NUM_PINS, TYPE_PCI_BUS);
>      phb->bus = bus;
>
> -    sphb->dma_window_start = 0;
> -    sphb->dma_window_size = 0x40000000;
> -    sphb->tcet = spapr_tce_new_table(dev, sphb->dma_liobn,
> -                                     sphb->dma_window_size);
> -    if (!sphb->tcet) {
> -        fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
> -        return -1;
> -    }
> -    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> -                       sphb->dtbusname);
> -
> -    pci_setup_iommu(bus, spapr_pci_dma_iommu, sphb);
> -
>      QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
>
>      /* Initialize the LSI table */
> @@ -641,6 +636,30 @@ static int spapr_phb_init(SysBusDevice *s)
>      return 0;
>  }
>
> +static int spapr_phb_init(SysBusDevice *s)
> +{
> +    sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
> +    int ret;
> +
> +    ret = _spapr_phb_init(s);
> +    if (ret)
> +        return ret;
> +
> +    sphb->dma_window_start = 0;
> +    sphb->dma_window_size = 0x40000000;
> +    sphb->tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn,
> +                                     sphb->dma_window_size);
> +    if (!sphb->tcet) {
> +        fprintf(stderr, "Unable to create TCE table for %s\n", sphb->dtbusname);
> +        return -1;
> +    }
> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> +                       sphb->dtbusname);
> +    pci_setup_iommu(sphb->parent_obj.bus, spapr_pci_dma_iommu, sphb);
> +
> +    return 0;
> +}
> +
>  static void spapr_phb_reset(DeviceState *qdev)
>  {
>      SysBusDevice *s = SYS_BUS_DEVICE(qdev);
> @@ -749,6 +768,163 @@ PCIHostState *spapr_create_phb(sPAPREnvironment *spapr, int index)
>      return PCI_HOST_BRIDGE(dev);
>  }
>
> +/* sPAPR VFIO */
> +static Property spapr_phb_vfio_properties[] = {
> +    DEFINE_PROP_INT32("iommu", sPAPRPHBVFIOState, iommugroupid, -1),
> +    DEFINE_PROP_UINT8("scan", sPAPRPHBVFIOState, scan, 1),
> +    DEFINE_PROP_UINT8("mf", sPAPRPHBVFIOState, enable_multifunction, 0),
> +    DEFINE_PROP_UINT8("forceaddr", sPAPRPHBVFIOState, force_addr, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static int spapr_pci_vfio_scan(sPAPRPHBVFIOState *svphb)
> +{
> +    PCIHostState *phb = PCI_HOST_BRIDGE(svphb);
> +    char *iommupath;
> +    DIR *dirp;
> +    struct dirent *entry;
> +
> +    if (!svphb->scan) {
> +        trace_spapr_pci("autoscan disabled for ", svphb->phb.dtbusname);
> +        return 0;
> +    }
> +
> +    iommupath = g_strdup_printf("/sys/kernel/iommu_groups/%d/devices/",
> +                                svphb->iommugroupid);
> +    if (!iommupath) {
> +        return -ENOMEM;
> +    }
> +
> +    dirp = opendir(iommupath);
> +    if (!dirp) {
> +        fprintf(stderr, "failed to scan group=%d\n", svphb->iommugroupid);
> +        g_free(iommupath);
> +        return -1;
> +    }
> +
> +    while ((entry = readdir(dirp)) != NULL) {
> +        Error *err = NULL;
> +        char *tmp;
> +        FILE *deviceclassfile;
> +        unsigned deviceclass = 0, domainid, busid, devid, fnid;
> +        char addr[32];
> +        DeviceState *dev;
> +
> +        if (sscanf(entry->d_name, "%X:%X:%X.%x",
> +                   &domainid, &busid, &devid, &fnid) != 4) {
> +            continue;
> +        }
> +
> +        tmp = g_strdup_printf("%s%s/class", iommupath, entry->d_name);
> +        trace_spapr_pci("Reading device class from ", tmp);
> +
> +        deviceclassfile = fopen(tmp, "r");
> +        if (deviceclassfile) {
> +            int ret = fscanf(deviceclassfile, "%x", &deviceclass);
> +            fclose(deviceclassfile);
> +            if (ret != 1) {
> +                continue;
> +            }
> +        }
> +        g_free(tmp);
> +
> +        if (!deviceclass) {
> +            continue;
> +        }
> +        if ((deviceclass >> 16) == (PCI_CLASS_BRIDGE_OTHER >> 8)) {
> +            /* Skip bridges */
> +            continue;
> +        }
> +        trace_spapr_pci("Creating device from ", entry->d_name);
> +
> +        dev = qdev_create(&phb->bus->qbus, "vfio-pci");
> +        if (!dev) {
> +            fprintf(stderr, "failed to create vfio-pci\n");
> +            continue;
> +        }
> +        qdev_prop_parse(dev, "host", entry->d_name, &err);
> +        if (err != NULL) {
> +            continue;
> +        }
> +        if (svphb->force_addr) {
> +            snprintf(addr, sizeof(addr), "%x.%x", devid, fnid);
> +            err = NULL;
> +            qdev_prop_parse(dev, "addr", addr, &err);
> +            if (err != NULL) {
> +                continue;
> +            }
> +        }
> +        if (svphb->enable_multifunction) {
> +            qdev_prop_set_bit(dev, "multifunction", 1);
> +        }
> +        qdev_init_nofail(dev);
> +    }
> +    closedir(dirp);
> +    g_free(iommupath);
> +
> +    return 0;
> +}
> +
> +static int spapr_phb_vfio_init(SysBusDevice *s)
> +{
> +    sPAPRPHBVFIOState *svphb = SPAPR_PCI_VFIO_HOST_BRIDGE(s);
> +    sPAPRPHBState *sphb = &svphb->phb;
> +    struct vfio_iommu_spapr_tce_info info = { .argsz = sizeof(info) };
> +    int ret, group_fd;
> +
> +    if (svphb->iommugroupid == -1) {
> +        fprintf(stderr, "Wrong IOMMU group ID %d\n", svphb->iommugroupid);
> +        return -1;
> +    }
> +
> +    ret = _spapr_phb_init(s);
> +    if (ret) {
> +        return ret;
> +    }
> +
> +    ret = vfio_container_spapr_get_info(&svphb->phb.iommu_as,
> +                                        svphb->iommugroupid,
> +                                        &info, &group_fd);
> +    if (ret)
> +        return ret;
> +
> +    svphb->phb.dma_window_start = info.dma32_window_start;
> +    svphb->phb.dma_window_size = info.dma32_window_size;
> +    svphb->phb.tcet = spapr_vfio_new_table(DEVICE(sphb), svphb->phb.dma_liobn,
> +                                           group_fd);
> +
> +    address_space_init(&sphb->iommu_as, spapr_tce_get_iommu(sphb->tcet),
> +                       sphb->dtbusname);
> +    pci_setup_iommu(sphb->parent_obj.bus, spapr_pci_dma_iommu, sphb);
> +
> +    ret = spapr_pci_vfio_scan(svphb);
> +
> +    return ret;
> +}
> +
> +static void spapr_phb_vfio_reset(DeviceState *qdev)
> +{
> +    /* Do nothing */
> +}
> +
> +static void spapr_phb_vfio_class_init(ObjectClass *klass, void *data)
> +{
> +    SysBusDeviceClass *sdc = SYS_BUS_DEVICE_CLASS(klass);
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    sdc->init = spapr_phb_vfio_init;
> +    dc->props = spapr_phb_vfio_properties;
> +    dc->reset = spapr_phb_vfio_reset;
> +    dc->vmsd = &vmstate_spapr_pci;
> +}
> +
> +static const TypeInfo spapr_phb_vfio_info = {
> +    .name = TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE,
> +    .parent = TYPE_SPAPR_PCI_HOST_BRIDGE,
> +    .instance_size = sizeof(sPAPRPHBVFIOState),
> +    .class_init = spapr_phb_vfio_class_init,
> +};
> +
>  /* Macros to operate with address in OF binding to PCI */
>  #define b_x(x, p, l) (((x) & ((1<<(l))-1)) << (p))
>  #define b_n(x) b_x((x), 31, 1) /* 0 if relocatable */
> @@ -839,6 +1015,10 @@ int spapr_populate_pci_dt(sPAPRPHBState *phb,
>      _FDT(fdt_setprop(fdt, bus_off, "interrupt-map", &interrupt_map,
>                       sizeof(interrupt_map)));
>
> +    if (!phb->dma_window_size) {
> +        fprintf(stderr, "Unexpected error: DMA window is zero, exiting\n");
> +        exit(1);
> +    }
>      spapr_dma_dt(fdt, bus_off, "ibm,dma-window",
>                   phb->dma_liobn, phb->dma_window_start,
>                   phb->dma_window_size);
> @@ -862,6 +1042,7 @@ void spapr_pci_rtas_init(void)
>  static void spapr_pci_register_types(void)
>  {
>      type_register_static(&spapr_phb_info);
> +    type_register_static(&spapr_phb_vfio_info);
>  }
>
>  type_init(spapr_pci_register_types)
> diff --git a/include/hw/pci-host/spapr.h b/include/hw/pci-host/spapr.h
> index 970b4a9..fab18e5 100644
> --- a/include/hw/pci-host/spapr.h
> +++ b/include/hw/pci-host/spapr.h
> @@ -30,10 +30,14 @@
>  #define SPAPR_MSIX_MAX_DEVS 32
>
>  #define TYPE_SPAPR_PCI_HOST_BRIDGE "spapr-pci-host-bridge"
> +#define TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE "spapr-pci-vfio-host-bridge"
>
>  #define SPAPR_PCI_HOST_BRIDGE(obj) \
>      OBJECT_CHECK(sPAPRPHBState, (obj), TYPE_SPAPR_PCI_HOST_BRIDGE)
>
> +#define SPAPR_PCI_VFIO_HOST_BRIDGE(obj) \
> +    OBJECT_CHECK(sPAPRPHBVFIOState, (obj), TYPE_SPAPR_PCI_VFIO_HOST_BRIDGE)
> +
>  typedef struct sPAPRPHBState {
>      PCIHostState parent_obj;
>
> @@ -64,6 +68,14 @@ typedef struct sPAPRPHBState {
>      QLIST_ENTRY(sPAPRPHBState) list;
>  } sPAPRPHBState;
>
> +typedef struct sPAPRPHBVFIOState {
> +    sPAPRPHBState phb;
> +
> +    struct VFIOContainer *container;
> +    int32_t iommugroupid;
> +    uint8_t scan, enable_multifunction, force_addr;
> +} sPAPRPHBVFIOState;
> +
>  #define SPAPR_PCI_BASE_BUID 0x800000020000000ULL
>
>  #define SPAPR_PCI_WINDOW_BASE 0x10000000000ULL
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 2dc3d06..a64e58a 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -353,12 +353,29 @@ int spapr_rtas_device_tree_setup(void *fdt, hwaddr rtas_addr,
>
>  #define RTAS_ERROR_LOG_MAX 2048
>
> +typedef struct sPAPRTCETableClass sPAPRTCETableClass;
>  typedef struct sPAPRTCETable sPAPRTCETable;
>
>  #define TYPE_SPAPR_TCE_TABLE "spapr-tce-table"
>  #define SPAPR_TCE_TABLE(obj) \
>      OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE)
>
> +#define TYPE_SPAPR_TCE_TABLE_VFIO "spapr-tce-table-vfio"
> +#define SPAPR_TCE_TABLE_VFIO(obj) \
> +    OBJECT_CHECK(sPAPRTCETable, (obj), TYPE_SPAPR_TCE_TABLE_VFIO)
> +
> +#define SPAPR_TCE_TABLE_CLASS(klass) \
> +    OBJECT_CLASS_CHECK(sPAPRTCETableClass, (klass), TYPE_SPAPR_TCE_TABLE)
> +#define SPAPR_TCE_TABLE_GET_CLASS(obj) \
> +    OBJECT_GET_CLASS(sPAPRTCETableClass, (obj), TYPE_SPAPR_TCE_TABLE)
> +
> +struct sPAPRTCETableClass {
> +    DeviceClass parent_class;
> +
> +    target_ulong (*put_tce)(sPAPRTCETable *tcet, target_ulong ioba,
> +                            target_ulong tce);
> +};
> +
>  struct sPAPRTCETable {
>      DeviceState parent;
>      uint32_t liobn;
> @@ -375,6 +392,8 @@ void spapr_events_init(sPAPREnvironment *spapr);
>  void spapr_events_fdt_skel(void *fdt, uint32_t epow_irq);
>  sPAPRTCETable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn,
>                                     size_t window_size);
> +sPAPRTCETable *spapr_vfio_new_table(DeviceState *owner, uint32_t liobn,
> +                                    int group_fd);
>  MemoryRegion *spapr_tce_get_iommu(sPAPRTCETable *tcet);
>  void spapr_tce_set_bypass(sPAPRTCETable *tcet, bool bypass);
>  int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index 3d0e398..eb59d7d 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -61,6 +61,7 @@ static int cap_ppc_smt;
>  static int cap_ppc_rma;
>  static int cap_spapr_tce;
>  static int cap_spapr_multitce;
> +static int cap_spapr_tce_iommu;
>  static int cap_hior;
>  static int cap_one_reg;
>  static int cap_epr;
> @@ -98,6 +99,7 @@ int kvm_arch_init(KVMState *s)
>      cap_ppc_rma = kvm_check_extension(s, KVM_CAP_PPC_RMA);
>      cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE);
>      cap_spapr_multitce = kvm_check_extension(s, KVM_CAP_SPAPR_MULTITCE);
> +    cap_spapr_tce_iommu = kvm_check_extension(s, KVM_CAP_SPAPR_TCE_IOMMU);
>      cap_one_reg = kvm_check_extension(s, KVM_CAP_ONE_REG);
>      cap_hior = kvm_check_extension(s, KVM_CAP_PPC_HIOR);
>      cap_epr = kvm_check_extension(s, KVM_CAP_PPC_EPR);
> @@ -1669,6 +1671,37 @@ int kvmppc_remove_spapr_tce(void *table, int fd, uint32_t window_size)
>      return 0;
>  }
>
> +int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd)
> +{
> +    int fd = 0;
> +    struct kvm_create_spapr_tce_iommu args = {
> +        .liobn = liobn,
> +        .fd = group_fd
> +    };
> +
> +    if (!kvm_enabled() || !cap_spapr_tce_iommu) {
> +        fprintf(stderr, "KVM VFIO: TCE IOMMU capability is not present, DMA may be slow\n");
> +        return -1;
> +    }
> +
> +    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_SPAPR_TCE_IOMMU, &args);
> +    if (fd < 0) {
> +        fprintf(stderr, "KVM VFIO: Failed to create TCE table for liobn 0x%x, ret = %d, DMA may be slow\n",
> +                liobn, fd);
> +    }
> +
> +    return fd;
> +}
> +
> +int kvmppc_remove_spapr_tce_iommu(int fd)
> +{
> +    if (fd < 0) {
> +        return -1;
> +    }
> +
> +    return close(fd);
> +}
> +
>  int kvmppc_reset_htab(int shift_hint)
>  {
>      uint32_t shift = shift_hint;
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index a2a903f..a223e63 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -34,6 +34,8 @@ off_t kvmppc_alloc_rma(const char *name, MemoryRegion *sysmem);
>  bool kvmppc_spapr_use_multitce(void);
>  void *kvmppc_create_spapr_tce(uint32_t liobn, uint32_t window_size, int *pfd);
>  int kvmppc_remove_spapr_tce(void *table, int pfd, uint32_t window_size);
> +int kvmppc_create_spapr_tce_iommu(uint32_t liobn, int group_fd);
> +int kvmppc_remove_spapr_tce_iommu(int fd);
>  int kvmppc_reset_htab(int shift_hint);
>  uint64_t kvmppc_rma_size(uint64_t current_size, unsigned int hash_shift);
>  #endif /* !CONFIG_USER_ONLY */
> @@ -144,6 +146,16 @@ static inline int kvmppc_remove_spapr_tce(void *table, int pfd,
>      return -1;
>  }
>
> +static inline int kvmppc_create_spapr_tce_iommu(uint32_t liobn, uint32_t iommu_id)
> +{
> +    return -1;
> +}
> +
> +static inline int kvmppc_remove_spapr_tce_iommu(int fd)
> +{
> +    return -1;
> +}
> +
>  static inline int kvmppc_reset_htab(int shift_hint)
>  {
>      return -1;
> diff --git a/trace-events b/trace-events
> index 3856b5c..d1e54ad 100644
> --- a/trace-events
> +++ b/trace-events
> @@ -1113,6 +1113,7 @@ qxl_render_guest_primary_resized(int32_t width, int32_t height, int32_t stride,
>  qxl_render_update_area_done(void *cookie) "%p"
>
>  # hw/ppc/spapr_pci.c
> +spapr_pci(const char *msg1, const char *msg2) "%s%s"
>  spapr_pci_msi(const char *msg, uint32_t n, uint32_t ca) "%s (device#%d, cfg=%x)"
>  spapr_pci_msi_setup(const char *name, unsigned vector, uint64_t addr) "dev\"%s\" vector %u, addr=%"PRIx64
>  spapr_pci_rtas_ibm_change_msi(unsigned func, unsigned req) "func %u, requested %u"
> @@ -1133,6 +1134,9 @@ xics_ics_write_xive(int nr, int srcno, int server, uint8_t priority) "ics_write_
>  xics_ics_reject(int nr, int srcno) "reject irq %#x [src %d]"
>  xics_ics_eoi(int nr) "ics_eoi: irq %#x"
>
> +# hw/ppc/spapr_iommu.c
> +spapr_iommu(const char *op, uint32_t liobn, uint64_t ioba, uint64_t tce, int ret) "%s %x ioba=%"PRIx64" tce=%"PRIx64" ret=%d"
> +
>  # util/hbitmap.c
>  hbitmap_iter_skip_words(const void *hb, void *hbi, uint64_t pos, unsigned long cur) "hb %p hbi %p pos %"PRId64" cur 0x%lx"
>  hbitmap_reset(void *hb, uint64_t start, uint64_t count, uint64_t sbit, uint64_t ebit) "hb %p items %"PRIu64",%"PRIu64" bits %"PRIu64"..%"PRIu64
> --
> 1.8.3.2
* Re: [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
  2013-08-27 11:08 ` Alexander Graf
@ 2013-08-30  7:43 ` Alexey Kardashevskiy
  2013-08-30 13:01   ` Andreas Färber
  0 siblings, 1 reply; 21+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-30 7:43 UTC (permalink / raw)
To: Alexander Graf
Cc: Anthony Liguori, Michael S. Tsirkin, qemu-devel@nongnu.org,
    Alex Williamson, qemu-ppc@nongnu.org, Paolo Bonzini, Paul Mackerras,
    Andreas Färber, David Gibson

On 08/27/2013 09:08 PM, Alexander Graf wrote:
>> type_init(register_types);
>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>> index 869ca43..3f37cac 100644
>> --- a/hw/ppc/spapr_pci.c
>> +++ b/hw/ppc/spapr_pci.c
>
> I think we should move the vfio phb into a separate file and make it be a
> proper subclass without even the chance to randomly call normal spapr pci
> functions ;).
>
> Andreas, could you please check through this and see if you can spot a way
> to isolate it out?

After the lesson you both gave me with xics/xics-kvm, I am (more or less)
aware of what I need to change here so wait a bit till I post another
version.


-- 
Alexey
* Re: [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio
  2013-08-30  7:43 ` Alexey Kardashevskiy
@ 2013-08-30 13:01 ` Andreas Färber
  0 siblings, 0 replies; 21+ messages in thread
From: Andreas Färber @ 2013-08-30 13:01 UTC (permalink / raw)
To: Alexey Kardashevskiy
Cc: Anthony Liguori, Michael S. Tsirkin, Alexander Graf, qemu-devel,
    Alex Williamson, qemu-ppc, Paolo Bonzini, Paul Mackerras, David Gibson

Am 30.08.2013 09:43, schrieb Alexey Kardashevskiy:
> On 08/27/2013 09:08 PM, Alexander Graf wrote:
>>> type_init(register_types);
>>> diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
>>> index 869ca43..3f37cac 100644
>>> --- a/hw/ppc/spapr_pci.c
>>> +++ b/hw/ppc/spapr_pci.c
>>
>> I think we should move the vfio phb into a separate file and make it be a
>> proper subclass without even the chance to randomly call normal spapr pci
>> functions ;).
>>
>> Andreas, could you please check through this and see if you can spot a way
>> to isolate it out?

Just noticing this question now... (me not so into reviewing VFIO)

> After the lesson you both gave me with xics/xics-kvm, I am (more or less)
> aware of what I need to change here so wait a bit till I post another
> version.

Thanks.

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg
* [Qemu-devel] [PATCH 8/8] spapr vfio: enable for spapr
  2013-08-07  8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy
  ` (6 preceding siblings ...)
  2013-08-07  8:21 ` [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
@ 2013-08-07  8:21 ` Alexey Kardashevskiy
  7 siblings, 0 replies; 21+ messages in thread
From: Alexey Kardashevskiy @ 2013-08-07 8:21 UTC (permalink / raw)
To: qemu-devel
Cc: Anthony Liguori, Michael S. Tsirkin, Alexey Kardashevskiy,
    Alexander Graf, Alex Williamson, qemu-ppc, Paolo Bonzini,
    Paul Mackerras, David Gibson

This turns the sPAPR support on and enables VFIO container use in the
kernel.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 hw/misc/vfio.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/hw/misc/vfio.c b/hw/misc/vfio.c
index 4571f64..4484826 100644
--- a/hw/misc/vfio.c
+++ b/hw/misc/vfio.c
@@ -2812,6 +2812,37 @@ static int vfio_connect_container(VFIOGroup *group, VFIOAddressSpace *space)
         memory_listener_register(&container->iommu_data.listener,
                                  container->space->as);
+    } else if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_SPAPR_TCE_IOMMU)) {
+        ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd);
+        if (ret) {
+            error_report("vfio: failed to set group container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_SPAPR_TCE_IOMMU);
+        if (ret) {
+            error_report("vfio: failed to set iommu for container: %m");
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        ret = ioctl(fd, VFIO_IOMMU_ENABLE);
+        if (ret) {
+            error_report("vfio: failed to enable container: %s",
+                         strerror(errno));
+            g_free(container);
+            close(fd);
+            return -errno;
+        }
+
+        container->iommu_data.listener = vfio_memory_listener;
+        container->iommu_data.release = vfio_listener_release;
+
+        memory_listener_register(&container->iommu_data.listener,
+                                 container->space->as);
     } else {
         error_report("vfio: No available IOMMU models");
         g_free(container);
-- 
1.8.3.2
end of thread, other threads:[~2013-08-30 13:02 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2013-08-07  8:21 [Qemu-devel] [PATCH 0/8 v3] vfio on power: preparations for VFIO, guest IOMMUs and VFIO itself Alexey Kardashevskiy
2013-08-07  8:21 ` [Qemu-devel] [PATCH 1/8] pci: Introduce helper to retrieve a PCI device's DMA address space Alexey Kardashevskiy
2013-08-07 10:27   ` Michael S. Tsirkin
2013-08-07  8:21 ` [Qemu-devel] [PATCH 2/8] memory: Sanity check that no listeners remain on a destroyed AddressSpace Alexey Kardashevskiy
2013-08-07  8:21 ` [Qemu-devel] [PATCH 3/8] vfio: Introduce VFIO address spaces Alexey Kardashevskiy
2013-08-12 22:07   ` Alex Williamson
2013-08-19 13:15     ` Paolo Bonzini
2013-08-07  8:21 ` [Qemu-devel] [PATCH 4/8] vfio: Create VFIOAddressSpace objects as needed Alexey Kardashevskiy
2013-08-07  8:21 ` [Qemu-devel] [PATCH 5/8] vfio: Add guest side IOMMU support Alexey Kardashevskiy
2013-08-12 22:07   ` Alex Williamson
2013-08-15  6:02     ` Alexey Kardashevskiy
2013-08-19 13:12       ` Paolo Bonzini
2013-08-21  5:31         ` Alexey Kardashevskiy
2013-08-21  8:55           ` Paolo Bonzini
2013-08-07  8:21 ` [Qemu-devel] [PATCH 6/8] spapr vfio: add vfio_container_spapr_get_info() Alexey Kardashevskiy
2013-08-12 22:07   ` Alex Williamson
2013-08-07  8:21 ` [Qemu-devel] [PATCH 7/8] spapr vfio: add spapr-pci-vfio-host-bridge to support vfio Alexey Kardashevskiy
2013-08-27 11:08   ` Alexander Graf
2013-08-30  7:43     ` Alexey Kardashevskiy
2013-08-30 13:01       ` Andreas Färber
2013-08-07  8:21 ` [Qemu-devel] [PATCH 8/8] spapr vfio: enable for spapr Alexey Kardashevskiy