qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC v1 0/7] IOMMU support
@ 2012-10-11 13:26 Avi Kivity
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction Avi Kivity
                   ` (7 more replies)
  0 siblings, 8 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:26 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

These patches add IOMMU support to the memory core. IOMMUs can be added anywhere in
the memory hierarchy, and may be arranged in series.

Avi Kivity (7):
  memory: fix address space initialization/destruction
  memory: limit sections in the radix tree to the actual address space
    size
  memory: iommu support
  pci: switch iommu to using the memory API
  i440fx: add an iommu
  vfio: abort if an emulated iommu is used
  vhost: abort if an emulated iommu is used

 exec.c             |  43 ++++++++++++++++++---
 hw/pci.c           |  59 +++++++++++++++++-----------
 hw/pci.h           |   7 +++-
 hw/pci_internals.h |   5 ++-
 hw/piix_pci.c      |  74 +++++++++++++++++++++++++++++++++++
 hw/spapr.h         |   2 +
 hw/spapr_iommu.c   |  35 ++++++++---------
 hw/spapr_pci.c     |  26 +++++++++++--
 hw/spapr_pci.h     |   1 +
 hw/vfio_pci.c      |   2 +
 hw/vhost.c         |   2 +
 memory.c           | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 memory.h           |  46 ++++++++++++++++++++++
 13 files changed, 356 insertions(+), 56 deletions(-)

-- 
1.7.12

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
@ 2012-10-11 13:26 ` Avi Kivity
  2012-10-11 13:31   ` Paolo Bonzini
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 2/7] memory: limit sections in the radix tree to the actual address space size Avi Kivity
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:26 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

A couple of fields were left uninitialized.  This was not observed earlier
because all address spaces were statically allocated.  Also free allocation
for those fields.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 memory.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/memory.c b/memory.c
index 2f68d67..5df6177 100644
--- a/memory.c
+++ b/memory.c
@@ -1538,6 +1538,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root)
     as->root = root;
     as->current_map = g_new(FlatView, 1);
     flatview_init(as->current_map);
+    as->ioeventfd_nb = 0;
+    as->ioeventfds = NULL;
     QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
     as->name = NULL;
     memory_region_transaction_commit();
@@ -1554,6 +1556,7 @@ void address_space_destroy(AddressSpace *as)
     address_space_destroy_dispatch(as);
     flatview_destroy(as->current_map);
     g_free(as->current_map);
+    g_free(as->ioeventfds);
 }
 
 uint64_t io_mem_read(MemoryRegion *mr, target_phys_addr_t addr, unsigned size)
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 2/7] memory: limit sections in the radix tree to the actual address space size
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction Avi Kivity
@ 2012-10-11 13:26 ` Avi Kivity
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 3/7] memory: iommu support Avi Kivity
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:26 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

The radix tree is statically sized to fit TARGET_PHYS_ADDR_SPACE_BITS.
If a larger memory region is registered, it will overflow.

Fix by limiting any section in the radix tree to the supported size.

This problem was not observed earlier since artificial regions (containers
and aliases) are eliminated by the memory core, leaving only device regions
which have reasonable sizes.  An IOMMU however cannot be eliminated by the
memory core, and may have an artificial size.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 exec.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/exec.c b/exec.c
index 4bc79fb..328753d 100644
--- a/exec.c
+++ b/exec.c
@@ -2272,10 +2272,23 @@ static void register_multipage(AddressSpaceDispatch *d, MemoryRegionSection *sec
                   section_index);
 }
 
+static MemoryRegionSection limit(MemoryRegionSection section)
+{
+    unsigned practical_as_bits = MIN(TARGET_PHYS_ADDR_SPACE_BITS, 62);
+    target_phys_addr_t as_limit;
+
+    as_limit = (target_phys_addr_t)1 << practical_as_bits;
+
+    section.size = MIN(section.offset_within_address_space + section.size, as_limit)
+                   - section.offset_within_address_space;
+
+    return section;
+}
+
 static void mem_add(MemoryListener *listener, MemoryRegionSection *section)
 {
     AddressSpaceDispatch *d = container_of(listener, AddressSpaceDispatch, listener);
-    MemoryRegionSection now = *section, remain = *section;
+    MemoryRegionSection now = limit(*section), remain = limit(*section);
 
     if ((now.offset_within_address_space & ~TARGET_PAGE_MASK)
         || (now.size < TARGET_PAGE_SIZE)) {
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction Avi Kivity
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 2/7] memory: limit sections in the radix tree to the actual address space size Avi Kivity
@ 2012-10-11 13:26 ` Avi Kivity
  2012-10-11 13:42   ` Paolo Bonzini
  2012-10-11 14:29   ` Avi Kivity
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API Avi Kivity
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:26 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

Add a new memory region type that translates addresses it is given,
then forwards them to a target address space.  This is similar to
an alias, except that the mapping is more flexible than a linear
translation and trucation, and also less efficient since the
translation happens at runtime.

The implementation uses an AddressSpace mapping the target region to
avoid hierarchical dispatch all the way to the resolved region; only
iommu regions are looked up dynamically.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 exec.c   |  28 ++++++++++++++---
 memory.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 memory.h |  46 +++++++++++++++++++++++++++
 3 files changed, 177 insertions(+), 4 deletions(-)

diff --git a/exec.c b/exec.c
index 328753d..13df16c 100644
--- a/exec.c
+++ b/exec.c
@@ -3498,6 +3498,7 @@ void cpu_physical_memory_write_rom(target_phys_addr_t addr,
 
 typedef struct {
     void *buffer;
+    AddressSpace *as;
     target_phys_addr_t addr;
     target_phys_addr_t len;
 } BounceBuffer;
@@ -3563,23 +3564,42 @@ void *address_space_map(AddressSpace *as,
     ram_addr_t raddr = RAM_ADDR_MAX;
     ram_addr_t rlen;
     void *ret;
+    IOMMUTLBEntry iotlb;
+    target_phys_addr_t xlat;
+    AddressSpace *as_xlat;
 
     while (len > 0) {
+        xlat = addr;
+        as_xlat = as;
         page = addr & TARGET_PAGE_MASK;
         l = (page + TARGET_PAGE_SIZE) - addr;
         if (l > len)
             l = len;
         section = phys_page_find(d, page >> TARGET_PAGE_BITS);
 
+        while (section->mr->iommu_ops) {
+            iotlb = section->mr->iommu_ops->translate(section->mr, addr, is_write);
+            if (iotlb.valid) {
+                xlat = ((iotlb.translated_addr & ~iotlb.addr_mask)
+                        | (addr & iotlb.addr_mask));
+                as_xlat = section->mr->iommu_target_as;
+                l = (MIN(xlat + l - 1, xlat | iotlb.addr_mask) - xlat) + 1;
+                section = phys_page_find(as_xlat->dispatch, xlat >> TARGET_PAGE_BITS);
+            } else {
+                section = &phys_sections[phys_section_unassigned];
+            }
+        }
+
         if (!(memory_region_is_ram(section->mr) && !section->readonly)) {
             if (todo || bounce.buffer) {
                 break;
             }
             bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
-            bounce.addr = addr;
+            bounce.addr = xlat;
+            bounce.as = as_xlat;
             bounce.len = l;
             if (!is_write) {
-                address_space_read(as, addr, bounce.buffer, l);
+                address_space_read(as_xlat, xlat, bounce.buffer, l);
             }
 
             *plen = l;
@@ -3587,7 +3607,7 @@ void *address_space_map(AddressSpace *as,
         }
         if (!todo) {
             raddr = memory_region_get_ram_addr(section->mr)
-                + memory_region_section_addr(section, addr);
+                + memory_region_section_addr(section, xlat);
         }
 
         len -= l;
@@ -3632,7 +3652,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, target_phys_addr_t len,
         return;
     }
     if (is_write) {
-        address_space_write(as, bounce.addr, bounce.buffer, access_len);
+        address_space_write(bounce.as, bounce.addr, bounce.buffer, access_len);
     }
     qemu_vfree(bounce.buffer);
     bounce.buffer = NULL;
diff --git a/memory.c b/memory.c
index 5df6177..1d92bb8 100644
--- a/memory.c
+++ b/memory.c
@@ -775,6 +775,12 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr)
     qemu_ram_free(mr->ram_addr & TARGET_PAGE_MASK);
 }
 
+static void memory_region_destructor_iommu(MemoryRegion *mr)
+{
+    address_space_destroy(mr->iommu_target_as);
+    g_free(mr->iommu_target_as);
+}
+
 static bool memory_region_wrong_endianness(MemoryRegion *mr)
 {
 #ifdef TARGET_WORDS_BIGENDIAN
@@ -789,6 +795,7 @@ void memory_region_init(MemoryRegion *mr,
                         uint64_t size)
 {
     mr->ops = NULL;
+    mr->iommu_ops = NULL;
     mr->parent = NULL;
     mr->size = int128_make64(size);
     if (size == UINT64_MAX) {
@@ -980,6 +987,101 @@ void memory_region_init_rom_device(MemoryRegion *mr,
     mr->ram_addr = qemu_ram_alloc(size, mr);
 }
 
+static void memory_region_iommu_rw(MemoryRegion *iommu, target_phys_addr_t addr,
+                                   uint8_t *buf, unsigned len, bool is_write)
+{
+    IOMMUTLBEntry tlb;
+    unsigned clen;
+    target_phys_addr_t xlat;
+
+    while (len) {
+        tlb = iommu->iommu_ops->translate(iommu, addr, is_write);
+        clen = (MIN(addr | tlb.addr_mask, addr + len - 1) - addr) + 1;
+        if (tlb.valid) {
+            xlat = (tlb.translated_addr & ~tlb.addr_mask) | (addr & tlb.addr_mask);
+            address_space_rw(iommu->iommu_target_as, xlat, buf, clen, is_write);
+        } else {
+            if (!is_write) {
+                memset(buf, 0xff, clen);
+            }
+        }
+        buf += clen;
+        addr += clen;
+        len -= clen;
+    }
+}
+
+static uint64_t memory_region_iommu_read(void *opaque, target_phys_addr_t addr,
+                                         unsigned size)
+{
+    MemoryRegion *iommu = opaque;
+    union {
+        uint8_t buf[8];
+        uint8_t u8;
+        uint16_t u16;
+        uint32_t u32;
+        uint64_t u64;
+    } ret;
+
+    memory_region_iommu_rw(iommu, addr, ret.buf, size, false);
+    switch (size) {
+    case 1: return ret.u8;
+    case 2: return ret.u16;
+    case 4: return ret.u32;
+    case 8: return ret.u64;
+    default: abort();
+    }
+}
+
+static void memory_region_iommu_write(void *opaque, target_phys_addr_t addr,
+                                      uint64_t data, unsigned size)
+{
+    MemoryRegion *iommu = opaque;
+    union {
+        uint8_t buf[8];
+        uint8_t u8;
+        uint16_t u16;
+        uint32_t u32;
+        uint64_t u64;
+    } in;
+
+    switch (size) {
+    case 1: in.u8 = data; break;
+    case 2: in.u16 = data; break;
+    case 4: in.u32 = data; break;
+    case 8: in.u64 = data; break;
+    default: abort();
+    }
+    memory_region_iommu_rw(iommu, addr, in.buf, size, true);
+}
+
+static MemoryRegionOps memory_region_iommu_ops = {
+    .read = memory_region_iommu_read,
+    .write = memory_region_iommu_write,
+#ifdef HOST_BIGENDIAN
+    .endianness = DEVICE_BIG_ENDIAN,
+#else
+    .endianness = DEVICE_LITTLE_ENDIAN,
+#endif
+};
+
+void memory_region_init_iommu(MemoryRegion *mr,
+                              MemoryRegionIOMMUOps *ops,
+                              MemoryRegion *target,
+                              const char *name,
+                              uint64_t size)
+{
+    memory_region_init(mr, name, size);
+    mr->ops = &memory_region_iommu_ops;
+    mr->iommu_ops = ops,
+    mr->opaque = mr;
+    mr->terminates = true;  /* then re-forwards */
+    mr->destructor = memory_region_destructor_iommu;
+    mr->iommu_target = target;
+    mr->iommu_target_as = g_new(AddressSpace, 1);
+    address_space_init(mr->iommu_target_as, target);
+}
+
 static uint64_t invalid_read(void *opaque, target_phys_addr_t addr,
                              unsigned size)
 {
@@ -1053,6 +1155,11 @@ bool memory_region_is_rom(MemoryRegion *mr)
     return mr->ram && mr->readonly;
 }
 
+bool memory_region_is_iommu(MemoryRegion *mr)
+{
+    return mr->iommu_ops;
+}
+
 void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
 {
     uint8_t mask = 1 << client;
diff --git a/memory.h b/memory.h
index 79393f1..299d584 100644
--- a/memory.h
+++ b/memory.h
@@ -113,12 +113,29 @@ struct MemoryRegionOps {
     const MemoryRegionMmio old_mmio;
 };
 
+typedef struct IOMMUTLBEntry IOMMUTLBEntry;
+typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
+
+struct IOMMUTLBEntry {
+    target_phys_addr_t device_addr;
+    target_phys_addr_t translated_addr;
+    target_phys_addr_t addr_mask;  /* 0xfff = 4k translation */
+    bool valid;
+};
+
+struct MemoryRegionIOMMUOps {
+    /* Returns a TLB entry that contains a given address. */
+    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
+                               bool is_write);
+};
+
 typedef struct CoalescedMemoryRange CoalescedMemoryRange;
 typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
 
 struct MemoryRegion {
     /* All fields are private - violators will be prosecuted */
     const MemoryRegionOps *ops;
+    const MemoryRegionIOMMUOps *iommu_ops;
     void *opaque;
     MemoryRegion *parent;
     Int128 size;
@@ -145,6 +162,8 @@ struct MemoryRegion {
     uint8_t dirty_log_mask;
     unsigned ioeventfd_nb;
     MemoryRegionIoeventfd *ioeventfds;
+    MemoryRegion *iommu_target;
+    struct AddressSpace *iommu_target_as;
 };
 
 struct MemoryRegionPortio {
@@ -334,6 +353,24 @@ void memory_region_init_rom_device(MemoryRegion *mr,
 void memory_region_init_reservation(MemoryRegion *mr,
                                     const char *name,
                                     uint64_t size);
+
+/**
+ * memory_region_init_iommu: Initialize a memory region that translates addresses
+ *
+ * An IOMMU region translates addresses and forwards accesses to a target memory region.
+ *
+ * @mr: the #MemoryRegion to be initialized
+ * @ops: a function that translates addresses into the @target region
+ * @target: a #MemoryRegion that will be used to satisfy accesses to translated addresses
+ * @name: used for debugging; not visible to the user or ABI
+ * @size: size of the region.
+ */
+void memory_region_init_iommu(MemoryRegion *mr,
+                              MemoryRegionIOMMUOps *ops,
+                              MemoryRegion *target,
+                              const char *name,
+                              uint64_t size);
+
 /**
  * memory_region_destroy: Destroy a memory region and reclaim all resources.
  *
@@ -373,6 +410,15 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
 }
 
 /**
+ * memory_region_is_ram: check whether a memory region is an iommu
+ *
+ * Returns %true is a memory region is an iommu.
+ *
+ * @mr: the memory region being queried
+ */
+bool memory_region_is_iommu(MemoryRegion *mr);
+
+/**
  * memory_region_name: get a memory region's name
  *
  * Returns the string that was used to initialize the memory region.
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
                   ` (2 preceding siblings ...)
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 3/7] memory: iommu support Avi Kivity
@ 2012-10-11 13:27 ` Avi Kivity
  2012-10-11 13:53   ` Paolo Bonzini
  2012-10-13  9:13   ` Blue Swirl
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 5/7] i440fx: add an iommu Avi Kivity
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:27 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

Instead of requesting a DMAContext from the bus implementation, use a
MemoryRegion.  This can be initialized using memory_region_init_iommu()
(or memory_region_init_alias() for simple, static translations).

Add a destructor, since setups that have per-device translations will
need to return a different iommu region for every device.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/pci.c           | 59 +++++++++++++++++++++++++++++++++---------------------
 hw/pci.h           |  7 +++++--
 hw/pci_internals.h |  5 +++--
 hw/spapr.h         |  2 ++
 hw/spapr_iommu.c   | 35 ++++++++++++++------------------
 hw/spapr_pci.c     | 26 ++++++++++++++++++++----
 hw/spapr_pci.h     |  1 +
 7 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 7adf61b..02e1989 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -274,6 +274,21 @@ int pci_find_domain(const PCIBus *bus)
     return -1;
 }
 
+static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
+{
+    MemoryRegion *mr = g_new(MemoryRegion, 1);
+
+    /* FIXME: inherit memory region from bus creator */
+    memory_region_init_alias(mr, "iommu-nop", get_system_memory(), 0, INT64_MAX);
+    return mr;
+}
+
+static void pci_default_iommu_dtor(MemoryRegion *mr)
+{
+    memory_region_destroy(mr);
+    g_free(mr);
+}
+
 void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
                          const char *name,
                          MemoryRegion *address_space_mem,
@@ -285,6 +300,7 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
     bus->devfn_min = devfn_min;
     bus->address_space_mem = address_space_mem;
     bus->address_space_io = address_space_io;
+    pci_setup_iommu(bus, pci_default_iommu, pci_default_iommu_dtor, NULL);
 
     /* host bridge */
     QLIST_INIT(&bus->child);
@@ -775,21 +791,16 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
                      PCI_SLOT(devfn), PCI_FUNC(devfn), name, bus->devices[devfn]->name);
         return NULL;
     }
+
     pci_dev->bus = bus;
-    if (bus->dma_context_fn) {
-        pci_dev->dma = bus->dma_context_fn(bus, bus->dma_context_opaque, devfn);
-    } else {
-        /* FIXME: Make dma_context_fn use MemoryRegions instead, so this path is
-         * taken unconditionally */
-        /* FIXME: inherit memory region from bus creator */
-        memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
-                                 get_system_memory(), 0,
-                                 memory_region_size(get_system_memory()));
-        memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
-        address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
-        pci_dev->dma = g_new(DMAContext, 1);
-        dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
-    }
+    pci_dev->iommu = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
+    memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
+                             pci_dev->iommu, 0, memory_region_size(pci_dev->iommu));
+    memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
+    address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
+    pci_dev->dma = g_new(DMAContext, 1);
+    dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
+
     pci_dev->devfn = devfn;
     pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
     pci_dev->irq_state = 0;
@@ -843,12 +854,12 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
     pci_dev->bus->devices[pci_dev->devfn] = NULL;
     pci_config_free(pci_dev);
 
-    if (!pci_dev->bus->dma_context_fn) {
-        address_space_destroy(&pci_dev->bus_master_as);
-        memory_region_destroy(&pci_dev->bus_master_enable_region);
-        g_free(pci_dev->dma);
-        pci_dev->dma = NULL;
-    }
+    address_space_destroy(&pci_dev->bus_master_as);
+    memory_region_del_subregion(&pci_dev->bus_master_enable_region, pci_dev->iommu);
+    pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
+    memory_region_destroy(&pci_dev->bus_master_enable_region);
+    g_free(pci_dev->dma);
+    pci_dev->dma = NULL;
 }
 
 static void pci_unregister_io_regions(PCIDevice *pci_dev)
@@ -2092,10 +2103,12 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
     k->props = pci_props;
 }
 
-void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque)
+void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
+                     void *opaque)
 {
-    bus->dma_context_fn = fn;
-    bus->dma_context_opaque = opaque;
+    bus->iommu_fn = fn;
+    bus->iommu_dtor_fn = dtor;
+    bus->iommu_opaque = opaque;
 }
 
 static TypeInfo pci_device_type_info = {
diff --git a/hw/pci.h b/hw/pci.h
index a65e490..370354a 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -213,6 +213,7 @@ struct PCIDevice {
     PCIIORegion io_regions[PCI_NUM_REGIONS];
     AddressSpace bus_master_as;
     MemoryRegion bus_master_enable_region;
+    MemoryRegion *iommu;
     DMAContext *dma;
 
     /* do not access the following fields */
@@ -351,9 +352,11 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
 
 void pci_device_deassert_intx(PCIDevice *dev);
 
-typedef DMAContext *(*PCIDMAContextFunc)(PCIBus *, void *, int);
+typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
+typedef void (*PCIIOMMUDestructorFunc)(MemoryRegion *mr);
 
-void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque);
+void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
+                     void *opaque);
 
 static inline void
 pci_set_byte(uint8_t *config, uint8_t val)
diff --git a/hw/pci_internals.h b/hw/pci_internals.h
index c931b64..040508f 100644
--- a/hw/pci_internals.h
+++ b/hw/pci_internals.h
@@ -17,8 +17,9 @@
 
 struct PCIBus {
     BusState qbus;
-    PCIDMAContextFunc dma_context_fn;
-    void *dma_context_opaque;
+    PCIIOMMUFunc iommu_fn;
+    PCIIOMMUDestructorFunc iommu_dtor_fn;
+    void *iommu_opaque;
     uint8_t devfn_min;
     pci_set_irq_fn set_irq;
     pci_map_irq_fn map_irq;
diff --git a/hw/spapr.h b/hw/spapr.h
index ac34a17..452efba1 100644
--- a/hw/spapr.h
+++ b/hw/spapr.h
@@ -334,6 +334,8 @@ typedef struct sPAPRTCE {
 #define SPAPR_PCI_BASE_LIOBN    0x80000000
 
 void spapr_iommu_init(void);
+IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
+                                  bool is_write);
 DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size);
 void spapr_tce_free(DMAContext *dma);
 int spapr_dma_dt(void *fdt, int node_off, const char *propname,
diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
index 54798a3..79e35d1 100644
--- a/hw/spapr_iommu.c
+++ b/hw/spapr_iommu.c
@@ -63,15 +63,11 @@ static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
     return NULL;
 }
 
-static int spapr_tce_translate(DMAContext *dma,
-                               dma_addr_t addr,
-                               target_phys_addr_t *paddr,
-                               target_phys_addr_t *len,
-                               DMADirection dir)
+IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
+                                  bool is_write)
 {
     sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
-    enum sPAPRTCEAccess access = (dir == DMA_DIRECTION_FROM_DEVICE)
-        ? SPAPR_TCE_WO : SPAPR_TCE_RO;
+    enum sPAPRTCEAccess access = is_write ? SPAPR_TCE_WO : SPAPR_TCE_RO;
     uint64_t tce;
 
 #ifdef DEBUG_TCE
@@ -84,29 +80,27 @@ static int spapr_tce_translate(DMAContext *dma,
 #ifdef DEBUG_TCE
         fprintf(stderr, "spapr_tce_translate out of bounds\n");
 #endif
-        return -EFAULT;
+        return (IOMMUTLBEntry) { .valid = false };
     }
 
     tce = tcet->table[addr >> SPAPR_TCE_PAGE_SHIFT].tce;
 
     /* Check TCE */
     if (!(tce & access)) {
-        return -EPERM;
+        return (IOMMUTLBEntry) { .valid = false };
     }
 
-    /* How much til end of page ? */
-    *len = ((~addr) & SPAPR_TCE_PAGE_MASK) + 1;
-
-    /* Translate */
-    *paddr = (tce & ~SPAPR_TCE_PAGE_MASK) |
-        (addr & SPAPR_TCE_PAGE_MASK);
-
 #ifdef DEBUG_TCE
     fprintf(stderr, " ->  *paddr=0x" TARGET_FMT_plx ", *len=0x"
-            TARGET_FMT_plx "\n", *paddr, *len);
+            TARGET_FMT_plx "\n", (tce & ~SPAPR_TCE_PAGE_MASK), SPAPR_TCE_PAGE_MASK + 1);
 #endif
 
-    return 0;
+    return (IOMMUTLBEntry) {
+        .device_addr = addr & SPAPR_TCE_PAGE_MASK,
+        .translated_addr = (tce & ~SPAPR_TCE_PAGE_MASK),
+        .addr_mask = SPAPR_TCE_PAGE_MASK,
+        .valid = true,
+    };
 }
 
 DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
@@ -118,7 +112,7 @@ DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
     }
 
     tcet = g_malloc0(sizeof(*tcet));
-    dma_context_init(&tcet->dma, &address_space_memory, spapr_tce_translate, NULL, NULL);
+    dma_context_init(&tcet->dma, &address_space_memory, NULL, NULL, NULL);
 
     tcet->liobn = liobn;
     tcet->window_size = window_size;
@@ -253,7 +247,8 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
         return 0;
     }
 
-    if (iommu->translate == spapr_tce_translate) {
+    /* FIXME: WHAT?? */
+    if (true) {
         sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, iommu);
         return spapr_dma_dt(fdt, node_off, propname,
                 tcet->liobn, 0, tcet->window_size);
diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
index 661c05b..24f5b46 100644
--- a/hw/spapr_pci.c
+++ b/hw/spapr_pci.c
@@ -506,14 +506,30 @@ static void spapr_msi_write(void *opaque, target_phys_addr_t addr,
 /*
  * PHB PCI device
  */
-static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
-                                            int devfn)
+static MemoryRegion *spapr_pci_dma_iommu_new(PCIBus *bus, void *opaque, int devfn)
 {
     sPAPRPHBState *phb = opaque;
 
-    return phb->dma;
+    return &phb->iommu;
 }
 
+static void spapr_pci_dma_iommu_destroy(MemoryRegion *iommu)
+{
+    /* iommu is shared among devices, do nothing */
+}
+
+static IOMMUTLBEntry spapr_phb_translate(MemoryRegion *iommu, target_phys_addr_t addr,
+                                         bool is_write)
+{
+    sPAPRPHBState *phb = container_of(iommu, sPAPRPHBState, iommu);
+
+    return spapr_tce_translate(phb->dma, addr, is_write);
+}
+
+static MemoryRegionIOMMUOps spapr_iommu_ops = {
+    .translate = spapr_phb_translate,
+};
+
 static int spapr_phb_init(SysBusDevice *s)
 {
     sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
@@ -576,7 +592,9 @@ static int spapr_phb_init(SysBusDevice *s)
     sphb->dma_window_start = 0;
     sphb->dma_window_size = 0x40000000;
     sphb->dma = spapr_tce_new_dma_context(sphb->dma_liobn, sphb->dma_window_size);
-    pci_setup_iommu(bus, spapr_pci_dma_context_fn, sphb);
+    memory_region_init_iommu(&sphb->iommu, &spapr_iommu_ops, get_system_memory(),
+                             "iommu-spapr", INT64_MAX);
+    pci_setup_iommu(bus, spapr_pci_dma_iommu_new, spapr_pci_dma_iommu_destroy, sphb);
 
     QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
 
diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
index 670dc62..171db45 100644
--- a/hw/spapr_pci.h
+++ b/hw/spapr_pci.h
@@ -45,6 +45,7 @@ typedef struct sPAPRPHBState {
     target_phys_addr_t mem_win_addr, mem_win_size, io_win_addr, io_win_size;
     target_phys_addr_t msi_win_addr;
     MemoryRegion memwindow, iowindow, msiwindow;
+    MemoryRegion iommu;
 
     uint32_t dma_liobn;
     uint64_t dma_window_start;
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 5/7] i440fx: add an iommu
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
                   ` (3 preceding siblings ...)
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API Avi Kivity
@ 2012-10-11 13:27 ` Avi Kivity
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 6/7] vfio: abort if an emulated iommu is used Avi Kivity
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:27 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

This iommu encrypts addresses on the device bus to avoid divuling information
to hackers equipped with bus analyzers.  Following 3DES, addresses are encrypted
multiple times.  A XOR cypher is employed for efficiency.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/piix_pci.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/hw/piix_pci.c b/hw/piix_pci.c
index 537fc19..33c2587 100644
--- a/hw/piix_pci.c
+++ b/hw/piix_pci.c
@@ -30,6 +30,7 @@
 #include "sysbus.h"
 #include "range.h"
 #include "xen.h"
+#include "exec-memory.h"
 
 /*
  * I440FX chipset data sheet.
@@ -252,6 +253,78 @@ static int i440fx_initfn(PCIDevice *dev)
     return 0;
 }
 
+typedef struct SillyIOMMU SillyIOMMU;
+
+struct SillyIOMMU {
+    MemoryRegion l1;
+    MemoryRegion l2;
+    target_phys_addr_t mask;
+    target_phys_addr_t secret;
+};
+
+static IOMMUTLBEntry silly_l1_translate(MemoryRegion *l1, target_phys_addr_t addr,
+                                        bool is_write)
+{
+    SillyIOMMU *s = container_of(l1, SillyIOMMU, l1);
+    target_phys_addr_t xlat = addr ^ s->secret;
+
+    printf("l1: %" TARGET_PRIxPHYS " -> %" TARGET_PRIxPHYS "\n", addr, xlat);
+
+    return (IOMMUTLBEntry) {
+        .device_addr = addr & ~s->mask,
+        .translated_addr = xlat & ~s->mask,
+        .addr_mask = s->mask,
+        .valid = true,
+    };
+}
+
+static MemoryRegionIOMMUOps silly_l1_iommu_ops = {
+    .translate = silly_l1_translate,
+};
+
+static IOMMUTLBEntry silly_l2_translate(MemoryRegion *l2, target_phys_addr_t addr,
+                                        bool is_write)
+{
+    SillyIOMMU *s = container_of(l2, SillyIOMMU, l2);
+    target_phys_addr_t xlat = addr ^ s->secret;
+
+    printf("l2: %" TARGET_PRIxPHYS " -> %" TARGET_PRIxPHYS "\n", addr, xlat);
+
+    return (IOMMUTLBEntry) {
+        .device_addr = addr & ~s->mask,
+        .translated_addr = xlat & ~s->mask,
+        .addr_mask = s->mask,
+        .valid = true,
+    };
+}
+
+static MemoryRegionIOMMUOps silly_l2_iommu_ops = {
+    .translate = silly_l2_translate,
+};
+
+static MemoryRegion *silly_iommu_new(PCIBus *bus, void *opaque, int devfn)
+{
+    SillyIOMMU *s = g_new(SillyIOMMU, 1);
+    MemoryRegion *sysmem = get_system_memory();
+
+    s->mask = (0x1000 << (devfn >> 3)) - 1;
+    s->secret = (((devfn << 24) | 0x00aabbccdd) & ~s->mask) * (devfn >= 3 * 8);
+    memory_region_init_iommu(&s->l2, &silly_l2_iommu_ops, sysmem, "silly-l2", INT64_MAX);
+    memory_region_init_iommu(&s->l1, &silly_l1_iommu_ops, &s->l2, "silly-l1", INT64_MAX);
+    return &s->l1;
+}
+
+static void silly_iommu_del(MemoryRegion *l1)
+{
+    SillyIOMMU *s = container_of(l1, SillyIOMMU, l1);
+
+    memory_region_del_subregion(&s->l2, get_system_memory());
+    memory_region_del_subregion(&s->l1, &s->l2);
+    memory_region_destroy(&s->l2);
+    memory_region_destroy(&s->l1);
+    g_free(s);
+}
+
 static PCIBus *i440fx_common_init(const char *device_name,
                                   PCII440FXState **pi440fx_state,
                                   int *piix3_devfn,
@@ -278,6 +351,7 @@ static PCIBus *i440fx_common_init(const char *device_name,
     s->address_space = address_space_mem;
     b = pci_bus_new(dev, NULL, pci_address_space,
                     address_space_io, 0);
+    pci_setup_iommu(b, silly_iommu_new, silly_iommu_del, NULL);
     s->bus = b;
     object_property_add_child(qdev_get_machine(), "i440fx", OBJECT(dev), NULL);
     qdev_init_nofail(dev);
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 6/7] vfio: abort if an emulated iommu is used
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
                   ` (4 preceding siblings ...)
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 5/7] i440fx: add an iommu Avi Kivity
@ 2012-10-11 13:27 ` Avi Kivity
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 7/7] vhost: " Avi Kivity
  2012-10-12  2:36 ` [Qemu-devel] [RFC v1 0/7] IOMMU support Benjamin Herrenschmidt
  7 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:27 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

vfio doesn't support guest iommus yet, indicate it to the user
by gently depositing a core on their disk.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/vfio_pci.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
index e9399a1..f935d00 100644
--- a/hw/vfio_pci.c
+++ b/hw/vfio_pci.c
@@ -817,6 +817,8 @@ static void vfio_listener_region_add(MemoryListener *listener,
     void *vaddr;
     int ret;
 
+    assert(!memory_region_is_iommu(section.mr));
+
     if (vfio_listener_skipped_section(section)) {
         DPRINTF("vfio: SKIPPING region_add %"TARGET_PRIxPHYS" - %"PRIx64"\n",
                 section->offset_within_address_space,
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
                   ` (5 preceding siblings ...)
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 6/7] vfio: abort if an emulated iommu is used Avi Kivity
@ 2012-10-11 13:27 ` Avi Kivity
  2012-10-11 13:31   ` Michael S. Tsirkin
  2012-10-12  2:36 ` [Qemu-devel] [RFC v1 0/7] IOMMU support Benjamin Herrenschmidt
  7 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:27 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

vhost doesn't support guest iommus yet, indicate it to the user
by gently depositing a core on their disk.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/vhost.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vhost.c b/hw/vhost.c
index 0b4ac3f..cd5d9f5 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -451,6 +451,8 @@ static void vhost_region_add(MemoryListener *listener,
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
                                          memory_listener);
 
+    assert(!memory_region_is_iommu(section.mr));
+
     if (!vhost_section(section)) {
         return;
     }
-- 
1.7.12

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction Avi Kivity
@ 2012-10-11 13:31   ` Paolo Bonzini
  2012-10-11 13:33     ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2012-10-11 13:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori

Il 11/10/2012 15:26, Avi Kivity ha scritto:
> A couple of fields were left uninitialized.  This was not observed earlier
> because all address spaces were statically allocated.  Also free allocation
> for those fields.

Patch is obvious, but there are at least two alternatives: 1) using
memset 2) using a compound initializer.

Paolo

> Signed-off-by: Avi Kivity <avi@redhat.com>
> ---
>  memory.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/memory.c b/memory.c
> index 2f68d67..5df6177 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -1538,6 +1538,8 @@ void address_space_init(AddressSpace *as, MemoryRegion *root)
>      as->root = root;
>      as->current_map = g_new(FlatView, 1);
>      flatview_init(as->current_map);
> +    as->ioeventfd_nb = 0;
> +    as->ioeventfds = NULL;
>      QTAILQ_INSERT_TAIL(&address_spaces, as, address_spaces_link);
>      as->name = NULL;
>      memory_region_transaction_commit();
> @@ -1554,6 +1556,7 @@ void address_space_destroy(AddressSpace *as)
>      address_space_destroy_dispatch(as);
>      flatview_destroy(as->current_map);
>      g_free(as->current_map);
> +    g_free(as->ioeventfds);
>  }
>  
>  uint64_t io_mem_read(MemoryRegion *mr, target_phys_addr_t addr, unsigned size)
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 7/7] vhost: " Avi Kivity
@ 2012-10-11 13:31   ` Michael S. Tsirkin
  2012-10-11 13:34     ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-10-11 13:31 UTC (permalink / raw)
  To: Avi Kivity
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
> vhost doesn't support guest iommus yet, indicate it to the user
> by gently depositing a core on their disk.
> 
> Signed-off-by: Avi Kivity <avi@redhat.com>

Actually there is no problem. virtio bypasses an IOMMU,
so vhost works fine by writing into guest memory directly.

So I don't think we need this patch.

> ---
>  hw/vhost.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/hw/vhost.c b/hw/vhost.c
> index 0b4ac3f..cd5d9f5 100644
> --- a/hw/vhost.c
> +++ b/hw/vhost.c
> @@ -451,6 +451,8 @@ static void vhost_region_add(MemoryListener *listener,
>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>                                           memory_listener);
>  
> +    assert(!memory_region_is_iommu(section.mr));
> +
>      if (!vhost_section(section)) {
>          return;
>      }
> -- 
> 1.7.12

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction
  2012-10-11 13:31   ` Paolo Bonzini
@ 2012-10-11 13:33     ` Avi Kivity
  2012-10-13  9:14       ` Blue Swirl
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:33 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori

On 10/11/2012 03:31 PM, Paolo Bonzini wrote:
> Il 11/10/2012 15:26, Avi Kivity ha scritto:
>> A couple of fields were left uninitialized.  This was not observed earlier
>> because all address spaces were statically allocated.  Also free allocation
>> for those fields.
> 
> Patch is obvious, but there are at least two alternatives: 1) using
> memset 2) using a compound initializer.

3) std::vector<>, which also covers address_space_destroy().

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:31   ` Michael S. Tsirkin
@ 2012-10-11 13:34     ` Avi Kivity
  2012-10-11 13:44       ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/11/2012 03:31 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
>> vhost doesn't support guest iommus yet, indicate it to the user
>> by gently depositing a core on their disk.
>> 
>> Signed-off-by: Avi Kivity <avi@redhat.com>
> 
> Actually there is no problem. virtio bypasses an IOMMU,
> so vhost works fine by writing into guest memory directly.
> 
> So I don't think we need this patch.

The pci subsystem should set up the iommu so that it ignores virtio
devices.  If it does, an emulated iommu will not reach vhost.  If it
doesn't, then it will, and the assert() will alert us that we have a bug.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 3/7] memory: iommu support Avi Kivity
@ 2012-10-11 13:42   ` Paolo Bonzini
  2012-10-11 13:45     ` Avi Kivity
  2012-10-12  2:45     ` Benjamin Herrenschmidt
  2012-10-11 14:29   ` Avi Kivity
  1 sibling, 2 replies; 41+ messages in thread
From: Paolo Bonzini @ 2012-10-11 13:42 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, David Gibson

Il 11/10/2012 15:26, Avi Kivity ha scritto:
> +struct MemoryRegionIOMMUOps {
> +    /* Returns a TLB entry that contains a given address. */
> +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
> +                               bool is_write);
> +};

Do map/unmap still make sense in this model?  Ben & David, what were
your plans there?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:44       ` Michael S. Tsirkin
@ 2012-10-11 13:44         ` Avi Kivity
  2012-10-11 14:35           ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/11/2012 03:44 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2012 at 03:34:54PM +0200, Avi Kivity wrote:
>> On 10/11/2012 03:31 PM, Michael S. Tsirkin wrote:
>> > On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
>> >> vhost doesn't support guest iommus yet, indicate it to the user
>> >> by gently depositing a core on their disk.
>> >> 
>> >> Signed-off-by: Avi Kivity <avi@redhat.com>
>> > 
>> > Actually there is no problem. virtio bypasses an IOMMU,
>> > so vhost works fine by writing into guest memory directly.
>> > 
>> > So I don't think we need this patch.
>> 
>> The pci subsystem should set up the iommu so that it ignores virtio
>> devices.  If it does, an emulated iommu will not reach vhost.  If it
>> doesn't, then it will, and the assert() will alert us that we have a bug.
> 
> You mean pci subsystem in the guest? I'm pretty sure that's not
> the case at the moment: iommu is on by default and applies
> to all devices unless you do something special.
> I see where you are coming from but it does
> not look right to break all existing guests.

No, qemu should configure virtio devices to bypass the iommu, even if it
is on.

> Also - I see no reason to single out vhost - I think same applies with
> any virtio device, since it doesn't use the DMA API.

True.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:34     ` Avi Kivity
@ 2012-10-11 13:44       ` Michael S. Tsirkin
  2012-10-11 13:44         ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-10-11 13:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 03:34:54PM +0200, Avi Kivity wrote:
> On 10/11/2012 03:31 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
> >> vhost doesn't support guest iommus yet, indicate it to the user
> >> by gently depositing a core on their disk.
> >> 
> >> Signed-off-by: Avi Kivity <avi@redhat.com>
> > 
> > Actually there is no problem. virtio bypasses an IOMMU,
> > so vhost works fine by writing into guest memory directly.
> > 
> > So I don't think we need this patch.
> 
> The pci subsystem should set up the iommu so that it ignores virtio
> devices.  If it does, an emulated iommu will not reach vhost.  If it
> doesn't, then it will, and the assert() will alert us that we have a bug.

You mean pci subsystem in the guest? I'm pretty sure that's not
the case at the moment: iommu is on by default and applies
to all devices unless you do something special.
I see where you are coming from but it does
not look right to break all existing guests.

Also - I see no reason to single out vhost - I think same applies with
any virtio device, since it doesn't use the DMA API.

> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:42   ` Paolo Bonzini
@ 2012-10-11 13:45     ` Avi Kivity
  2012-10-11 13:54       ` Paolo Bonzini
  2012-10-12  2:45     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, David Gibson

On 10/11/2012 03:42 PM, Paolo Bonzini wrote:
> Il 11/10/2012 15:26, Avi Kivity ha scritto:
>> +struct MemoryRegionIOMMUOps {
>> +    /* Returns a TLB entry that contains a given address. */
>> +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
>> +                               bool is_write);
>> +};
> 
> Do map/unmap still make sense in this model?  Ben & David, what were
> your plans there?
> 

Map/unmap is supported via address_space_map(), which calls
->translate().  I don't see how a lower-level map/unmap helps, unless
the hardware supplies such a function.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API Avi Kivity
@ 2012-10-11 13:53   ` Paolo Bonzini
  2012-10-11 13:56     ` Avi Kivity
  2012-10-13  9:13   ` Blue Swirl
  1 sibling, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2012-10-11 13:53 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori

Il 11/10/2012 15:27, Avi Kivity ha scritto:
> -static int spapr_tce_translate(DMAContext *dma,
> -                               dma_addr_t addr,
> -                               target_phys_addr_t *paddr,
> -                               target_phys_addr_t *len,
> -                               DMADirection dir)
> +IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
> +                                  bool is_write)
>  {
>      sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
> -    enum sPAPRTCEAccess access = (dir == DMA_DIRECTION_FROM_DEVICE)
> -        ? SPAPR_TCE_WO : SPAPR_TCE_RO;
> +    enum sPAPRTCEAccess access = is_write ? SPAPR_TCE_WO : SPAPR_TCE_RO;
>      uint64_t tce;
>  
>  #ifdef DEBUG_TCE
> @@ -84,29 +80,27 @@ static int spapr_tce_translate(DMAContext *dma,
>  #ifdef DEBUG_TCE
>          fprintf(stderr, "spapr_tce_translate out of bounds\n");
>  #endif
> -        return -EFAULT;
> +        return (IOMMUTLBEntry) { .valid = false };
>      }
>  
>      tce = tcet->table[addr >> SPAPR_TCE_PAGE_SHIFT].tce;
>  
>      /* Check TCE */
>      if (!(tce & access)) {
> -        return -EPERM;
> +        return (IOMMUTLBEntry) { .valid = false };
>      }
>  
> -    /* How much til end of page ? */
> -    *len = ((~addr) & SPAPR_TCE_PAGE_MASK) + 1;
> -
> -    /* Translate */
> -    *paddr = (tce & ~SPAPR_TCE_PAGE_MASK) |
> -        (addr & SPAPR_TCE_PAGE_MASK);
> -
>  #ifdef DEBUG_TCE
>      fprintf(stderr, " ->  *paddr=0x" TARGET_FMT_plx ", *len=0x"
> -            TARGET_FMT_plx "\n", *paddr, *len);
> +            TARGET_FMT_plx "\n", (tce & ~SPAPR_TCE_PAGE_MASK), SPAPR_TCE_PAGE_MASK + 1);
>  #endif
>  
> -    return 0;
> +    return (IOMMUTLBEntry) {
> +        .device_addr = addr & SPAPR_TCE_PAGE_MASK,
> +        .translated_addr = (tce & ~SPAPR_TCE_PAGE_MASK),
> +        .addr_mask = SPAPR_TCE_PAGE_MASK,
> +        .valid = true,
> +    };
>  }
>  
>  DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
> @@ -118,7 +112,7 @@ DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
>      }
>  
>      tcet = g_malloc0(sizeof(*tcet));
> -    dma_context_init(&tcet->dma, &address_space_memory, spapr_tce_translate, NULL, NULL);
> +    dma_context_init(&tcet->dma, &address_space_memory, NULL, NULL, NULL);
>  
>      tcet->liobn = liobn;
>      tcet->window_size = window_size;
> @@ -253,7 +247,8 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>          return 0;
>      }
>  
> -    if (iommu->translate == spapr_tce_translate) {
> +    /* FIXME: WHAT?? */
> +    if (true) {
>          sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, iommu);
>          return spapr_dma_dt(fdt, node_off, propname,
>                  tcet->liobn, 0, tcet->window_size);
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 661c05b..24f5b46 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -506,14 +506,30 @@ static void spapr_msi_write(void *opaque, target_phys_addr_t addr,
>  /*
>   * PHB PCI device
>   */
> -static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
> -                                            int devfn)
> +static MemoryRegion *spapr_pci_dma_iommu_new(PCIBus *bus, void *opaque, int devfn)
>  {
>      sPAPRPHBState *phb = opaque;
>  
> -    return phb->dma;
> +    return &phb->iommu;
>  }
>  
> +static void spapr_pci_dma_iommu_destroy(MemoryRegion *iommu)
> +{
> +    /* iommu is shared among devices, do nothing */
> +}
> +
> +static IOMMUTLBEntry spapr_phb_translate(MemoryRegion *iommu, target_phys_addr_t addr,
> +                                         bool is_write)
> +{
> +    sPAPRPHBState *phb = container_of(iommu, sPAPRPHBState, iommu);
> +
> +    return spapr_tce_translate(phb->dma, addr, is_write);
> +}

IIUC, sPAPRTCETable could drop the DMAContext field and just include the
actual translation info.  spapr_tce_new_dma_context becomes
spapr_tce_new and returns an opaque sPAPRTCETable struct that is passed
back to spapr_tce_translate.

And DMAContext can disappear and will be replaced with just an
AddressSpace *.

I like it!

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:45     ` Avi Kivity
@ 2012-10-11 13:54       ` Paolo Bonzini
  2012-10-11 13:57         ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Paolo Bonzini @ 2012-10-11 13:54 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, David Gibson

Il 11/10/2012 15:45, Avi Kivity ha scritto:
>>> >> +struct MemoryRegionIOMMUOps {
>>> >> +    /* Returns a TLB entry that contains a given address. */
>>> >> +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
>>> >> +                               bool is_write);
>>> >> +};
>> > 
>> > Do map/unmap still make sense in this model?  Ben & David, what were
>> > your plans there?
>> > 
> Map/unmap is supported via address_space_map(), which calls
> ->translate().  I don't see how a lower-level map/unmap helps, unless
> the hardware supplies such a function.

Yep, it's just the map/unmap callbacks that are not supported anymore,
but nobody uses that feature of DMAContext yet.

Paolo

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API
  2012-10-11 13:53   ` Paolo Bonzini
@ 2012-10-11 13:56     ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Alexander Graf, Michael S. Tsirkin, liu ping fan, qemu-devel,
	Blue Swirl, Alex Williamson, Anthony Liguori

On 10/11/2012 03:53 PM, Paolo Bonzini wrote:
> Il 11/10/2012 15:27, Avi Kivity ha scritto:
>> -static int spapr_tce_translate(DMAContext *dma,
>> -                               dma_addr_t addr,
>> -                               target_phys_addr_t *paddr,
>> -                               target_phys_addr_t *len,
>> -                               DMADirection dir)
>> +IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
>> +                                  bool is_write)
>>  {
>>      sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
>> -    enum sPAPRTCEAccess access = (dir == DMA_DIRECTION_FROM_DEVICE)
>> -        ? SPAPR_TCE_WO : SPAPR_TCE_RO;
>> +    enum sPAPRTCEAccess access = is_write ? SPAPR_TCE_WO : SPAPR_TCE_RO;
>>      uint64_t tce;
>>  
>>  #ifdef DEBUG_TCE
>> @@ -84,29 +80,27 @@ static int spapr_tce_translate(DMAContext *dma,
>>  #ifdef DEBUG_TCE
>>          fprintf(stderr, "spapr_tce_translate out of bounds\n");
>>  #endif
>> -        return -EFAULT;
>> +        return (IOMMUTLBEntry) { .valid = false };
>>      }
>>  
>>      tce = tcet->table[addr >> SPAPR_TCE_PAGE_SHIFT].tce;
>>  
>>      /* Check TCE */
>>      if (!(tce & access)) {
>> -        return -EPERM;
>> +        return (IOMMUTLBEntry) { .valid = false };
>>      }
>>  
>> -    /* How much til end of page ? */
>> -    *len = ((~addr) & SPAPR_TCE_PAGE_MASK) + 1;
>> -
>> -    /* Translate */
>> -    *paddr = (tce & ~SPAPR_TCE_PAGE_MASK) |
>> -        (addr & SPAPR_TCE_PAGE_MASK);
>> -
>>  #ifdef DEBUG_TCE
>>      fprintf(stderr, " ->  *paddr=0x" TARGET_FMT_plx ", *len=0x"
>> -            TARGET_FMT_plx "\n", *paddr, *len);
>> +            TARGET_FMT_plx "\n", (tce & ~SPAPR_TCE_PAGE_MASK), SPAPR_TCE_PAGE_MASK + 1);
>>  #endif
>>  
>> -    return 0;
>> +    return (IOMMUTLBEntry) {
>> +        .device_addr = addr & SPAPR_TCE_PAGE_MASK,
>> +        .translated_addr = (tce & ~SPAPR_TCE_PAGE_MASK),
>> +        .addr_mask = SPAPR_TCE_PAGE_MASK,
>> +        .valid = true,
>> +    };
>>  }
>>  
>>  DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
>> @@ -118,7 +112,7 @@ DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
>>      }
>>  
>>      tcet = g_malloc0(sizeof(*tcet));
>> -    dma_context_init(&tcet->dma, &address_space_memory, spapr_tce_translate, NULL, NULL);
>> +    dma_context_init(&tcet->dma, &address_space_memory, NULL, NULL, NULL);
>>  
>>      tcet->liobn = liobn;
>>      tcet->window_size = window_size;
>> @@ -253,7 +247,8 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>>          return 0;
>>      }
>>  
>> -    if (iommu->translate == spapr_tce_translate) {
>> +    /* FIXME: WHAT?? */
>> +    if (true) {
>>          sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, iommu);
>>          return spapr_dma_dt(fdt, node_off, propname,
>>                  tcet->liobn, 0, tcet->window_size);
>> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
>> index 661c05b..24f5b46 100644
>> --- a/hw/spapr_pci.c
>> +++ b/hw/spapr_pci.c
>> @@ -506,14 +506,30 @@ static void spapr_msi_write(void *opaque, target_phys_addr_t addr,
>>  /*
>>   * PHB PCI device
>>   */
>> -static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
>> -                                            int devfn)
>> +static MemoryRegion *spapr_pci_dma_iommu_new(PCIBus *bus, void *opaque, int devfn)
>>  {
>>      sPAPRPHBState *phb = opaque;
>>  
>> -    return phb->dma;
>> +    return &phb->iommu;
>>  }
>>  
>> +static void spapr_pci_dma_iommu_destroy(MemoryRegion *iommu)
>> +{
>> +    /* iommu is shared among devices, do nothing */
>> +}
>> +
>> +static IOMMUTLBEntry spapr_phb_translate(MemoryRegion *iommu, target_phys_addr_t addr,
>> +                                         bool is_write)
>> +{
>> +    sPAPRPHBState *phb = container_of(iommu, sPAPRPHBState, iommu);
>> +
>> +    return spapr_tce_translate(phb->dma, addr, is_write);
>> +}
> 
> IIUC, sPAPRTCETable could drop the DMAContext field and just include the
> actual translation info.  spapr_tce_new_dma_context becomes
> spapr_tce_new and returns an opaque sPAPRTCETable struct that is passed
> back to spapr_tce_translate.

Maybe, I just typed in code until it compiled, then forgot to copy Alex.
 I have no idea how to test it, so I kept the changes minimal.

> 
> And DMAContext can disappear and will be replaced with just an
> AddressSpace *.

Indeed that's the plan.  address_space_map() and dma_memory_map() simply
duplicate each other.  The only thing remaining is dma_memory_set().

> I like it!

Me too, except when I'm coding it.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:54       ` Paolo Bonzini
@ 2012-10-11 13:57         ` Avi Kivity
  2012-10-12  2:51           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 13:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, David Gibson

On 10/11/2012 03:54 PM, Paolo Bonzini wrote:
> Il 11/10/2012 15:45, Avi Kivity ha scritto:
>>>> >> +struct MemoryRegionIOMMUOps {
>>>> >> +    /* Returns a TLB entry that contains a given address. */
>>>> >> +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
>>>> >> +                               bool is_write);
>>>> >> +};
>>> > 
>>> > Do map/unmap still make sense in this model?  Ben & David, what were
>>> > your plans there?
>>> > 
>> Map/unmap is supported via address_space_map(), which calls
>> ->translate().  I don't see how a lower-level map/unmap helps, unless
>> the hardware supplies such a function.
> 
> Yep, it's just the map/unmap callbacks that are not supported anymore,
> but nobody uses that feature of DMAContext yet.

What do those callbacks it even mean?


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:26 ` [Qemu-devel] [RFC v1 3/7] memory: iommu support Avi Kivity
  2012-10-11 13:42   ` Paolo Bonzini
@ 2012-10-11 14:29   ` Avi Kivity
  1 sibling, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 14:29 UTC (permalink / raw)
  To: qemu-devel, Blue Swirl, Anthony Liguori, Michael S. Tsirkin,
	Alex Williamson, liu ping fan, Paolo Bonzini

On 10/11/2012 03:26 PM, Avi Kivity wrote:
> Add a new memory region type that translates addresses it is given,
> then forwards them to a target address space.  This is similar to
> an alias, except that the mapping is more flexible than a linear
> translation and trucation, and also less efficient since the
> translation happens at runtime.
> 
> The implementation uses an AddressSpace mapping the target region to
> avoid hierarchical dispatch all the way to the resolved region; only
> iommu regions are looked up dynamically.
> 
>  
>  typedef struct {
>      void *buffer;
> +    AddressSpace *as;
>      target_phys_addr_t addr;
>      target_phys_addr_t len;
>  } BounceBuffer;
> @@ -3563,23 +3564,42 @@ void *address_space_map(AddressSpace *as,
>      ram_addr_t raddr = RAM_ADDR_MAX;
>      ram_addr_t rlen;
>      void *ret;
> +    IOMMUTLBEntry iotlb;
> +    target_phys_addr_t xlat;
> +    AddressSpace *as_xlat;
>  
>      while (len > 0) {
> +        xlat = addr;
> +        as_xlat = as;
>          page = addr & TARGET_PAGE_MASK;
>          l = (page + TARGET_PAGE_SIZE) - addr;
>          if (l > len)
>              l = len;
>          section = phys_page_find(d, page >> TARGET_PAGE_BITS);
>  
> +        while (section->mr->iommu_ops) {
> +            iotlb = section->mr->iommu_ops->translate(section->mr, addr, is_write);

should be using xlat here, or this fails the second time around.

> +            if (iotlb.valid) {
> +                xlat = ((iotlb.translated_addr & ~iotlb.addr_mask)
> +                        | (addr & iotlb.addr_mask));
> +                as_xlat = section->mr->iommu_target_as;
> +                l = (MIN(xlat + l - 1, xlat | iotlb.addr_mask) - xlat) + 1;
> +                section = phys_page_find(as_xlat->dispatch, xlat >> TARGET_PAGE_BITS);
> +            } else {
> +                section = &phys_sections[phys_section_unassigned];
> +            }
> +        }
> +
>          if (!(memory_region_is_ram(section->mr) && !section->readonly)) {
>              if (todo || bounce.buffer) {
>                  break;
>              }
>              bounce.buffer = qemu_memalign(TARGET_PAGE_SIZE, TARGET_PAGE_SIZE);
> -            bounce.addr = addr;
> +            bounce.addr = xlat;
> +            bounce.as = as_xlat;
>              bounce.len = l;
>              if (!is_write) {
> -                address_space_read(as, addr, bounce.buffer, l);
> +                address_space_read(as_xlat, xlat, bounce.buffer, l);
>              }
>  
>              *plen = l;
> @@ -3587,7 +3607,7 @@ void *address_space_map(AddressSpace *as,
>          }
>          if (!todo) {
>              raddr = memory_region_get_ram_addr(section->mr)
> -                + memory_region_section_addr(section, addr);
> +                + memory_region_section_addr(section, xlat);
>          }
>  
>          len -= l;
> @@ -3632,7 +3652,7 @@ void address_space_unmap(AddressSpace *as, void *buffer, target_phys_addr_t len,
>          return;
>      }
>      if (is_write) {
> -        address_space_write(as, bounce.addr, bounce.buffer, access_len);
> +        address_space_write(bounce.as, bounce.addr, bounce.buffer, access_len);
>      }
>      qemu_vfree(bounce.buffer);
>      bounce.buffer = NULL;
> diff --git a/memory.c b/memory.c
> index 5df6177..1d92bb8 100644
> --- a/memory.c
> +++ b/memory.c
> @@ -775,6 +775,12 @@ static void memory_region_destructor_rom_device(MemoryRegion *mr)
>      qemu_ram_free(mr->ram_addr & TARGET_PAGE_MASK);
>  }
>  
> +static void memory_region_destructor_iommu(MemoryRegion *mr)
> +{
> +    address_space_destroy(mr->iommu_target_as);
> +    g_free(mr->iommu_target_as);
> +}
> +
>  static bool memory_region_wrong_endianness(MemoryRegion *mr)
>  {
>  #ifdef TARGET_WORDS_BIGENDIAN
> @@ -789,6 +795,7 @@ void memory_region_init(MemoryRegion *mr,
>                          uint64_t size)
>  {
>      mr->ops = NULL;
> +    mr->iommu_ops = NULL;
>      mr->parent = NULL;
>      mr->size = int128_make64(size);
>      if (size == UINT64_MAX) {
> @@ -980,6 +987,101 @@ void memory_region_init_rom_device(MemoryRegion *mr,
>      mr->ram_addr = qemu_ram_alloc(size, mr);
>  }
>  
> +static void memory_region_iommu_rw(MemoryRegion *iommu, target_phys_addr_t addr,
> +                                   uint8_t *buf, unsigned len, bool is_write)
> +{
> +    IOMMUTLBEntry tlb;
> +    unsigned clen;
> +    target_phys_addr_t xlat;
> +
> +    while (len) {
> +        tlb = iommu->iommu_ops->translate(iommu, addr, is_write);
> +        clen = (MIN(addr | tlb.addr_mask, addr + len - 1) - addr) + 1;
> +        if (tlb.valid) {
> +            xlat = (tlb.translated_addr & ~tlb.addr_mask) | (addr & tlb.addr_mask);
> +            address_space_rw(iommu->iommu_target_as, xlat, buf, clen, is_write);
> +        } else {
> +            if (!is_write) {
> +                memset(buf, 0xff, clen);
> +            }
> +        }
> +        buf += clen;
> +        addr += clen;
> +        len -= clen;
> +    }
> +}
> +
> +static uint64_t memory_region_iommu_read(void *opaque, target_phys_addr_t addr,
> +                                         unsigned size)
> +{
> +    MemoryRegion *iommu = opaque;
> +    union {
> +        uint8_t buf[8];
> +        uint8_t u8;
> +        uint16_t u16;
> +        uint32_t u32;
> +        uint64_t u64;
> +    } ret;
> +
> +    memory_region_iommu_rw(iommu, addr, ret.buf, size, false);
> +    switch (size) {
> +    case 1: return ret.u8;
> +    case 2: return ret.u16;
> +    case 4: return ret.u32;
> +    case 8: return ret.u64;
> +    default: abort();
> +    }
> +}
> +
> +static void memory_region_iommu_write(void *opaque, target_phys_addr_t addr,
> +                                      uint64_t data, unsigned size)
> +{
> +    MemoryRegion *iommu = opaque;
> +    union {
> +        uint8_t buf[8];
> +        uint8_t u8;
> +        uint16_t u16;
> +        uint32_t u32;
> +        uint64_t u64;
> +    } in;
> +
> +    switch (size) {
> +    case 1: in.u8 = data; break;
> +    case 2: in.u16 = data; break;
> +    case 4: in.u32 = data; break;
> +    case 8: in.u64 = data; break;
> +    default: abort();
> +    }
> +    memory_region_iommu_rw(iommu, addr, in.buf, size, true);
> +}
> +
> +static MemoryRegionOps memory_region_iommu_ops = {
> +    .read = memory_region_iommu_read,
> +    .write = memory_region_iommu_write,
> +#ifdef HOST_BIGENDIAN
> +    .endianness = DEVICE_BIG_ENDIAN,
> +#else
> +    .endianness = DEVICE_LITTLE_ENDIAN,
> +#endif
> +};
> +
> +void memory_region_init_iommu(MemoryRegion *mr,
> +                              MemoryRegionIOMMUOps *ops,
> +                              MemoryRegion *target,
> +                              const char *name,
> +                              uint64_t size)
> +{
> +    memory_region_init(mr, name, size);
> +    mr->ops = &memory_region_iommu_ops;
> +    mr->iommu_ops = ops,
> +    mr->opaque = mr;
> +    mr->terminates = true;  /* then re-forwards */
> +    mr->destructor = memory_region_destructor_iommu;
> +    mr->iommu_target = target;
> +    mr->iommu_target_as = g_new(AddressSpace, 1);
> +    address_space_init(mr->iommu_target_as, target);
> +}
> +
>  static uint64_t invalid_read(void *opaque, target_phys_addr_t addr,
>                               unsigned size)
>  {
> @@ -1053,6 +1155,11 @@ bool memory_region_is_rom(MemoryRegion *mr)
>      return mr->ram && mr->readonly;
>  }
>  
> +bool memory_region_is_iommu(MemoryRegion *mr)
> +{
> +    return mr->iommu_ops;
> +}
> +
>  void memory_region_set_log(MemoryRegion *mr, bool log, unsigned client)
>  {
>      uint8_t mask = 1 << client;
> diff --git a/memory.h b/memory.h
> index 79393f1..299d584 100644
> --- a/memory.h
> +++ b/memory.h
> @@ -113,12 +113,29 @@ struct MemoryRegionOps {
>      const MemoryRegionMmio old_mmio;
>  };
>  
> +typedef struct IOMMUTLBEntry IOMMUTLBEntry;
> +typedef struct MemoryRegionIOMMUOps MemoryRegionIOMMUOps;
> +
> +struct IOMMUTLBEntry {
> +    target_phys_addr_t device_addr;
> +    target_phys_addr_t translated_addr;
> +    target_phys_addr_t addr_mask;  /* 0xfff = 4k translation */
> +    bool valid;
> +};
> +
> +struct MemoryRegionIOMMUOps {
> +    /* Returns a TLB entry that contains a given address. */
> +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
> +                               bool is_write);
> +};
> +
>  typedef struct CoalescedMemoryRange CoalescedMemoryRange;
>  typedef struct MemoryRegionIoeventfd MemoryRegionIoeventfd;
>  
>  struct MemoryRegion {
>      /* All fields are private - violators will be prosecuted */
>      const MemoryRegionOps *ops;
> +    const MemoryRegionIOMMUOps *iommu_ops;
>      void *opaque;
>      MemoryRegion *parent;
>      Int128 size;
> @@ -145,6 +162,8 @@ struct MemoryRegion {
>      uint8_t dirty_log_mask;
>      unsigned ioeventfd_nb;
>      MemoryRegionIoeventfd *ioeventfds;
> +    MemoryRegion *iommu_target;
> +    struct AddressSpace *iommu_target_as;
>  };
>  
>  struct MemoryRegionPortio {
> @@ -334,6 +353,24 @@ void memory_region_init_rom_device(MemoryRegion *mr,
>  void memory_region_init_reservation(MemoryRegion *mr,
>                                      const char *name,
>                                      uint64_t size);
> +
> +/**
> + * memory_region_init_iommu: Initialize a memory region that translates addresses
> + *
> + * An IOMMU region translates addresses and forwards accesses to a target memory region.
> + *
> + * @mr: the #MemoryRegion to be initialized
> + * @ops: a function that translates addresses into the @target region
> + * @target: a #MemoryRegion that will be used to satisfy accesses to translated addresses
> + * @name: used for debugging; not visible to the user or ABI
> + * @size: size of the region.
> + */
> +void memory_region_init_iommu(MemoryRegion *mr,
> +                              MemoryRegionIOMMUOps *ops,
> +                              MemoryRegion *target,
> +                              const char *name,
> +                              uint64_t size);
> +
>  /**
>   * memory_region_destroy: Destroy a memory region and reclaim all resources.
>   *
> @@ -373,6 +410,15 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>  }
>  
>  /**
> + * memory_region_is_ram: check whether a memory region is an iommu
> + *
> + * Returns %true is a memory region is an iommu.
> + *
> + * @mr: the memory region being queried
> + */
> +bool memory_region_is_iommu(MemoryRegion *mr);
> +
> +/**
>   * memory_region_name: get a memory region's name
>   *
>   * Returns the string that was used to initialize the memory region.
> 


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 13:44         ` Avi Kivity
@ 2012-10-11 14:35           ` Michael S. Tsirkin
  2012-10-11 14:35             ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-10-11 14:35 UTC (permalink / raw)
  To: Avi Kivity
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 03:44:10PM +0200, Avi Kivity wrote:
> On 10/11/2012 03:44 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 11, 2012 at 03:34:54PM +0200, Avi Kivity wrote:
> >> On 10/11/2012 03:31 PM, Michael S. Tsirkin wrote:
> >> > On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
> >> >> vhost doesn't support guest iommus yet, indicate it to the user
> >> >> by gently depositing a core on their disk.
> >> >> 
> >> >> Signed-off-by: Avi Kivity <avi@redhat.com>
> >> > 
> >> > Actually there is no problem. virtio bypasses an IOMMU,
> >> > so vhost works fine by writing into guest memory directly.
> >> > 
> >> > So I don't think we need this patch.
> >> 
> >> The pci subsystem should set up the iommu so that it ignores virtio
> >> devices.  If it does, an emulated iommu will not reach vhost.  If it
> >> doesn't, then it will, and the assert() will alert us that we have a bug.
> > 
> > You mean pci subsystem in the guest? I'm pretty sure that's not
> > the case at the moment: iommu is on by default and applies
> > to all devices unless you do something special.
> > I see where you are coming from but it does
> > not look right to break all existing guests.
> 
> No, qemu should configure virtio devices to bypass the iommu, even if it
> is on.

Okay so there will be some API that virtio devices should call
to achieve this?

> > Also - I see no reason to single out vhost - I think same applies with
> > any virtio device, since it doesn't use the DMA API.
> 
> True.
> 
> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 14:35           ` Michael S. Tsirkin
@ 2012-10-11 14:35             ` Avi Kivity
  2012-10-11 15:34               ` Michael S. Tsirkin
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 14:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:

>> No, qemu should configure virtio devices to bypass the iommu, even if it
>> is on.
> 
> Okay so there will be some API that virtio devices should call
> to achieve this?

The iommu should probably call pci_device_bypasses_iommu() to check for
such devices.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 14:35             ` Avi Kivity
@ 2012-10-11 15:34               ` Michael S. Tsirkin
  2012-10-11 15:48                 ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Michael S. Tsirkin @ 2012-10-11 15:34 UTC (permalink / raw)
  To: Avi Kivity
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
> 
> >> No, qemu should configure virtio devices to bypass the iommu, even if it
> >> is on.
> > 
> > Okay so there will be some API that virtio devices should call
> > to achieve this?
> 
> The iommu should probably call pci_device_bypasses_iommu() to check for
> such devices.

So maybe this patch should depend on the introduction of such
an API.

> -- 
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 15:34               ` Michael S. Tsirkin
@ 2012-10-11 15:48                 ` Avi Kivity
  2012-10-11 19:38                   ` Alex Williamson
  2012-10-15  8:44                   ` liu ping fan
  0 siblings, 2 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-11 15:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, liu ping fan, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>> 
>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>> >> is on.
>> > 
>> > Okay so there will be some API that virtio devices should call
>> > to achieve this?
>> 
>> The iommu should probably call pci_device_bypasses_iommu() to check for
>> such devices.
> 
> So maybe this patch should depend on the introduction of such
> an API.

I've dropped it for now.

In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
and the memory listener watches address_space_memory, no iommu there.
vfio needs to change to listen to pci_dev->bus_master_as, and need
special handling for iommu regions (abort for now, type 2 iommu later).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 15:48                 ` Avi Kivity
@ 2012-10-11 19:38                   ` Alex Williamson
  2012-10-15 10:24                     ` Avi Kivity
  2012-10-15  8:44                   ` liu ping fan
  1 sibling, 1 reply; 41+ messages in thread
From: Alex Williamson @ 2012-10-11 19:38 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Anthony Liguori, Paolo Bonzini

On Thu, 2012-10-11 at 17:48 +0200, Avi Kivity wrote:
> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
> >> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
> >> 
> >> >> No, qemu should configure virtio devices to bypass the iommu, even if it
> >> >> is on.
> >> > 
> >> > Okay so there will be some API that virtio devices should call
> >> > to achieve this?
> >> 
> >> The iommu should probably call pci_device_bypasses_iommu() to check for
> >> such devices.
> > 
> > So maybe this patch should depend on the introduction of such
> > an API.
> 
> I've dropped it for now.
> 
> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
> and the memory listener watches address_space_memory, no iommu there.
> vfio needs to change to listen to pci_dev->bus_master_as, and need
> special handling for iommu regions (abort for now, type 2 iommu later).

I don't see how we can ever support an assigned device with the
translate function.  Don't we want a flat address space at run time
anyway?  IOMMU drivers go to pains to make IOTLB updates efficient and
drivers optimize for long running translations, but here we impose a
penalty on every access.  I think we'd be more efficient and better able
to support assigned devices if the per device/bus address space was
updated and flattened when it changes.  Being able to implement an XOR
IOMMU is impressive, but is it practical?  We could be doing much more
practical things like nested device assignment with a flatten
translation ;)  Thanks,

Alex

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/7] IOMMU support
  2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
                   ` (6 preceding siblings ...)
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 7/7] vhost: " Avi Kivity
@ 2012-10-12  2:36 ` Benjamin Herrenschmidt
  2012-10-15 10:45   ` Avi Kivity
  7 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-10-12  2:36 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Thu, 2012-10-11 at 15:26 +0200, Avi Kivity wrote:
> These patches add IOMMU support to the memory core. IOMMUs can be added anywhere in
> the memory hierarchy, and may be arranged in series.

I haven't had a chance to review in details yet, but one thing I noticed
is that you basically have a single read/write protection information
for a translation.

This is a loss of functionality to some extent (well, maybe not from the
existing iommu layer but from what could be done by our HW) in that we
have separate read and write permission bits.

This is actually worth fixing I think. Catching incorrect reads from
write-only regions is probably worth it in term of debugging drivers.

Cheers,
Ben.

> Avi Kivity (7):
>   memory: fix address space initialization/destruction
>   memory: limit sections in the radix tree to the actual address space
>     size
>   memory: iommu support
>   pci: switch iommu to using the memory API
>   i440fx: add an iommu
>   vfio: abort if an emulated iommu is used
>   vhost: abort if an emulated iommu is used
> 
>  exec.c             |  43 ++++++++++++++++++---
>  hw/pci.c           |  59 +++++++++++++++++-----------
>  hw/pci.h           |   7 +++-
>  hw/pci_internals.h |   5 ++-
>  hw/piix_pci.c      |  74 +++++++++++++++++++++++++++++++++++
>  hw/spapr.h         |   2 +
>  hw/spapr_iommu.c   |  35 ++++++++---------
>  hw/spapr_pci.c     |  26 +++++++++++--
>  hw/spapr_pci.h     |   1 +
>  hw/vfio_pci.c      |   2 +
>  hw/vhost.c         |   2 +
>  memory.c           | 110 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  memory.h           |  46 ++++++++++++++++++++++
>  13 files changed, 356 insertions(+), 56 deletions(-)
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:42   ` Paolo Bonzini
  2012-10-11 13:45     ` Avi Kivity
@ 2012-10-12  2:45     ` Benjamin Herrenschmidt
  2012-10-13  9:30       ` Blue Swirl
  1 sibling, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-10-12  2:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Michael S. Tsirkin, qemu-devel, liu ping fan, Blue Swirl,
	Alex Williamson, Avi Kivity, Anthony Liguori, David Gibson

On Thu, 2012-10-11 at 15:42 +0200, Paolo Bonzini wrote:
> Il 11/10/2012 15:26, Avi Kivity ha scritto:
> > +struct MemoryRegionIOMMUOps {
> > +    /* Returns a TLB entry that contains a given address. */
> > +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
> > +                               bool is_write);
> > +};
> 
> Do map/unmap still make sense in this model?  Ben & David, what were
> your plans there?

To keep it under the rug for as long as we could ? :-)

The problem with map and unmap is invalidations. How do you convey to
the devices having done a map that the guest has invalidated a
translation entry.

Especially nasty on PAPR where the invalidation is a hcall which is
assumed to be synchronous.

We have simply not solved the problem for now. The risk due to the
possible access beyond the end of life of a translation is negligible as
long as we are not playing funny mapping tricks with emulated devices
(which we might do with some in the future... but not today) and the
scope of the problem is limited to the guest corrupting itself.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-11 13:57         ` Avi Kivity
@ 2012-10-12  2:51           ` Benjamin Herrenschmidt
  2012-10-15 16:54             ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-10-12  2:51 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, qemu-devel, liu ping fan, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini, David Gibson

On Thu, 2012-10-11 at 15:57 +0200, Avi Kivity wrote:
> >> Map/unmap is supported via address_space_map(), which calls
> >> ->translate().  I don't see how a lower-level map/unmap helps,
> unless
> >> the hardware supplies such a function.
> > 
> > Yep, it's just the map/unmap callbacks that are not supported
> anymore,
> > but nobody uses that feature of DMAContext yet.
> 
> What do those callbacks it even mean? 

Well, the unmap callback was meant for notifying the device that did a
map() that the iommu has invalidated part of that mapping.

The rough idea was that the actual invalidations would be delayed until
all "previous" maps have gone away, which works fine without callbacks
for transcient maps (packet buffers ,etc...) but doesn't for long lived
ones.

So in addition, we would call that callback for devices who own long
lived maps, asking them to dispose of them (and eventually re-try them,
which might or might not fail depending on why the invalidation occurred
in the first place).

The invalidation would still be delayed until the last old map has gone
away, so it's not a synchronous callback, more like a notification to
the device to wakeup & do something.

But in the latest patches that went in, because the whole scheme was too
complex and not really that useful, I ripped out the whole map tracking
etc... I kept the unmap callback API there in case we want to re-do it
more sanely.

When emulating HW iommu's the "invalidation not complete" is easy to
report asynchronously to the guest via a status bit that the guest is
supposdly polling after doing an invalidation request.

On something like synchronous hcalls (PAPR), the idea was to delay the
hcall completion by suspending the cpu who issued it.

A lot of pain for what is essentially a corner case that doesn't happen
in practice... unless we start doing mapping games.

By mapping games, I mean having an emulated device MMIO space being
mapped into user space in a way where the kernel might change the
mapping "live" (for example to point to backup memory as it migrates
thing away, etc...). This kind of stuff typically happens with graphics
where graphic objects can move between memory and vram.

Cheers,
Ben

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API
  2012-10-11 13:27 ` [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API Avi Kivity
  2012-10-11 13:53   ` Paolo Bonzini
@ 2012-10-13  9:13   ` Blue Swirl
  2012-10-15 10:31     ` Avi Kivity
  1 sibling, 1 reply; 41+ messages in thread
From: Blue Swirl @ 2012-10-13  9:13 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 1:27 PM, Avi Kivity <avi@redhat.com> wrote:
> Instead of requesting a DMAContext from the bus implementation, use a
> MemoryRegion.  This can be initialized using memory_region_init_iommu()
> (or memory_region_init_alias() for simple, static translations).
>
> Add a destructor, since setups that have per-device translations will
> need to return a different iommu region for every device.
>
> Signed-off-by: Avi Kivity <avi@redhat.com>
> ---
>  hw/pci.c           | 59 +++++++++++++++++++++++++++++++++---------------------
>  hw/pci.h           |  7 +++++--
>  hw/pci_internals.h |  5 +++--
>  hw/spapr.h         |  2 ++
>  hw/spapr_iommu.c   | 35 ++++++++++++++------------------
>  hw/spapr_pci.c     | 26 ++++++++++++++++++++----
>  hw/spapr_pci.h     |  1 +
>  7 files changed, 84 insertions(+), 51 deletions(-)
>
> diff --git a/hw/pci.c b/hw/pci.c
> index 7adf61b..02e1989 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -274,6 +274,21 @@ int pci_find_domain(const PCIBus *bus)
>      return -1;
>  }
>
> +static MemoryRegion *pci_default_iommu(PCIBus *bus, void *opaque, int devfn)
> +{
> +    MemoryRegion *mr = g_new(MemoryRegion, 1);
> +
> +    /* FIXME: inherit memory region from bus creator */
> +    memory_region_init_alias(mr, "iommu-nop", get_system_memory(), 0, INT64_MAX);
> +    return mr;
> +}
> +
> +static void pci_default_iommu_dtor(MemoryRegion *mr)
> +{
> +    memory_region_destroy(mr);
> +    g_free(mr);
> +}
> +
>  void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
>                           const char *name,
>                           MemoryRegion *address_space_mem,
> @@ -285,6 +300,7 @@ void pci_bus_new_inplace(PCIBus *bus, DeviceState *parent,
>      bus->devfn_min = devfn_min;
>      bus->address_space_mem = address_space_mem;
>      bus->address_space_io = address_space_io;
> +    pci_setup_iommu(bus, pci_default_iommu, pci_default_iommu_dtor, NULL);
>
>      /* host bridge */
>      QLIST_INIT(&bus->child);
> @@ -775,21 +791,16 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev, PCIBus *bus,
>                       PCI_SLOT(devfn), PCI_FUNC(devfn), name, bus->devices[devfn]->name);
>          return NULL;
>      }
> +
>      pci_dev->bus = bus;
> -    if (bus->dma_context_fn) {
> -        pci_dev->dma = bus->dma_context_fn(bus, bus->dma_context_opaque, devfn);
> -    } else {
> -        /* FIXME: Make dma_context_fn use MemoryRegions instead, so this path is
> -         * taken unconditionally */
> -        /* FIXME: inherit memory region from bus creator */
> -        memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
> -                                 get_system_memory(), 0,
> -                                 memory_region_size(get_system_memory()));
> -        memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
> -        address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
> -        pci_dev->dma = g_new(DMAContext, 1);
> -        dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
> -    }
> +    pci_dev->iommu = bus->iommu_fn(bus, bus->iommu_opaque, devfn);
> +    memory_region_init_alias(&pci_dev->bus_master_enable_region, "bus master",
> +                             pci_dev->iommu, 0, memory_region_size(pci_dev->iommu));
> +    memory_region_set_enabled(&pci_dev->bus_master_enable_region, false);
> +    address_space_init(&pci_dev->bus_master_as, &pci_dev->bus_master_enable_region);
> +    pci_dev->dma = g_new(DMAContext, 1);
> +    dma_context_init(pci_dev->dma, &pci_dev->bus_master_as, NULL, NULL, NULL);
> +
>      pci_dev->devfn = devfn;
>      pstrcpy(pci_dev->name, sizeof(pci_dev->name), name);
>      pci_dev->irq_state = 0;
> @@ -843,12 +854,12 @@ static void do_pci_unregister_device(PCIDevice *pci_dev)
>      pci_dev->bus->devices[pci_dev->devfn] = NULL;
>      pci_config_free(pci_dev);
>
> -    if (!pci_dev->bus->dma_context_fn) {
> -        address_space_destroy(&pci_dev->bus_master_as);
> -        memory_region_destroy(&pci_dev->bus_master_enable_region);
> -        g_free(pci_dev->dma);
> -        pci_dev->dma = NULL;
> -    }
> +    address_space_destroy(&pci_dev->bus_master_as);
> +    memory_region_del_subregion(&pci_dev->bus_master_enable_region, pci_dev->iommu);
> +    pci_dev->bus->iommu_dtor_fn(pci_dev->iommu);
> +    memory_region_destroy(&pci_dev->bus_master_enable_region);
> +    g_free(pci_dev->dma);
> +    pci_dev->dma = NULL;
>  }
>
>  static void pci_unregister_io_regions(PCIDevice *pci_dev)
> @@ -2092,10 +2103,12 @@ static void pci_device_class_init(ObjectClass *klass, void *data)
>      k->props = pci_props;
>  }
>
> -void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque)
> +void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
> +                     void *opaque)
>  {
> -    bus->dma_context_fn = fn;
> -    bus->dma_context_opaque = opaque;
> +    bus->iommu_fn = fn;
> +    bus->iommu_dtor_fn = dtor;
> +    bus->iommu_opaque = opaque;
>  }
>
>  static TypeInfo pci_device_type_info = {
> diff --git a/hw/pci.h b/hw/pci.h
> index a65e490..370354a 100644
> --- a/hw/pci.h
> +++ b/hw/pci.h
> @@ -213,6 +213,7 @@ struct PCIDevice {
>      PCIIORegion io_regions[PCI_NUM_REGIONS];
>      AddressSpace bus_master_as;
>      MemoryRegion bus_master_enable_region;
> +    MemoryRegion *iommu;
>      DMAContext *dma;
>
>      /* do not access the following fields */
> @@ -351,9 +352,11 @@ int pci_read_devaddr(Monitor *mon, const char *addr, int *domp, int *busp,
>
>  void pci_device_deassert_intx(PCIDevice *dev);
>
> -typedef DMAContext *(*PCIDMAContextFunc)(PCIBus *, void *, int);
> +typedef MemoryRegion *(*PCIIOMMUFunc)(PCIBus *, void *, int);
> +typedef void (*PCIIOMMUDestructorFunc)(MemoryRegion *mr);
>
> -void pci_setup_iommu(PCIBus *bus, PCIDMAContextFunc fn, void *opaque);
> +void pci_setup_iommu(PCIBus *bus, PCIIOMMUFunc fn, PCIIOMMUDestructorFunc dtor,
> +                     void *opaque);
>
>  static inline void
>  pci_set_byte(uint8_t *config, uint8_t val)
> diff --git a/hw/pci_internals.h b/hw/pci_internals.h
> index c931b64..040508f 100644
> --- a/hw/pci_internals.h
> +++ b/hw/pci_internals.h
> @@ -17,8 +17,9 @@
>
>  struct PCIBus {
>      BusState qbus;
> -    PCIDMAContextFunc dma_context_fn;
> -    void *dma_context_opaque;
> +    PCIIOMMUFunc iommu_fn;
> +    PCIIOMMUDestructorFunc iommu_dtor_fn;
> +    void *iommu_opaque;

Maybe the opaque could be avoided (in later patches) by clever use of
container_of() by the functions to get the parent structure?

>      uint8_t devfn_min;
>      pci_set_irq_fn set_irq;
>      pci_map_irq_fn map_irq;
> diff --git a/hw/spapr.h b/hw/spapr.h
> index ac34a17..452efba1 100644
> --- a/hw/spapr.h
> +++ b/hw/spapr.h
> @@ -334,6 +334,8 @@ typedef struct sPAPRTCE {
>  #define SPAPR_PCI_BASE_LIOBN    0x80000000
>
>  void spapr_iommu_init(void);
> +IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
> +                                  bool is_write);
>  DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size);
>  void spapr_tce_free(DMAContext *dma);
>  int spapr_dma_dt(void *fdt, int node_off, const char *propname,
> diff --git a/hw/spapr_iommu.c b/hw/spapr_iommu.c
> index 54798a3..79e35d1 100644
> --- a/hw/spapr_iommu.c
> +++ b/hw/spapr_iommu.c
> @@ -63,15 +63,11 @@ static sPAPRTCETable *spapr_tce_find_by_liobn(uint32_t liobn)
>      return NULL;
>  }
>
> -static int spapr_tce_translate(DMAContext *dma,
> -                               dma_addr_t addr,
> -                               target_phys_addr_t *paddr,
> -                               target_phys_addr_t *len,
> -                               DMADirection dir)
> +IOMMUTLBEntry spapr_tce_translate(DMAContext *dma, target_phys_addr_t addr,
> +                                  bool is_write)
>  {
>      sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, dma);
> -    enum sPAPRTCEAccess access = (dir == DMA_DIRECTION_FROM_DEVICE)
> -        ? SPAPR_TCE_WO : SPAPR_TCE_RO;
> +    enum sPAPRTCEAccess access = is_write ? SPAPR_TCE_WO : SPAPR_TCE_RO;
>      uint64_t tce;
>
>  #ifdef DEBUG_TCE
> @@ -84,29 +80,27 @@ static int spapr_tce_translate(DMAContext *dma,
>  #ifdef DEBUG_TCE
>          fprintf(stderr, "spapr_tce_translate out of bounds\n");
>  #endif
> -        return -EFAULT;
> +        return (IOMMUTLBEntry) { .valid = false };
>      }
>
>      tce = tcet->table[addr >> SPAPR_TCE_PAGE_SHIFT].tce;
>
>      /* Check TCE */
>      if (!(tce & access)) {
> -        return -EPERM;
> +        return (IOMMUTLBEntry) { .valid = false };
>      }
>
> -    /* How much til end of page ? */
> -    *len = ((~addr) & SPAPR_TCE_PAGE_MASK) + 1;
> -
> -    /* Translate */
> -    *paddr = (tce & ~SPAPR_TCE_PAGE_MASK) |
> -        (addr & SPAPR_TCE_PAGE_MASK);
> -
>  #ifdef DEBUG_TCE
>      fprintf(stderr, " ->  *paddr=0x" TARGET_FMT_plx ", *len=0x"
> -            TARGET_FMT_plx "\n", *paddr, *len);
> +            TARGET_FMT_plx "\n", (tce & ~SPAPR_TCE_PAGE_MASK), SPAPR_TCE_PAGE_MASK + 1);
>  #endif
>
> -    return 0;
> +    return (IOMMUTLBEntry) {
> +        .device_addr = addr & SPAPR_TCE_PAGE_MASK,
> +        .translated_addr = (tce & ~SPAPR_TCE_PAGE_MASK),
> +        .addr_mask = SPAPR_TCE_PAGE_MASK,
> +        .valid = true,
> +    };
>  }
>
>  DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
> @@ -118,7 +112,7 @@ DMAContext *spapr_tce_new_dma_context(uint32_t liobn, size_t window_size)
>      }
>
>      tcet = g_malloc0(sizeof(*tcet));
> -    dma_context_init(&tcet->dma, &address_space_memory, spapr_tce_translate, NULL, NULL);
> +    dma_context_init(&tcet->dma, &address_space_memory, NULL, NULL, NULL);
>
>      tcet->liobn = liobn;
>      tcet->window_size = window_size;
> @@ -253,7 +247,8 @@ int spapr_tcet_dma_dt(void *fdt, int node_off, const char *propname,
>          return 0;
>      }
>
> -    if (iommu->translate == spapr_tce_translate) {
> +    /* FIXME: WHAT?? */
> +    if (true) {
>          sPAPRTCETable *tcet = DO_UPCAST(sPAPRTCETable, dma, iommu);
>          return spapr_dma_dt(fdt, node_off, propname,
>                  tcet->liobn, 0, tcet->window_size);
> diff --git a/hw/spapr_pci.c b/hw/spapr_pci.c
> index 661c05b..24f5b46 100644
> --- a/hw/spapr_pci.c
> +++ b/hw/spapr_pci.c
> @@ -506,14 +506,30 @@ static void spapr_msi_write(void *opaque, target_phys_addr_t addr,
>  /*
>   * PHB PCI device
>   */
> -static DMAContext *spapr_pci_dma_context_fn(PCIBus *bus, void *opaque,
> -                                            int devfn)
> +static MemoryRegion *spapr_pci_dma_iommu_new(PCIBus *bus, void *opaque, int devfn)
>  {
>      sPAPRPHBState *phb = opaque;
>
> -    return phb->dma;
> +    return &phb->iommu;
>  }
>
> +static void spapr_pci_dma_iommu_destroy(MemoryRegion *iommu)
> +{
> +    /* iommu is shared among devices, do nothing */
> +}
> +
> +static IOMMUTLBEntry spapr_phb_translate(MemoryRegion *iommu, target_phys_addr_t addr,
> +                                         bool is_write)
> +{
> +    sPAPRPHBState *phb = container_of(iommu, sPAPRPHBState, iommu);
> +
> +    return spapr_tce_translate(phb->dma, addr, is_write);
> +}
> +
> +static MemoryRegionIOMMUOps spapr_iommu_ops = {
> +    .translate = spapr_phb_translate,
> +};
> +
>  static int spapr_phb_init(SysBusDevice *s)
>  {
>      sPAPRPHBState *sphb = SPAPR_PCI_HOST_BRIDGE(s);
> @@ -576,7 +592,9 @@ static int spapr_phb_init(SysBusDevice *s)
>      sphb->dma_window_start = 0;
>      sphb->dma_window_size = 0x40000000;
>      sphb->dma = spapr_tce_new_dma_context(sphb->dma_liobn, sphb->dma_window_size);
> -    pci_setup_iommu(bus, spapr_pci_dma_context_fn, sphb);
> +    memory_region_init_iommu(&sphb->iommu, &spapr_iommu_ops, get_system_memory(),
> +                             "iommu-spapr", INT64_MAX);
> +    pci_setup_iommu(bus, spapr_pci_dma_iommu_new, spapr_pci_dma_iommu_destroy, sphb);
>
>      QLIST_INSERT_HEAD(&spapr->phbs, sphb, list);
>
> diff --git a/hw/spapr_pci.h b/hw/spapr_pci.h
> index 670dc62..171db45 100644
> --- a/hw/spapr_pci.h
> +++ b/hw/spapr_pci.h
> @@ -45,6 +45,7 @@ typedef struct sPAPRPHBState {
>      target_phys_addr_t mem_win_addr, mem_win_size, io_win_addr, io_win_size;
>      target_phys_addr_t msi_win_addr;
>      MemoryRegion memwindow, iowindow, msiwindow;
> +    MemoryRegion iommu;
>
>      uint32_t dma_liobn;
>      uint64_t dma_window_start;
> --
> 1.7.12
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction
  2012-10-11 13:33     ` Avi Kivity
@ 2012-10-13  9:14       ` Blue Swirl
  0 siblings, 0 replies; 41+ messages in thread
From: Blue Swirl @ 2012-10-13  9:14 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 1:33 PM, Avi Kivity <avi@redhat.com> wrote:
> On 10/11/2012 03:31 PM, Paolo Bonzini wrote:
>> Il 11/10/2012 15:26, Avi Kivity ha scritto:
>>> A couple of fields were left uninitialized.  This was not observed earlier
>>> because all address spaces were statically allocated.  Also free allocation
>>> for those fields.
>>
>> Patch is obvious, but there are at least two alternatives: 1) using
>> memset 2) using a compound initializer.
>
> 3) std::vector<>, which also covers address_space_destroy().

4) g_malloc0()

>
> --
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-12  2:45     ` Benjamin Herrenschmidt
@ 2012-10-13  9:30       ` Blue Swirl
  2012-10-13 11:37         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Blue Swirl @ 2012-10-13  9:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, qemu-devel, liu ping fan, Alex Williamson,
	Avi Kivity, Anthony Liguori, Paolo Bonzini, David Gibson

On Fri, Oct 12, 2012 at 2:45 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Thu, 2012-10-11 at 15:42 +0200, Paolo Bonzini wrote:
>> Il 11/10/2012 15:26, Avi Kivity ha scritto:
>> > +struct MemoryRegionIOMMUOps {
>> > +    /* Returns a TLB entry that contains a given address. */
>> > +    IOMMUTLBEntry (*translate)(MemoryRegion *iommu, target_phys_addr_t addr,
>> > +                               bool is_write);
>> > +};
>>
>> Do map/unmap still make sense in this model?  Ben & David, what were
>> your plans there?
>
> To keep it under the rug for as long as we could ? :-)
>
> The problem with map and unmap is invalidations. How do you convey to
> the devices having done a map that the guest has invalidated a
> translation entry.

Also in Sparc32, IOMMU uses a table in RAM which the guest can change,
so a callback to update the translation tables should be available. On
Sparc64 there's IOTLB but also a fallback to TSB translation table in
memory. We could rely on the guest issuing demaps/flushes when the
memory changes and invalidate the translations then.

>
> Especially nasty on PAPR where the invalidation is a hcall which is
> assumed to be synchronous.
>
> We have simply not solved the problem for now. The risk due to the
> possible access beyond the end of life of a translation is negligible as
> long as we are not playing funny mapping tricks with emulated devices
> (which we might do with some in the future... but not today) and the
> scope of the problem is limited to the guest corrupting itself.
>
> Cheers,
> Ben.
>
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-13  9:30       ` Blue Swirl
@ 2012-10-13 11:37         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-10-13 11:37 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Michael S. Tsirkin, qemu-devel, liu ping fan, Alex Williamson,
	Avi Kivity, Anthony Liguori, Paolo Bonzini, David Gibson

On Sat, 2012-10-13 at 09:30 +0000, Blue Swirl wrote:

> > The problem with map and unmap is invalidations. How do you convey to
> > the devices having done a map that the guest has invalidated a
> > translation entry.
> 
> Also in Sparc32, IOMMU uses a table in RAM which the guest can change,
> so a callback to update the translation tables should be available. On
> Sparc64 there's IOTLB but also a fallback to TSB translation table in
> memory. We could rely on the guest issuing demaps/flushes when the
> memory changes and invalidate the translations then.

Right, the table's in memory on power too, but such tables generally
also have a cache (TLB) with some MMIO based logic to perform
invalidations.

Typically that logic involves a bit to perform a TLB kill and some
status bit to read back to get confirmation that the flush is completed.
In that case we can probably delay that later status bit until all the
maps we kept track of are gone ....

 ... but that means tracking them which is expensive.

Also the IBM iommu's are nasty here... some of them, if we ever emulate
them, actually participate in the fabric coherency protocol and thus
don't require an explicit MMIO for invalidations.

So if we were to emulate such HW we would have to intercept accesses to
the portion of RAM that is configured as an iommu table.

Thankfully we only emulate those machines as "paravirt" with a
hypervisor interface to the iommu (aka TCEs) so we are fine for now.
Also if we ever emulate the real HW, well, the latter models don't do
that anymore (but their MMIO for killing the cache doesn't have a status
bit either, the kill is that the latency of a simple read back is
enough).

Overall, a bloody can of worms... under the rug sounds like a nice place
to leave it for now :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 15:48                 ` Avi Kivity
  2012-10-11 19:38                   ` Alex Williamson
@ 2012-10-15  8:44                   ` liu ping fan
  2012-10-15 10:32                     ` Avi Kivity
  1 sibling, 1 reply; 41+ messages in thread
From: liu ping fan @ 2012-10-15  8:44 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, qemu-devel, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On Thu, Oct 11, 2012 at 11:48 PM, Avi Kivity <avi@redhat.com> wrote:
> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>>>
>>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>>> >> is on.
>>> >
>>> > Okay so there will be some API that virtio devices should call
>>> > to achieve this?
>>>
>>> The iommu should probably call pci_device_bypasses_iommu() to check for
>>> such devices.
>>
>> So maybe this patch should depend on the introduction of such
>> an API.
>
> I've dropped it for now.
>
> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
> and the memory listener watches address_space_memory, no iommu there.

Not quite sure your meaning.  My understanding is that as a pci
device, vhost can lie behind a iommu in topology, which result in the
transaction launched can be snapped by the emulated iommu. BUT we make
a exception for vhost-dev and enforce
address_space_rw(address_space_memory, ..) NOT
address_space_rw(pci_dev->bus_master_as,..)  for vhost device, so we
bypass the iommu.  Right?

Regards,
pingfan
> vfio needs to change to listen to pci_dev->bus_master_as, and need
> special handling for iommu regions (abort for now, type 2 iommu later).
>
> --
> error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-11 19:38                   ` Alex Williamson
@ 2012-10-15 10:24                     ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-15 10:24 UTC (permalink / raw)
  To: Alex Williamson
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Anthony Liguori, Paolo Bonzini

On 10/11/2012 09:38 PM, Alex Williamson wrote:
> On Thu, 2012-10-11 at 17:48 +0200, Avi Kivity wrote:
>> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>> > On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>> >> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>> >> 
>> >> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>> >> >> is on.
>> >> > 
>> >> > Okay so there will be some API that virtio devices should call
>> >> > to achieve this?
>> >> 
>> >> The iommu should probably call pci_device_bypasses_iommu() to check for
>> >> such devices.
>> > 
>> > So maybe this patch should depend on the introduction of such
>> > an API.
>> 
>> I've dropped it for now.
>> 
>> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
>> and the memory listener watches address_space_memory, no iommu there.
>> vfio needs to change to listen to pci_dev->bus_master_as, and need
>> special handling for iommu regions (abort for now, type 2 iommu later).
> 
> I don't see how we can ever support an assigned device with the
> translate function.  

We cannot.

> Don't we want a flat address space at run time
> anyway?  

Not if we want vfio-in-the-guest (for nested virt or OS bypass).

> IOMMU drivers go to pains to make IOTLB updates efficient and
> drivers optimize for long running translations, but here we impose a
> penalty on every access.  I think we'd be more efficient and better able
> to support assigned devices if the per device/bus address space was
> updated and flattened when it changes.  

A flattened address space cannot be efficiently implemented with a
->translate() callback.  Describing the transformed address space
requires walking all the iommu page tables; these can change very
frequently for some use cases, and the io page tables can be built after
the iommu is configured but before dma is initiated, so you have no hook
from which to call ->translate(); and the representation of the address
space can be huge.

> Being able to implement an XOR
> IOMMU is impressive, but is it practical?  

The XOR IOMMU is just a way for me to test and demonstrate the API.

> We could be doing much more
> practical things like nested device assignment with a flatten
> translation ;)  Thanks,

No, a flattened translation is impractical, at least when driven from qemu.

My plans wrt vfio/kvm here are to have memory_region_init_iommu()
provide, in addition to ->translate(), a declarative description of the
translation function.  In practical terms, this means that the API will
receive the name of the spec that the iommu implements:

  MemoryRegionIOMMUOps amd_iommu_v2_ops = {
      .translate = amd_iommu_v2_ops,
      .translation_type = IOMMU_AMD_V2,
  };

qemu-side vfio would then match ->translation_type with what the kernel
provides, and configure the kernel for this type of translation.  As
some v2 hardware supports two levels of translations, all vfio has to do
is to set up the lower translation level to match the guest->host
translation (which it does already), and to set up the upper translation
level to follow the guest configuration.  From then on the hardware does
the rest.

If the hardware supports only one translation level, we may still be
able to implement nested iommu using the same techniques we use for the
processor page tables - shadowing.  kvm would write-protect the iommu
page tables and pass any updates to vfio, which would update the shadow
io page tables that implement the ngpa->gpa->hpa translation.  However
given the complexity and performance problems on one side, and the size
of the niche that nested device assignment serves, we'll probably limit
ourselves to hardware that supports two levels of translations.  If
nested virtualization really takes off we can use shadowing to provide
the guest with emulated hardware that supports two translation level
(the solution above uses host hardware with two levels to expose guest
hardware with one level).

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API
  2012-10-13  9:13   ` Blue Swirl
@ 2012-10-15 10:31     ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-15 10:31 UTC (permalink / raw)
  To: Blue Swirl
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/13/2012 11:13 AM, Blue Swirl wrote:
>>  struct PCIBus {
>>      BusState qbus;
>> -    PCIDMAContextFunc dma_context_fn;
>> -    void *dma_context_opaque;
>> +    PCIIOMMUFunc iommu_fn;
>> +    PCIIOMMUDestructorFunc iommu_dtor_fn;
>> +    void *iommu_opaque;
> 
> Maybe the opaque could be avoided (in later patches) by clever use of
> container_of() by the functions to get the parent structure?

Indeed yes.  This requires replacing pci_bus_new() and
pci_register_bus() by a conventional pci_bus_init().  I think it's the
right direction to move and the memory API always uses caller-managed
allocation and objects, instead of API-managed allocation and
pointers/handles.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 7/7] vhost: abort if an emulated iommu is used
  2012-10-15  8:44                   ` liu ping fan
@ 2012-10-15 10:32                     ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-15 10:32 UTC (permalink / raw)
  To: liu ping fan
  Cc: Michael S. Tsirkin, qemu-devel, Blue Swirl, Alex Williamson,
	Anthony Liguori, Paolo Bonzini

On 10/15/2012 10:44 AM, liu ping fan wrote:
> On Thu, Oct 11, 2012 at 11:48 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>>>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>>>>
>>>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>>>> >> is on.
>>>> >
>>>> > Okay so there will be some API that virtio devices should call
>>>> > to achieve this?
>>>>
>>>> The iommu should probably call pci_device_bypasses_iommu() to check for
>>>> such devices.
>>>
>>> So maybe this patch should depend on the introduction of such
>>> an API.
>>
>> I've dropped it for now.
>>
>> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
>> and the memory listener watches address_space_memory, no iommu there.
> 
> Not quite sure your meaning.  My understanding is that as a pci
> device, vhost can lie behind a iommu in topology, which result in the
> transaction launched can be snapped by the emulated iommu. BUT we make
> a exception for vhost-dev and enforce
> address_space_rw(address_space_memory, ..) NOT
> address_space_rw(pci_dev->bus_master_as,..)  for vhost device, so we
> bypass the iommu.  Right?

The exception is not just for vhost, but for every virtio device.  So
the iommu needs to be aware of that, and if it manages a virtio device,
it needs to provide a 1:1 translation.


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/7] IOMMU support
  2012-10-12  2:36 ` [Qemu-devel] [RFC v1 0/7] IOMMU support Benjamin Herrenschmidt
@ 2012-10-15 10:45   ` Avi Kivity
  2012-10-15 19:52     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 41+ messages in thread
From: Avi Kivity @ 2012-10-15 10:45 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On 10/12/2012 04:36 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2012-10-11 at 15:26 +0200, Avi Kivity wrote:
>> These patches add IOMMU support to the memory core. IOMMUs can be added anywhere in
>> the memory hierarchy, and may be arranged in series.
> 
> I haven't had a chance to review in details yet, but one thing I noticed
> is that you basically have a single read/write protection information
> for a translation.
> 
> This is a loss of functionality to some extent (well, maybe not from the
> existing iommu layer but from what could be done by our HW) in that we
> have separate read and write permission bits.
> 
> This is actually worth fixing I think. Catching incorrect reads from
> write-only regions is probably worth it in term of debugging drivers.
> 

I do have an is_write parameter to translate, in fact I added it in
order to implement the spapr iommu.  Or do you mean something else?



-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 3/7] memory: iommu support
  2012-10-12  2:51           ` Benjamin Herrenschmidt
@ 2012-10-15 16:54             ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-15 16:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, qemu-devel, liu ping fan, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini, David Gibson

On 10/12/2012 04:51 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2012-10-11 at 15:57 +0200, Avi Kivity wrote:
>> >> Map/unmap is supported via address_space_map(), which calls
>> >> ->translate().  I don't see how a lower-level map/unmap helps,
>> unless
>> >> the hardware supplies such a function.
>> > 
>> > Yep, it's just the map/unmap callbacks that are not supported
>> anymore,
>> > but nobody uses that feature of DMAContext yet.
>> 
>> What do those callbacks it even mean? 
> 
> Well, the unmap callback was meant for notifying the device that did a
> map() that the iommu has invalidated part of that mapping.
> 
> The rough idea was that the actual invalidations would be delayed until
> all "previous" maps have gone away, which works fine without callbacks
> for transcient maps (packet buffers ,etc...) but doesn't for long lived
> ones.

Something like the kernel's kvm_read_guest_cached()?  You can then
invalidate translations by incrementing a generation counter.  Problem
is you don't get a simple pointer, but rather something with an API that
needs to be used.

> 
> So in addition, we would call that callback for devices who own long
> lived maps, asking them to dispose of them (and eventually re-try them,
> which might or might not fail depending on why the invalidation occurred
> in the first place).
> 
> The invalidation would still be delayed until the last old map has gone
> away, so it's not a synchronous callback, more like a notification to
> the device to wakeup & do something.
> 
> But in the latest patches that went in, because the whole scheme was too
> complex and not really that useful, I ripped out the whole map tracking
> etc... I kept the unmap callback API there in case we want to re-do it
> more sanely.
> 
> When emulating HW iommu's the "invalidation not complete" is easy to
> report asynchronously to the guest via a status bit that the guest is
> supposdly polling after doing an invalidation request.
> 
> On something like synchronous hcalls (PAPR), the idea was to delay the
> hcall completion by suspending the cpu who issued it.

With the above, you can synchronize using synchronize_rcu().

> 
> A lot of pain for what is essentially a corner case that doesn't happen
> in practice... unless we start doing mapping games.
> 
> By mapping games, I mean having an emulated device MMIO space being
> mapped into user space in a way where the kernel might change the
> mapping "live" (for example to point to backup memory as it migrates
> thing away, etc...). This kind of stuff typically happens with graphics
> where graphic objects can move between memory and vram.
> 

Sounds nasty.  In general we try to avoid special cases during
migration, it's fragile enough.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/7] IOMMU support
  2012-10-15 10:45   ` Avi Kivity
@ 2012-10-15 19:52     ` Benjamin Herrenschmidt
  2012-10-16  9:30       ` Avi Kivity
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2012-10-15 19:52 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On Mon, 2012-10-15 at 12:45 +0200, Avi Kivity wrote:
> > I haven't had a chance to review in details yet, but one thing I
> noticed
> > is that you basically have a single read/write protection
> information
> > for a translation.
> > 
> > This is a loss of functionality to some extent (well, maybe not from
> the
> > existing iommu layer but from what could be done by our HW) in that
> we
> > have separate read and write permission bits.
> > 
> > This is actually worth fixing I think. Catching incorrect reads from
> > write-only regions is probably worth it in term of debugging
> drivers.
> > 
> 
> I do have an is_write parameter to translate, in fact I added it in
> order to implement the spapr iommu.  Or do you mean something else?

Hrm, sort of. "is_write" means you can only express RO vs RW. Two
parameter for read and write allow to express WO. The only difference is
going to be if something does one translate for several R and W.

Not a huge deal, since mostly translate is used for atomic accesses so
one translate = one access... except with map. It's useful to
differenciate the map with 3 states: RO, WO, RW.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [Qemu-devel] [RFC v1 0/7] IOMMU support
  2012-10-15 19:52     ` Benjamin Herrenschmidt
@ 2012-10-16  9:30       ` Avi Kivity
  0 siblings, 0 replies; 41+ messages in thread
From: Avi Kivity @ 2012-10-16  9:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, liu ping fan, qemu-devel, Blue Swirl,
	Alex Williamson, Anthony Liguori, Paolo Bonzini

On 10/15/2012 09:52 PM, Benjamin Herrenschmidt wrote:
>> 
>> I do have an is_write parameter to translate, in fact I added it in
>> order to implement the spapr iommu.  Or do you mean something else?
> 
> Hrm, sort of. "is_write" means you can only express RO vs RW. Two
> parameter for read and write allow to express WO. The only difference is
> going to be if something does one translate for several R and W.
> 
> Not a huge deal, since mostly translate is used for atomic accesses so
> one translate = one access... except with map. It's useful to
> differenciate the map with 3 states: RO, WO, RW.

address_space_rw() and address_space_map() can only by used for one
direction at the time (the first due to its API, the second because of
the need to bounce sometimes), so the current ->translate() signature is
okay for that.  We can consider dropping the is_write parameter and
replacing it with permission bits in IOMMUTLBEntry, I think it's a
little more elegant (but no real advantage unless we start caching
IOMMUTLBEntries).


-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-10-16  9:31 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-10-11 13:26 [Qemu-devel] [RFC v1 0/7] IOMMU support Avi Kivity
2012-10-11 13:26 ` [Qemu-devel] [RFC v1 1/7] memory: fix address space initialization/destruction Avi Kivity
2012-10-11 13:31   ` Paolo Bonzini
2012-10-11 13:33     ` Avi Kivity
2012-10-13  9:14       ` Blue Swirl
2012-10-11 13:26 ` [Qemu-devel] [RFC v1 2/7] memory: limit sections in the radix tree to the actual address space size Avi Kivity
2012-10-11 13:26 ` [Qemu-devel] [RFC v1 3/7] memory: iommu support Avi Kivity
2012-10-11 13:42   ` Paolo Bonzini
2012-10-11 13:45     ` Avi Kivity
2012-10-11 13:54       ` Paolo Bonzini
2012-10-11 13:57         ` Avi Kivity
2012-10-12  2:51           ` Benjamin Herrenschmidt
2012-10-15 16:54             ` Avi Kivity
2012-10-12  2:45     ` Benjamin Herrenschmidt
2012-10-13  9:30       ` Blue Swirl
2012-10-13 11:37         ` Benjamin Herrenschmidt
2012-10-11 14:29   ` Avi Kivity
2012-10-11 13:27 ` [Qemu-devel] [RFC v1 4/7] pci: switch iommu to using the memory API Avi Kivity
2012-10-11 13:53   ` Paolo Bonzini
2012-10-11 13:56     ` Avi Kivity
2012-10-13  9:13   ` Blue Swirl
2012-10-15 10:31     ` Avi Kivity
2012-10-11 13:27 ` [Qemu-devel] [RFC v1 5/7] i440fx: add an iommu Avi Kivity
2012-10-11 13:27 ` [Qemu-devel] [RFC v1 6/7] vfio: abort if an emulated iommu is used Avi Kivity
2012-10-11 13:27 ` [Qemu-devel] [RFC v1 7/7] vhost: " Avi Kivity
2012-10-11 13:31   ` Michael S. Tsirkin
2012-10-11 13:34     ` Avi Kivity
2012-10-11 13:44       ` Michael S. Tsirkin
2012-10-11 13:44         ` Avi Kivity
2012-10-11 14:35           ` Michael S. Tsirkin
2012-10-11 14:35             ` Avi Kivity
2012-10-11 15:34               ` Michael S. Tsirkin
2012-10-11 15:48                 ` Avi Kivity
2012-10-11 19:38                   ` Alex Williamson
2012-10-15 10:24                     ` Avi Kivity
2012-10-15  8:44                   ` liu ping fan
2012-10-15 10:32                     ` Avi Kivity
2012-10-12  2:36 ` [Qemu-devel] [RFC v1 0/7] IOMMU support Benjamin Herrenschmidt
2012-10-15 10:45   ` Avi Kivity
2012-10-15 19:52     ` Benjamin Herrenschmidt
2012-10-16  9:30       ` Avi Kivity

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).